Data Sources
CsvFile
class trustedanalytics.CsvFile(file_name, schema, delimiter=',', skip_header_lines=0)
Define a CSV file.

Attributes

field_names
    Schema field names from the CsvFile class.
field_types
    Schema field types from the CsvFile class.

__init__(file_name, schema, delimiter=',', skip_header_lines=0)
Define a CSV file.
Parameters:
file_name : str
    The name of the file containing data in CSV format. The file must be in the Hadoop file system. Relative paths are interpreted as being relative to the path set in the application configuration file; see Configure File System Root. Absolute paths (beginning with hdfs://..., for example) are also supported.
schema : list of tuples of the form (string, type)
    A description of the fields of data, given as a list of tuples. Each tuple has the form (name, type), where name is a string and type is a supported data type. Upon import of the data, the name becomes the name of a column, so the names must be unique and follow column naming rules. For a list of valid data types, see Data Types. The type ignore may also be used if the field should be ignored on loads.
delimiter : str (optional)
    A string which indicates the separation of the data fields. This is usually a single character, and could be a non-visible character such as a tab. The string must be enclosed in quotes in the command declaration, for example ",".
skip_header_lines : int (optional)
    The number of lines to skip before parsing records.
Returns:
class
    A class which holds both the name and schema of a CSV file.
Notes
Unicode characters should not be used in the column name, because some functions do not support them and will not operate properly.
Examples
Consider a raw data file named ‘raw_data.csv’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three columns, a, b, and c, with the data types int32, int32, and str respectively. The fields of data are separated by commas, and there is no header.
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
Define the data:
>>> csv_schema = [("a", ta.int32), ("b", ta.int32), ("c", str)]
Create a CsvFile object with this schema:
>>> csv_define = ta.CsvFile("data/raw_data.csv", csv_schema)
The default delimiter, a comma, was used to separate fields in the file, so it was not specified. If the columns of data were separated by a character other than a comma, the appropriate delimiter would be specified. For example, if the data columns were separated by the colon character, the instruction would be:
>>> ta.CsvFile("data/raw_data.csv", csv_schema, delimiter=':')
If the data had some lines of header at the beginning of the file, the lines should be skipped:
>>> csv_data = ta.CsvFile("data/raw_data.csv", csv_schema, skip_header_lines=2)
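Once defined, the CsvFile object is passed to a frame to actually load the data (connecting first, as the later examples on this page do):
>>> ta.connect()
>>> frame = ta.Frame(csv_data)
>>> frame.inspect()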
For other examples see Importing a CSV File.
field_names
Schema field names from the CsvFile class.
Returns:
list
    A list of field name strings.
Examples
Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):
>>> csv_class = ta.CsvFile("raw_data.csv", schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_names())
Results:
["col1", "col2"]
field_types
Schema field types from the CsvFile class.
Returns:
list
    A list of field types.
Examples
Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):
>>> csv_class = ta.CsvFile("raw_data.csv",
...                        schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_types())
Results:
[ta.int32, ta.float32]
HiveQuery
class trustedanalytics.HiveQuery(query)
Define the SQL query to retrieve the data from a Hive table.
Methods
__init__(query)
Define the SQL query to retrieve the data from a Hive table.
Only a subset of Hive data types is supported:

Data Type   Support
---------   -----------------------------------
boolean     cast to int
bigint      native support
int         native support
tinyint     cast to int
smallint    cast to int
decimal     cast to double, may lose precision
double      native support
float       native support
date        cast to string
string      native support
timestamp   cast to string
varchar     cast to string
arrays      not supported
binary      not supported
char        not supported
maps        not supported
structs     not supported
union       not supported
Parameters:
query : str
    The SQL query to retrieve the data.
Returns:
class : HiveQuery object
    An object which holds the Hive SQL query.
Examples
Given a Hive table person with columns including name and age, a simple query retrieves just those two columns. First import and connect:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> hive_query = ta.HiveQuery("select name, age from person")
Create a frame using the object:
>>> my_frame = ta.Frame(hive_query)
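As the type-support table above notes, decimal columns are cast to double and may lose precision. One way to keep exact values is to cast explicitly in the query itself; a sketch using a hypothetical payroll table:
>>> precise_query = ta.HiveQuery("select name, cast(salary as string) as salary from payroll")
>>> payroll_frame = ta.Frame(precise_query)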
HBaseTable
class trustedanalytics.HBaseTable(table_name, schema, start_row=None, end_row=None)
Define the object to retrieve the data from an HBase table.

A note on how the various overwrite/append scenarios are supported (see the sketch after this list):

1. Create a simple HBase table from CSV: load the CSV into a frame using the existing frame API, then save the frame into HBase (this creates a table; call it table1).
2. Overwrite an existing table with new data: do scenario 1 to create table1, load the second CSV into a frame, then save the frame into table1 (the old data is gone).
3. Append data to the existing table1: do scenario 1 to create table1, load table1 into frame1, load the CSV into frame2, concatenate frame2 into frame1, then save frame1 into HBase as table1 (overwriting with the initial plus appended data).
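A minimal sketch of the append scenario (scenario 3). The file name data/new_rows.csv is hypothetical, and the frame methods append and export_to_hbase are assumptions about the frame API rather than calls shown on this page:
>>> import trustedanalytics as ta
>>> ta.connect()
>>> # Load the existing HBase table into frame1.
>>> frame1 = ta.Frame(ta.HBaseTable("table1",
...                                 [("cf", "aisle", unicode), ("cf", "row", int)]))
>>> # Load the new CSV into frame2; its column names are assumed to
>>> # line up with frame1's columns.
>>> frame2 = ta.Frame(ta.CsvFile("data/new_rows.csv",
...                              [("aisle", unicode), ("row", ta.int32)]))
>>> frame1.append(frame2)             # concatenate frame2 into frame1
>>> frame1.export_to_hbase("table1")  # save back, overwriting with initial + appended data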
Methods
__init__(table_name, schema, start_row=None, end_row=None)
Define the object to retrieve the data from an HBase table.
Parameters:
table_name : str
    The table name.
schema : list of tuples
    List of tuples of the form (column family, column name, data type for the cell value).
Returns:
class : HBaseTable object
    An object which holds HBase data.
Examples
>>> import trustedanalytics as ta
>>> ta.connect()
>>> h = ta.HBaseTable("my_table", [("pants", "aisle", unicode), ("pants", "row", int),
...                                ("shirts", "aisle", unicode), ("shirts", "row", unicode)])
>>> f = ta.Frame(h)
>>> f.inspect()
JdbcTable
class trustedanalytics.JdbcTable(table_name, connector_type=None, url=None, driver_name=None)
Define the object to retrieve the data from a JDBC table.
Methods
__init__(table_name, connector_type=None, url=None, driver_name=None)
Define the object to retrieve the data from a JDBC table.
Parameters:
table_name : str
    The table name.
connector_type : str (optional)
    The connector type.
url : str (optional)
    JDBC connection string (as a URL).
driver_name : str (optional)
    An optional driver name.
Returns:
class : JdbcTable object
    An object which holds JDBC data.
Examples
>>> import trustedanalytics as ta
>>> ta.connect()
>>> jdbc_table = ta.JdbcTable("SomeTable",
...                           url="jdbc:sqlserver://localhost/SQLExpress;databasename=somedatabase;user=someuser;password=somepassword",
...                           driver_name="com.microsoft.sqlserver.jdbc.SQLServerDriver")
>>> frame = ta.Frame(jdbc_table)
>>> frame.inspect()
JsonFile
class trustedanalytics.JsonFile(file_name)
Define a file as having data in JSON format.

__init__(file_name)
Define a file as having data in JSON format. When JSON files are loaded into the system, all top-level JSON objects are recorded into the frame as separate elements.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
Returns:
class
    An object which holds the name of a JSON file.
Examples
Consider a raw data file named ‘raw_data.json’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three top-level JSON objects, each with a single value called obj. Each object contains the attributes color, size, and shape.
The example JSON file:
{ "obj": { "color": "blue", "size": 3, "shape": "square" } } { "obj": { "color": "green", "size": 7, "shape": "triangle" } } { "obj": { "color": "orange", "size": 10, "shape": "square" } }
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> json_file = ta.JsonFile("data/raw_data.json")
Create a frame using this JsonFile:
>>> my_frame = ta.Frame(json_file)
The frame looks like:
data_lines
/------------------------------------------------------------------/
'{ "obj": { "color": "blue", "size": 3, "shape": "square" } }'
'{ "obj": { "color": "green", "size": 7, "shape": "triangle" } }'
'{ "obj": { "color": "orange", "size": 10, "shape": "square" } }'
Parse values out of the JSON column using the add_columns method:
>>> def parse_my_json(row):
...     import json
...     my_json = json.loads(row[0])
...     obj = my_json['obj']
...     return (obj['color'], obj['size'], obj['shape'])
>>> my_frame.add_columns(parse_my_json, [("color", str), ("size", str),
...                                      ("shape", str)])
The original JSON column is no longer necessary:
>>> my_frame.drop_columns(['data_lines'])
Result:
>>> my_frame.inspect()

  color:str  size:str  shape:str
/-----------------------------------/
  blue       3         square
  green      7         triangle
  orange     10        square
LineFile
class trustedanalytics.LineFile(file_name)
Define a line-separated file.

__init__(file_name)
Define a line-separated file.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
Returns:
class
    A class which holds the name of a line file.
Examples
Given a raw data file ‘rawline_data.txt’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’, consisting of multiple lines separated by the newline character.
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> linefile_class = ta.LineFile("data/rawline_data.txt")
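As with the other data sources on this page, the defined line file can then be used to build a frame (a minimal sketch):
>>> my_frame = ta.Frame(linefile_class)
>>> my_frame.inspect()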
Pandas
class trustedanalytics.Pandas(pandas_frame, schema, row_index=True)
Defines a pandas data source.

Attributes

field_names
    Schema field names.
field_types
    Schema field types.

__init__(pandas_frame, schema, row_index=True)
Defines a pandas data source.
Parameters:
pandas_frame : a pandas dataframe object
schema : list of tuples of the form (string, type)
    Schema description of the fields for a given line. It is a list of tuples which describe each field, (field name, field type), where the field name is a string and the field type is a supported type (see data_types from the atktypes module). Unicode characters should not be used in the column name.
row_index : boolean (optional)
    Indicates if the row index is present in the pandas dataframe and needs to be ignored when looking at the data values. Default value is True.
Returns:
class
    An object which holds both the pandas dataframe and the schema associated with it.
Examples
For this example, we are going to create a 0-5 ratings system with corresponding descriptions. It consists of two columns, rating number and rating description, with the data types int32 and string.
First import trustedanalytics and pandas:
import trustedanalytics as ta
import pandas
Connect:
ta.connect()
Create data:
ratings_data = [[0, "invalid"], [1, "Very Poor"], [2, "Poor"], [3, "Average"],
                [4, "Good"], [5, "Very Good"]]
df = pandas.DataFrame(ratings_data, columns=['rating_id', 'rating_text'])
At this point create a schema that defines the data:
schema = [('rating_id', ta.int32), ('rating_text', unicode)]
Now build a frame from a Pandas data source with this schema:
ratings = ta.Frame(ta.Pandas(df, schema))
To check the result:
ratings.inspect()
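The inspected frame would contain the six ratings rows. The rendering below is an assumption modeled on the inspect output shown in the UploadRows example later on this page; exact formatting varies by version:
[#]  rating_id  rating_text
===========================
[0]          0  invalid
[1]          1  Very Poor
[2]          2  Poor
[3]          3  Average
[4]          4  Good
[5]          5  Very Good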
field_names
Schema field names.
List of field names from the schema stored in the trustedanalytics Pandas data source object.
Returns:
list of string
    Field names.
Examples
For this example, we are going to use a pandas dataframe object, your_pandas. It has two columns, col1 and col2, with types int32 and float32 respectively:
my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
print(my_pandas.field_names())
The output would be:
["col1", "col2"]
field_types
Schema field types.
List of field types from the schema stored in the trustedanalytics Pandas data source object.
Returns:
list of types
    Field types.
Examples
For this example, we are going to use a pandas dataframe object, your_pandas. It has two columns, col1 and col2, with types int32 and float32 respectively:
my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
print(my_pandas.field_types())
The output would be:
[numpy.int32, numpy.float32]
UploadRows
class trustedanalytics.UploadRows(data, schema)
Raw data source for upload: a list of lists plus a schema.

__init__(data, schema)
Data source to upload raw list data.
Parameters:
data : list
    List of lists, where each inner list represents a raw row of data.
schema : list of tuples
    List of tuples of the form (column_name, data_type).
Examples
>>> import trustedanalytics as ta
>>> data = [[1, 'one', [1.0, 1.1]], [2, 'two', [2.0, 2.2]], [3, 'three', [3.0, 3.3]]]
>>> schema = [('n', int), ('s', str), ('v', ta.vector(2))]
>>> frame = ta.Frame(ta.UploadRows(data, schema))
>>> frame.inspect()
[#]  n  s      v
=========================
[0]  1  one    [1.0, 1.1]
[1]  2  two    [2.0, 2.2]
[2]  3  three  [3.0, 3.3]
XmlFile
class trustedanalytics.XmlFile(file_name, tag_name)
Define a file as having data in XML format.

__init__(file_name, tag_name)
Define a file as having data in XML format.
When XML files are loaded into the system, individual records are separated into the highest-level elements found with the specified tag name and placed into a column called data_lines.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
tag_name : str
    Tag name used to determine the split of elements into separate records.
Returns:
class
    An object which holds both the name and tag of an XML file.
Examples
Consider a raw data file named ‘raw_data.xml’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of a root element called shapes with subelements having the tag names square and triangle. Each of these subelements has two potential subelements called name and size. One of the elements has an attribute called color. Additionally, the subelement triangle is not needed, so we can skip it during the import.
The example XML file:
<?xml version="1.0" encoding="UTF-8"?>
<shapes>
    <square>
        <name>left</name>
        <size>3</size>
    </square>
    <triangle>
        <size>3</size>
    </triangle>
    <square color="blue">
        <name>right</name>
        <size>5</size>
    </square>
</shapes>
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> xml_file = ta.XmlFile("data/raw_data.xml", "square")
Create a frame using this XmlFile:
>>> my_frame = ta.Frame(xml_file)
The frame looks like:
data_lines
/------------------------------------------------------------------/
'<square> <name>left</name> <size>3</size> </square>'
'<square color="blue"> <name>right</name> <size>5</size> </square>'
Parse values out of the XML column using the add_columns method:
>>> def parse_my_xml(row):
...     import xml.etree.ElementTree as ET
...     ele = ET.fromstring(row[0])
...     return (ele.get("color"), ele.find("name").text, ele.find("size").text)
>>> my_frame.add_columns(parse_my_xml, [("color", str), ("name", str), ("size", str)])
The original XML column is no longer necessary:
>>> my_frame.drop_columns(['data_lines'])
Result:
>>> my_frame.inspect()

  color:str  name:str  size:str
/----------------------------------/
  None       left      3
  blue       right     5