Data Sources

CsvFile

class trustedanalytics.CsvFile(file_name, schema, delimiter=',', skip_header_lines=0)

Define a CSV file.

Attributes

field_names Schema field names from the CsvFile class.
field_types Schema field types from the CsvFile class.
__init__(file_name, schema, delimiter=',', skip_header_lines=0)

Define a CSV file.

Parameters:

file_name : str

The name of the file containing data in a CSV format. The file must be in the Hadoop file system. Relative paths are interpreted as being relative to the path set in the application configuration file. See Configure File System Root. Absolute paths (beginning with hdfs://..., for example) are also supported.

schema : list of tuples of the form (string, type)

A description of the fields of data in the form of a list of tuples, one per field. Each tuple has the form (name, type), where name is a string and type is a supported data type. Upon import of the data, each name becomes the name of a column, so the names must be unique and follow column naming rules. For a list of valid data types, see Data Types. The type ignore may also be used if a field should be skipped on load (see the sketch after this parameter list).

delimiter : str (optional)

A string which indicates the separation of the data fields. It is usually a single character, and may be a non-visible character such as a tab. The string must be enclosed in quotes in the command declaration, for example ",".

skip_header_lines : int (optional)

The number of lines to skip before parsing records.
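For example, a minimal sketch of a schema that skips its middle field on load. The field names are illustrative, and the ignore type is assumed to be exposed as ta.ignore, in keeping with the other data types on this page:

>>> schema = [("user_id", ta.int32),    # imported as column "user_id"
...           ("unused", ta.ignore),    # this field is skipped on load
...           ("name", unicode)]        # imported as column "name"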

Returns:

class

A class which holds both the name and schema of a CSV file.

Notes

Unicode characters should not be used in the column name, because some functions do not support them and will not operate properly.

Examples

Given a raw data file named ‘raw_data.csv’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three columns, a, b, and c, with the data types int32, int32, and str, respectively. The fields of data are separated by commas, and there is no header.

Import the Trusted Analytics Platform:

>>> import trustedanalytics as ta

Define the data:

>>> csv_schema = [("a", ta.int32), ("b", ta.int32), ("c", str)]

Create a CsvFile object with this schema:

>>> csv_define = ta.CsvFile("data/raw_data.csv", csv_schema)

The default delimiter, a comma, was used to separate the fields in the file, so it did not need to be specified. If the columns of data were separated by a character other than a comma, the appropriate delimiter would be specified. For example, if the data columns were separated by the colon character, the instruction would be:

>>> ta.CsvFile("data/raw_data.csv", csv_schema, delimiter=':')

If the data had some lines of header at the beginning of the file, the lines should be skipped:

>>> csv_data = ta.CsvFile("data/raw_data.csv", csv_schema, skip_header_lines=2)
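Once defined, the CsvFile object is passed to a frame constructor in the same way as the other data sources on this page (a minimal sketch):

>>> my_frame = ta.Frame(csv_data)
>>> my_frame.inspect()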

For other examples see Importing a CSV File.

field_names

Schema field names from the CsvFile class.

Returns:

list

A list of field name strings.

Examples

Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):

>>> csv_class = ta.CsvFile("raw_data.csv", schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_names)

Results:

["col1", "col2"]
field_types

Schema field types from the CsvFile class.

Returns:

list

A list of field types.

Examples

Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):

>>> csv_class = ta.CsvFile("raw_data.csv",
...                        schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_types)

Results:

[ta.int32, ta.float32]

HiveQuery

class trustedanalytics.HiveQuery(query)

Define the SQL query to retrieve the data from a Hive table.

Methods

__init__(query)

Define the SQL query to retrieve the data from a Hive table.

Only a subset of Hive data types is supported.

Data Type   Support
---------   ----------------------------------
bigint      native support
int         native support
tinyint     cast to int
smallint    cast to int
boolean     cast to int

double      native support
float       native support
decimal     cast to double; may lose precision

string      native support
varchar     cast to string
date        cast to string
timestamp   cast to string

arrays      not supported
binary      not supported
char        not supported
maps        not supported
structs     not supported
union       not supported

Parameters:

query : str

The SQL query used to retrieve the data.

Returns:

class : HiveQuery object

An object which holds the Hive SQL query.

Examples

Given a Hive table named person with columns including name and age. A simple query to retrieve those two columns could be the following. First, import the package and connect:

>>> import trustedanalytics as ta
>>> ta.connect()

Define the data:

>>> hive_query = ta.HiveQuery("select name, age from person")

Create a frame using the object:

>>> my_frame = ta.Frame(hive_query)
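Columns whose types are listed above as not supported can often still be retrieved by casting them to a supported type within the query itself. A minimal sketch, assuming the hypothetical person table also has a char column named initial:

>>> hive_query = ta.HiveQuery("select name, age, cast(initial as string) as initial from person")
>>> my_frame = ta.Frame(hive_query)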

HBaseTable

class trustedanalytics.HBaseTable(table_name, schema, start_row=None, end_row=None)

Define the object to retrieve the data from an HBase table. A note on how various overwrite/append scenarios are supported (scenario 3 is sketched after this list):

  1. Create a simple HBase table from CSV: load the CSV into a frame using the existing frame API, then save the frame into HBase (this creates a table; call it table1).
  2. Overwrite an existing table with new data: do scenario 1 to create table1, load the second CSV into a frame, then save that frame into table1 (the old data is gone).
  3. Append data to the existing table1: do scenario 1 to create table1, load table1 into frame1, load the CSV into frame2, concatenate frame2 into frame1 (frame1 = frame1 + frame2), then save frame1 into HBase as table1 (overwriting with the initial plus appended data).
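A minimal sketch of scenario 3. The frame methods append and export_to_hbase, and the hbase_schema and csv_schema placeholders, are assumptions named here for illustration rather than APIs documented on this page:

>>> frame1 = ta.Frame(ta.HBaseTable("table1", hbase_schema))        # load table1 into frame1
>>> frame2 = ta.Frame(ta.CsvFile("data/new_rows.csv", csv_schema))  # load the csv into frame2
>>> frame1.append(frame2)                                           # concatenate frame2 into frame1
>>> frame1.export_to_hbase("table1")                                # overwrite with initial + appended data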

Methods

__init__(table_name, schema, start_row=None, end_row=None)

Define the object to retrieve the data from an HBase table.

Parameters:

table_name : str

The name of the HBase table.

schema : list of tuples

A list of tuples of the form (column family, column name, data type for the cell value).

Returns:

class : HBaseTable object

An object which holds the HBase data.

Examples

>>> import trustedanalytics as ta
>>> ta.connect()
>>> h = ta.HBaseTable("my_table", [("pants", "aisle", unicode),
...                                ("pants", "row", int),
...                                ("shirts", "aisle", unicode),
...                                ("shirts", "row", unicode)])
>>> f = ta.Frame(h)
>>> f.inspect()

JdbcTable

class trustedanalytics.JdbcTable(table_name, connector_type=None, url=None, driver_name=None)

Define the object to retrieve the data from a JDBC table.

Methods

__init__(table_name, connector_type=None, url=None, driver_name=None)

Define the object to retrieve the data from a JDBC table.

Parameters:

table_name : str

The table name.

connector_type : str (optional)

The connector type.

url : str (optional)

The JDBC connection string (as a URL).

driver_name : str (optional)

An optional driver name.

Returns:

class : JdbcTable object

An object which holds the JDBC data.

Examples

>>> import trustedanalytics as ta
>>> ta.connect()
>>> jdbcTable = ta.JdbcTable("test",
...                          url="jdbc:sqlserver://localhost/SQLExpress;databaseName=somedatabase;user=someuser;password=somepassword",
...                          driver_name="com.microsoft.sqlserver.jdbc.SQLServerDriver")
>>> frame = ta.Frame(jdbcTable)
>>> frame.inspect()

JsonFile

class trustedanalytics.JsonFile(file_name)

Define a file as having data in JSON format.

__init__(file_name)

Define a file as having data in JSON format. When JSON files are loaded into the system, all top-level JSON objects are recorded in the frame as separate elements.

Parameters:

file_name : str

Name of data input file. File must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.

Returns:

class

An object which holds the name of a JSON file.

Examples

Given a raw data file named ‘raw_data.json’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three top-level JSON objects, each with a single member named obj. Each obj contains the attributes color, size, and shape.

The example JSON file:

{ "obj": {
    "color": "blue",
    "size": 3,
    "shape": "square" }
}
{ "obj": {
    "color": "green",
    "size": 7,
    "shape": "triangle" }
}
{ "obj": {
    "color": "orange",
    "size": 10,
    "shape": "square" }
}

Import the Trusted Analytics Platform:

>>> import trustedanalytics as ta
>>> ta.connect()

Define the data:

>>> json_file = ta.JsonFile("data/raw_data.json")

Create a frame using this JsonFile:

>>> my_frame = ta.Frame(json_file)

The frame looks like:

  data_lines
/------------------------/
  '{ "obj": {
      "color": "blue",
      "size": 3,
      "shape": "square" }
  }'
  '{ "obj": {
      "color": "green",
      "size": 7,
      "shape": "triangle" }
  }'
  '{ "obj": {
      "color": "orange",
      "size": 10,
      "shape": "square" }
  }'

Parse values out of the JSON column using the add_columns method:

>>> def parse_my_json(row):
...     import json
...     my_json = json.loads(row[0])
...     obj = my_json['obj']
...     return (obj['color'], obj['size'], obj['shape'])

>>> my_frame.add_columns(parse_my_json, [("color", str), ("size", str),
... ("shape", str)])

The original JSON column is no longer necessary:

>>> my_frame.drop_columns(['data_lines'])

Result:

>>> my_frame.inspect()

  color:str   size:str    shape:str
/-----------------------------------/
  blue        3           square
  green       7           triangle
  orange      10          square

LineFile

class trustedanalytics.LineFile(file_name)

Define a line-separated file.

__init__(file_name)

Define a line-separated file.

Parameters:

file_name : str

Name of data input file. File must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.

Returns:

class

A class which holds the name of a line file.

Examples

Given a raw data file ‘rawline_data.txt’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of multiple lines separated by newline characters.

Import the Trusted Analytics Platform:

>>> import trustedanalytics as ta
>>> ta.connect()

Define the data:

>>> linefile_class = ta.LineFile("data/rawline_data.txt")
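As with the other data sources on this page, a frame can then be created from this definition (a minimal sketch):

>>> my_frame = ta.Frame(linefile_class)
>>> my_frame.inspect()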

Pandas

class trustedanalytics.Pandas(pandas_frame, schema, row_index=True)

Defines a pandas data source.

Attributes

field_names Schema field names.
field_types Schema field types.
__init__(pandas_frame, schema, row_index=True)

Defines a pandas data source.

Parameters:

pandas_frame : a pandas dataframe object

schema : list of tuples of the form (string, type)

A schema description of the fields for a given line. It is a list of tuples which describe each field in the form (field name, field type), where the field name is a string and the field type is a supported data type (see data_types in the atktypes module). Unicode characters should not be used in column names.

row_index : boolean (optional)

Indicates whether the row index is present in the pandas dataframe and should be ignored when reading the data values. Default value is True.

Returns:

class

An object which holds both the pandas dataframe and schema associated with it.

Examples

For this example, we are going to create a 0-5 ratings system with corresponding descriptions. It consists of two columns, rating number and rating description, with the data types int32 and unicode.

First import trustedanalytics and pandas:

>>> import trustedanalytics as ta
>>> import pandas

Connect:

>>> ta.connect()

Create data:

>>> ratings_data = [[0, "invalid"], [1, "Very Poor"], [2, "Poor"], [3, "Average"], [4, "Good"], [5, "Very Good"]]
>>> df = pandas.DataFrame(ratings_data, columns=['rating_id', 'rating_text'])

At this point create a schema that defines the data:

>>> schema = [('rating_id', ta.int32), ('rating_text', unicode)]

Now build a Pandas data source object with this schema and use it to create a frame:

>>> ratings = ta.Frame(ta.Pandas(df, schema))

To check the result:

>>> ratings.inspect()

field_names

Schema field names.

List of field names from the schema stored in the trustedanalytics pandas dataframe object.

Returns:

list of string

Field names

Examples

For this example, we are going to use a pandas dataframe object your_pandas. It will have two columns, col1 and col2, with types int32 and float32 respectively:

>>> my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(my_pandas.field_names)

The output would be:

["col1", "col2"]
field_types

Schema field types.

List of field types from the schema stored in the trustedanalytics pandas dataframe object.

Returns:

list of types

Field types

Examples

For this example, we are going to use a pandas dataframe object your_pandas. It will have two columns, col1 and col2, with types int32 and float32 respectively:

>>> my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(my_pandas.field_types)

The output would be:

[numpy.int32, numpy.float32]

UploadRows

class trustedanalytics.UploadRows(data, schema)

Raw data source for upload: list of lists + schema

__init__(data, schema)

Data source to upload raw list data

Parameters:

data: list

List of lists, where each item in the list represents a row of raw data.

schema: list of tuples

List of tuples (column_name, data_type)

Examples

>>> import trustedanalytics as ta
>>> data = [[1, 'one', [1.0, 1.1]], [2, 'two', [2.0, 2.2]], [3, 'three', [3.0, 3.3]]]
>>> schema = [('n', int), ('s', str), ('v', ta.vector(2))]
>>> frame = ta.Frame(ta.UploadRows(data, schema))
>>> frame.inspect()
[#]  n  s      v
=========================
[0]  1  one    [1.0, 1.1]
[1]  2  two    [2.0, 2.2]
[2]  3  three  [3.0, 3.3]

XmlFile

class trustedanalytics.XmlFile(file_name, tag_name)

Define a file as having data in XML format.

__init__(file_name, tag_name)

Define a file as having data in XML format.

When XML files are loaded into the system, individual records are separated into the highest level elements found with the specified tag name and placed into a column called data_lines.

Parameters:

file_name : str

Name of data input file. File must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.

tag_name : str

Tag name used to determine the split of elements into separate records.

Returns:

class

An object which holds both the name and tag of an XML file.

Examples

Given a raw data file named ‘raw_data.xml’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of a root element called shapes, with subelements having the tag names square and triangle. Each of these subelements has two potential subelements, called name and size. One of the square elements has an attribute called color. The triangle subelement is not needed, so we skip it during the import.

The example XML file:

<?xml version="1.0" encoding="UTF-8"?>
<shapes>
    <square>
        <name>left</name>
        <size>3</size>
    </square>
    <triangle>
        <size>3</size>
    </triangle>
    <square color="blue">
        <name>right</name>
        <size>5</size>
    </square>
</shapes>

Import the Trusted Analytics Platform:

>>> import trustedanalytics as ta
>>> ta.connect()

Define the data:

>>> xml_file = ta.XmlFile("data/raw_data.xml", "square")

Create a frame using this XmlFile:

>>> my_frame = ta.Frame(xml_file)

The frame looks like:

  data_lines
/------------------------/
  '<square>
        <name>left</name>
        <size>3</size>
   </square>'
  '<square color="blue">
        <name>right</name>
        <size>5</size>
   </square>'

Parse values out of the XML column using the add_columns method:

>>> def parse_my_xml(row):
...     import xml.etree.ElementTree as ET
...     ele = ET.fromstring(row[0])
...     return (ele.get("color"), ele.find("name").text, ele.find("size").text)

>>> my_frame.add_columns(parse_my_xml, [("color", str), ("name", str), ("size", str)])

The original XML column is no longer necessary:

>>> my_frame.drop_columns(['data_lines'])

Result:

>>> my_frame.inspect()

  color:str   name:str    size:str
/----------------------------------/
  None        left        3
  blue        right       5