Data Sources
CsvFile
class trustedanalytics.CsvFile(file_name, schema, delimiter=',', skip_header_lines=0)
Define a CSV file.

Attributes

field_names
    Schema field names from the CsvFile class.
field_types
    Schema field types from the CsvFile class.

__init__(file_name, schema, delimiter=',', skip_header_lines=0)
Define a CSV file.
Parameters:
file_name : str
    The name of the file containing data in CSV format. The file must be in the Hadoop file system. Relative paths are interpreted as being relative to the path set in the application configuration file; see Configure File System Root. Absolute paths (beginning with hdfs://..., for example) are also supported.
schema : list of tuples of the form (string, type)
    A description of the fields of data, given as a list of tuples. Each tuple has the form (name, type), where name is a string and type is a supported data type. Upon import of the data, the name becomes the name of a column, so the names must be unique and follow column naming rules. For a list of valid data types, see Data Types. The type ignore may also be used if the field should be ignored on loads.
delimiter : str (optional)
    A string which indicates the separation of the data fields. This is usually a single character, and could be a non-visible character such as a tab. The string must be enclosed in quotes in the command declaration, for example ",".
skip_header_lines : int (optional)
    The number of lines to skip before parsing records.
Returns:
class
    A class which holds both the name and schema of a CSV file.
Notes
Unicode characters should not be used in the column name, because some functions do not support them and will not operate properly.
Examples
Consider a raw data file named ‘raw_data.csv’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three columns, a, b, and c, with the data types int32, int32, and str respectively. The fields of data are separated by commas, and there is no header.
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
Define the data:
>>> csv_schema = [("a", ta.int32), ("b", ta.int32), ("c", str)]
Create a CsvFile object with this schema:
>>> csv_define = ta.CsvFile("data/raw_data.csv", csv_schema)
The default delimiter, a comma, was used to separate fields in the file, so it was not specified. If the columns of data were separated by a character other than a comma, the appropriate delimiter would be specified. For example, if the data columns were separated by the colon character, the instruction would be:
>>> ta.CsvFile("data/raw_data.csv", csv_schema, delimiter=':')
If the data had some lines of header at the beginning of the file, the lines should be skipped:
>>> csv_data = ta.CsvFile("data/raw_data.csv", csv_schema, skip_header_lines=2)
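Once defined, the CsvFile object is passed to a frame to actually load the data (connecting first, as the later examples on this page do):
>>> ta.connect()
>>> frame = ta.Frame(csv_data)
>>> frame.inspect()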
For other examples see Importing a CSV File.
field_names
Schema field names from the CsvFile class.
Returns:
list
    A list of field name strings.
Examples
Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):
>>> csv_class = ta.CsvFile("raw_data.csv", schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_names())
Results:
["col1", "col2"]
field_types
Schema field types from the CsvFile class.
Returns:
list
    A list of field types.
Examples
Given a raw data file ‘raw_data.csv’ with columns col1 (int32) and col2 (float32):
>>> csv_class = ta.CsvFile("raw_data.csv",
...                        schema=[("col1", ta.int32), ("col2", ta.float32)])
>>> print(csv_class.field_types())
Results:
[ta.int32, ta.float32]
HiveQuery
class trustedanalytics.HiveQuery(query)
Define the SQL query to retrieve the data from a Hive table.
Methods
__init__(query)
Define the SQL query to retrieve the data from a Hive table.
Only a subset of Hive data types is supported:

Data Type   Support
---------   -----------------------------------
boolean     cast to int
bigint      native support
int         native support
tinyint     cast to int
smallint    cast to int
decimal     cast to double, may lose precision
double      native support
float       native support
date        cast to string
string      native support
timestamp   cast to string
varchar     cast to string
arrays      not supported
binary      not supported
char        not supported
maps        not supported
structs     not supported
union       not supported
Parameters:
query : str
    The SQL query to retrieve the data.
Returns:
class : HiveQuery object
    An object which holds the Hive SQL query.
Examples
Given a Hive table person with columns including name and age, a simple query retrieves just those two columns. First import and connect:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> hive_query = ta.HiveQuery("select name, age from person")
Create a frame using the object:
>>> my_frame = ta.Frame(hive_query)
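As the type-support table above notes, decimal columns are cast to double and may lose precision. One way to keep exact values is to cast explicitly in the query itself; a sketch using a hypothetical payroll table:
>>> precise_query = ta.HiveQuery("select name, cast(salary as string) as salary from payroll")
>>> payroll_frame = ta.Frame(precise_query)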
HBaseTable
class trustedanalytics.HBaseTable(table_name, schema, start_row=None, end_row=None)
Define the object to retrieve the data from an HBase table.

A note on how the various overwrite/append scenarios are supported (see the sketch after this list):

1. Create a simple HBase table from CSV: load the CSV into a frame using the existing frame API, then save the frame into HBase (this creates a table; call it table1).
2. Overwrite an existing table with new data: do scenario 1 to create table1, load the second CSV into a frame, then save the frame into table1 (the old data is gone).
3. Append data to the existing table1: do scenario 1 to create table1, load table1 into frame1, load the CSV into frame2, concatenate frame2 into frame1, then save frame1 into HBase as table1 (overwriting with the initial plus appended data).
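A minimal sketch of the append scenario (scenario 3). The file name data/new_rows.csv is hypothetical, and the frame methods append and export_to_hbase are assumptions about the frame API rather than calls shown on this page:
>>> import trustedanalytics as ta
>>> ta.connect()
>>> # Load the existing HBase table into frame1.
>>> frame1 = ta.Frame(ta.HBaseTable("table1",
...                                 [("cf", "aisle", unicode), ("cf", "row", int)]))
>>> # Load the new CSV into frame2; its column names are assumed to
>>> # line up with frame1's columns.
>>> frame2 = ta.Frame(ta.CsvFile("data/new_rows.csv",
...                              [("aisle", unicode), ("row", ta.int32)]))
>>> frame1.append(frame2)             # concatenate frame2 into frame1
>>> frame1.export_to_hbase("table1")  # save back, overwriting with initial + appended data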
Methods
__init__(table_name, schema, start_row=None, end_row=None)
Define the object to retrieve the data from an HBase table.
Parameters:
table_name : str
    The table name.
schema : list of tuples
    List of tuples of the form (column family, column name, data type for the cell value).
Returns:
class : HBaseTable object
    An object which holds HBase data.
Examples
>>> import trustedanalytics as ta
>>> ta.connect()
>>> h = ta.HBaseTable("my_table", [("pants", "aisle", unicode), ("pants", "row", int),
...                                ("shirts", "aisle", unicode), ("shirts", "row", unicode)])
>>> f = ta.Frame(h)
>>> f.inspect()
JdbcTable
class trustedanalytics.JdbcTable(table_name, connector_type=None, url=None, driver_name=None)
Define the object to retrieve the data from a JDBC table.
Methods
__init__(table_name, connector_type=None, url=None, driver_name=None)
Define the object to retrieve the data from a JDBC table.
Parameters:
table_name : str
    The table name.
connector_type : str (optional)
    The connector type.
url : str (optional)
    JDBC connection string (as a URL).
driver_name : str (optional)
    An optional driver name.
Returns:
class : JdbcTable object
    An object which holds JDBC data.
Examples
>>> import trustedanalytics as ta
>>> ta.connect()
>>> jdbc_table = ta.JdbcTable("SomeTable",
...                           url="jdbc:sqlserver://localhost/SQLExpress;databasename=somedatabase;user=someuser;password=somepassword",
...                           driver_name="com.microsoft.sqlserver.jdbc.SQLServerDriver")
>>> frame = ta.Frame(jdbc_table)
>>> frame.inspect()
JsonFile
class trustedanalytics.JsonFile(file_name)
Define a file as having data in JSON format.

__init__(file_name)
Define a file as having data in JSON format. When JSON files are loaded into the system, all top-level JSON objects are recorded into the frame as separate elements.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
Returns:
class
    An object which holds the name of a JSON file.
Examples
Consider a raw data file named ‘raw_data.json’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of three top-level JSON objects, each with a single value called obj. Each object contains the attributes color, size, and shape.
The example JSON file:
{ "obj": { "color": "blue", "size": 3, "shape": "square" } } { "obj": { "color": "green", "size": 7, "shape": "triangle" } } { "obj": { "color": "orange", "size": 10, "shape": "square" } }
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> json_file = ta.JsonFile("data/raw_data.json")
Create a frame using this JsonFile:
>>> my_frame = ta.Frame(json_file)
The frame looks like:
data_lines
/------------------------------------------------------------------/
'{ "obj": { "color": "blue", "size": 3, "shape": "square" } }'
'{ "obj": { "color": "green", "size": 7, "shape": "triangle" } }'
'{ "obj": { "color": "orange", "size": 10, "shape": "square" } }'
Parse values out of the JSON column using the add_columns method:
>>> def parse_my_json(row):
...     import json
...     my_json = json.loads(row[0])
...     obj = my_json['obj']
...     return (obj['color'], obj['size'], obj['shape'])
>>> my_frame.add_columns(parse_my_json, [("color", str), ("size", str),
...                                      ("shape", str)])
The original JSON column is no longer necessary:
>>> my_frame.drop_columns(['data_lines'])
Result:
>>> my_frame.inspect()

  color:str  size:str  shape:str
/-----------------------------------/
  blue       3         square
  green      7         triangle
  orange     10        square
LineFile
class trustedanalytics.LineFile(file_name)
Define a line-separated file.

__init__(file_name)
Define a line-separated file.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
Returns:
class
    A class which holds the name of a line file.
Examples
Given a raw data file ‘rawline_data.txt’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’, consisting of multiple lines separated by the newline character.
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> linefile_class = ta.LineFile("data/rawline_data.txt")
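As with the other data sources on this page, the defined line file can then be used to build a frame (a minimal sketch):
>>> my_frame = ta.Frame(linefile_class)
>>> my_frame.inspect()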
Pandas
class trustedanalytics.Pandas(pandas_frame, schema, row_index=True)
Defines a pandas data source.

Attributes

field_names
    Schema field names.
field_types
    Schema field types.

__init__(pandas_frame, schema, row_index=True)
Defines a pandas data source.
Parameters:
pandas_frame : a pandas dataframe object
schema : list of tuples of the form (string, type)
    Schema description of the fields for a given line. It is a list of tuples which describe each field, (field name, field type), where the field name is a string and the field type is a supported type (see data_types from the atktypes module). Unicode characters should not be used in the column name.
row_index : boolean (optional)
    Indicates if the row index is present in the pandas dataframe and needs to be ignored when looking at the data values. Default value is True.
Returns:
class
    An object which holds both the pandas dataframe and the schema associated with it.
Examples
For this example, we are going to create a 0-5 ratings system with corresponding descriptions. It consists of two columns, rating number and rating description, with the data types int32 and string.
First import trustedanalytics and pandas:
import trustedanalytics as ta
import pandas
Connect:
ta.connect()
Create data:
ratings_data = [[0, "invalid"], [1, "Very Poor"], [2, "Poor"], [3, "Average"],
                [4, "Good"], [5, "Very Good"]]
df = pandas.DataFrame(ratings_data, columns=['rating_id', 'rating_text'])
At this point create a schema that defines the data:
schema = [('rating_id', ta.int32), ('rating_text', unicode)]
Now build a frame from a Pandas data source with this schema:
ratings = ta.Frame(ta.Pandas(df, schema))
To check the result:
ratings.inspect()
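The inspected frame would contain the six ratings rows. The rendering below is an assumption modeled on the inspect output shown in the UploadRows example later on this page; exact formatting varies by version:
[#]  rating_id  rating_text
===========================
[0]          0  invalid
[1]          1  Very Poor
[2]          2  Poor
[3]          3  Average
[4]          4  Good
[5]          5  Very Good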
field_names
Schema field names.
List of field names from the schema stored in the trustedanalytics Pandas data source object.
Returns:
list of string
    Field names.
Examples
For this example, we are going to use a pandas dataframe object, your_pandas. It has two columns, col1 and col2, with types int32 and float32 respectively:
my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
print(my_pandas.field_names())
The output would be:
["col1", "col2"]
field_types
Schema field types.
List of field types from the schema stored in the trustedanalytics Pandas data source object.
Returns:
list of types
    Field types.
Examples
For this example, we are going to use a pandas dataframe object, your_pandas. It has two columns, col1 and col2, with types int32 and float32 respectively:
my_pandas = ta.Pandas(your_pandas, schema=[("col1", ta.int32), ("col2", ta.float32)])
print(my_pandas.field_types())
The output would be:
[numpy.int32, numpy.float32]
UploadRows
class trustedanalytics.UploadRows(data, schema)
Raw data source for upload: a list of lists plus a schema.

__init__(data, schema)
Data source to upload raw list data.
Parameters:
data : list
    List of lists, where each inner list represents a raw row of data.
schema : list of tuples
    List of tuples of the form (column_name, data_type).
Examples
>>> import trustedanalytics as ta
>>> data = [[1, 'one', [1.0, 1.1]], [2, 'two', [2.0, 2.2]], [3, 'three', [3.0, 3.3]]]
>>> schema = [('n', int), ('s', str), ('v', ta.vector(2))]
>>> frame = ta.Frame(ta.UploadRows(data, schema))
>>> frame.inspect()
[#]  n  s      v
=========================
[0]  1  one    [1.0, 1.1]
[1]  2  two    [2.0, 2.2]
[2]  3  three  [3.0, 3.3]
XmlFile
class trustedanalytics.XmlFile(file_name, tag_name)
Define a file as having data in XML format.

__init__(file_name, tag_name)
Define a file as having data in XML format.
When XML files are loaded into the system, individual records are separated into the highest-level elements found with the specified tag name and placed into a column called data_lines.
Parameters:
file_name : str
    Name of the data input file. The file must be in the Hadoop file system. Relative paths are interpreted relative to the trustedanalytics.atk.engine.fs.root configuration. Absolute paths (beginning with hdfs://..., for example) are also supported. See Configure File System Root.
tag_name : str
    Tag name used to determine the split of elements into separate records.
Returns:
class
    An object which holds both the name and tag of an XML file.
Examples
Consider a raw data file named ‘raw_data.xml’, located at ‘hdfs://localhost.localdomain/user/trusted/data/’. It consists of a root element called shapes with subelements having the tag names square and triangle. Each of these subelements has two potential subelements called name and size. One of the elements has an attribute called color. Additionally, the subelement triangle is not needed, so we can skip it during the import.
The example XML file:
<?xml version="1.0" encoding="UTF-8"?>
<shapes>
    <square>
        <name>left</name>
        <size>3</size>
    </square>
    <triangle>
        <size>3</size>
    </triangle>
    <square color="blue">
        <name>right</name>
        <size>5</size>
    </square>
</shapes>
Import the Trusted Analytics Platform:
>>> import trustedanalytics as ta
>>> ta.connect()
Define the data:
>>> xml_file = ta.XmlFile("data/raw_data.xml", "square")
Create a frame using this XmlFile:
>>> my_frame = ta.Frame(xml_file)
The frame looks like:
data_lines
/------------------------------------------------------------------/
'<square> <name>left</name> <size>3</size> </square>'
'<square color="blue"> <name>right</name> <size>5</size> </square>'
Parse values out of the XML column using the add_columns method:
>>> def parse_my_xml(row):
...     import xml.etree.ElementTree as ET
...     ele = ET.fromstring(row[0])
...     return (ele.get("color"), ele.find("name").text, ele.find("size").text)
>>> my_frame.add_columns(parse_my_xml, [("color", str), ("name", str), ("size", str)])
The original XML column is no longer necessary:
>>> my_frame.drop_columns(['data_lines'])
Result:
>>> my_frame.inspect()

  color:str  name:str  size:str
/----------------------------------/
  None       left      3
  blue       right     5