Warning
This module will very likely be renamed from ds to ‘stores’. Currently there is confusion about whether ‘ds’ means ‘data stores’ or ‘data streams’. There is also another module called streams: processing streams based on nodes connected with pipes.
Data stores provide a common interface for reading from and writing to various structured data stores through structured data streams. They allow you to read a CSV file and merge it with an Excel or Google spreadsheet, then perform cleansing and write the result to a relational database table or create a report.
Data streams can be compared to file-like stream objects, except that structured data is passed instead of bytes. There are two ways to look at structured data: as a set of lists of values or as a set of key-value pairs (a set of dictionaries). Some sources provide one or the other view of the data; some processing is better suited to the list form, other processing to the dictionary form. Brewery lets you use whichever form is most suitable for you.
At any time you can retrieve the stream metadata: the list of fields being streamed. For more information see metadata.
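For example, a minimal sketch of both views of the same data, assuming a file data.csv with a header row and the CSVDataSource class described later in this section (the read_header argument name is an assumption):

    import brewery.ds as ds

    src = ds.CSVDataSource("data.csv", read_header=True)
    src.initialize()

    print(src.fields)               # stream metadata: list of fields being streamed

    for row in src.rows():          # list form: each row is a sequence of values
        print(row)

    src.finalize()

    # The same data in dictionary form: each record maps field names to values
    src = ds.CSVDataSource("data.csv", read_header=True)
    src.initialize()
    for record in src.records():
        print(record)
    src.finalize()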
Data source | Description | Dataset reference |
---|---|---|
csv | Comma separated values (CSV) file/URI resource | file path, file-like object, URL |
xls | MS Excel spreadsheet | file path, URL |
gdoc | Google Spreadsheet | spreadsheet key or name |
sql | Relational database table | connection + table name |
mongodb | MongoDB database collection | connection + collection name |
yamldir | Directory containing yaml files - one file per record | directory |
elasticsearch | Elasticsearch – open source, distributed, RESTful search engine | |
Data sources should implement the rows() and records() reading methods described below.
They should provide the fields property; optionally they might allow assignment of this property.
Data target | Description |
---|---|
csv | Comma separated values (CSV) file/URI resource |
sql | Relational database table |
mongodb | MongoDB database collection |
yamldir | Directory containing yaml files - one file per record |
jsondir | Directory containing json files - one file per record (not yet) |
html | HTML file or a string target |
elasticsearch | Elastic Search – Open Source, Distributed, RESTful, Search Engine |
Data targets should implement the append() method described below.
Use these classes as super classes for your custom structured data sources or data targets.
A data stream object – abstract class.
Subclasses should provide the fields property: fields are FieldList objects representing the fields passed through the receiving stream - either read from a data source (DataSource.rows()) or written to a data target (DataTarget.append()). Subclasses should populate the fields property (or implement an accessor).
Subclasses might override the initialize() and finalize() methods described below.
The class supports context management, for example:
    with ds.CSVDataSource("output.csv") as s:
        for row in s.rows():
            print(row)
In this case, the initialize() and finalize() methods are called automatically.
Subclasses might put finalisation code here, for example closing a file handle or a database connection that was opened in initialize(). The default implementation does nothing.
Delayed stream initialisation code. Subclasses might override this method to implement file or handle opening, connecting to a database, performing web authentication, and so on. By default this method does nothing.
The method does not take any arguments; it expects a pre-configured object.
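A rough sketch of a custom source following this contract is shown below. It is illustrative only and assumes that FieldList can be constructed from a list of field names (its import path may differ between versions):

    import csv
    import brewery.ds as ds
    from brewery.metadata import FieldList   # import path may differ between versions

    class PipeDelimitedDataSource(ds.DataSource):
        """Illustrative source reading a pipe-delimited text file."""

        def __init__(self, path):
            self.path = path
            self.file = None
            self.reader = None
            self.fields = None

        def initialize(self):
            # Delayed initialisation: open the file handle here, not in __init__
            self.file = open(self.path)
            self.reader = csv.reader(self.file, delimiter="|")
            self._names = next(self.reader)        # header row
            self.fields = FieldList(self._names)   # populate the fields property

        def rows(self):
            return iter(self.reader)

        def records(self):
            for row in self.reader:
                yield dict(zip(self._names, row))

        def finalize(self):
            # Finalisation code: release the file handle
            self.file.close()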
Abstract class for data sources.
Read field descriptions from the data source. You should use this for datasets that do not provide metadata directly, such as CSV files, document-based databases or directories with structured files. It does nothing for relational databases, as fields are represented by table columns and table metadata can be obtained from the database easily.
Note that this method can be quite costly, as by default all records within the dataset are read and analysed.
After executing this method, the stream's fields are set to the newly read field list and may be further configured (for example, by setting more appropriate data types).
Returns: a tuple of Field objects. The order of fields is specific to the datastore adapter.
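A hedged usage sketch, assuming the method is exposed as read_fields() (the actual name may differ) and using the MongoDB source described later with illustrative connection details:

    import brewery.ds as ds

    src = ds.MongoDBDataSource(collection="contracts", database="crm")
    src.initialize()

    # Scans the records and infers the field list - can be costly on large datasets
    fields = src.read_fields()

    # The stream fields are now set and may be refined before further processing,
    # for example by assigning more appropriate storage or analytical types.
    print(src.fields)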
Return an iterable object yielding dict objects. This is one of two methods for reading from a data source. Subclasses should implement this method.
Return an iterable object yielding tuples. This is one of two methods for reading from a data source. Subclasses should implement this method.
Abstract class for data targets.
Append an object to the dataset. The object can be a tuple, array or dict. If a tuple or array is used, the value positions should correspond to the field positions in the field list; if a dict is used, the keys should be valid field names.
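For illustration, appending the same record in both forms to a target with fields name, amount and date (the CSV target is described later; the FieldList import path is an assumption):

    import brewery.ds as ds
    from brewery.metadata import FieldList   # import path may differ between versions

    target = ds.CSVDataTarget("output.csv")
    target.fields = FieldList(["name", "amount", "date"])
    target.initialize()

    # Tuple/list form: values must be in the same order as the field list
    target.append(["Alice", 150, "2011-05-01"])

    # Dict form: keys must be valid field names
    target.append({"name": "Alice", "amount": 150, "date": "2011-05-01"})

    target.finalize()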
Creates a CSV data source stream.
Note: avoid auto-detection when you are reading from a remote URL stream.
Initialize the CSV source stream:
1. detect whether the CSV has headers from sample data (if requested)
2. create the CSV reader object
3. read the CSV headers if requested and initialize the stream fields
If fields are explicitly set prior to initialization, and header reading is requested, then the header row is just skipped and fields that were set before are used. Do not set fields if you want to read the header.
All fields are set to storage_type = string and analytical_type = unknown.
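A hedged sketch of both ways of handling the header (the read_header argument name and the FieldList import path are assumptions):

    import brewery.ds as ds
    from brewery.metadata import FieldList   # import path may differ between versions

    # Let the source read fields from the header row:
    src = ds.CSVDataSource("data.csv", read_header=True)
    src.initialize()
    print(src.fields)       # storage_type == "string", analytical_type == "unknown"

    # Or set the fields explicitly - the header row is then only skipped:
    src = ds.CSVDataSource("data.csv", read_header=True)
    src.fields = FieldList(["name", "amount", "date"])
    src.initialize()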
Creates a Google Spreadsheet data source stream.
You should provide either spreadsheet_key or spreadsheet_name; if more than one spreadsheet with the given name is found, the first in the list returned by Google is used.
For worksheet selection you should provide either worksheet_id or worksheet_name. If more than one worksheet with the given name is found, the first in the list returned by Google is used. If neither worksheet_id nor worksheet_name is provided, the first worksheet in the workbook is used.
For details on query string syntax see the section on sq under http://code.google.com/apis/spreadsheets/reference.html#list_Parameters
Connect to the Google documents service and authenticate.
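A hedged sketch of selecting a spreadsheet and worksheet by name; the spreadsheet_name and worksheet_name arguments follow the description above, while the class name and credential arguments are assumptions:

    import brewery.ds as ds

    src = ds.GoogleSpreadsheetDataSource(spreadsheet_name="Budget 2011",
                                         worksheet_name="expenses",
                                         username="user@example.com",
                                         password="secret")
    src.initialize()                    # connects to Google and authenticates
    for record in src.records():
        print(record)
    src.finalize()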
Creates an XLS spreadsheet data source stream.
Initialize the XLS source stream.
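A hedged sketch, assuming the class is XLSDataSource and that a sheet can be selected by index or name (the sheet argument name is an assumption):

    import brewery.ds as ds

    src = ds.XLSDataSource("report.xls", sheet=0)
    src.initialize()
    for row in src.rows():
        print(row)
    src.finalize()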
Creates a relational database data source stream.
Initialize source stream. If the fields are not initialized, then they are read from the table.
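A hedged sketch, assuming the class is SQLDataSource and that it accepts a SQLAlchemy-style URL and a table name (an existing connection object may be passed instead; the argument names are assumptions):

    import brewery.ds as ds

    src = ds.SQLDataSource(url="postgresql://localhost/crm", table="contracts")
    src.initialize()                # fields are read from the table if not set
    for row in src.rows():
        print(row)
    src.finalize()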
Creates a MongoDB data source stream.
Initialize the Mongo source stream.
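A hedged sketch, assuming the class is MongoDBDataSource; the connection details are illustrative:

    import brewery.ds as ds

    src = ds.MongoDBDataSource(collection="contracts", database="crm",
                               host="localhost", port=27017)
    src.initialize()
    for record in src.records():
        print(record)
    src.finalize()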
Creates a YAML directory data source stream.
The data source reads files from a directory and treats each file as a single record. For example, the following directory contains 3 records:
    data/
        contract_0.yml
        contract_1.yml
        contract_2.yml
Optionally, one can specify a field in which the file name will be stored.
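A hedged sketch for the directory above, assuming the class is YamlDirectoryDataSource and that the file-name field is requested with a filename_field argument (the argument name is an assumption):

    import brewery.ds as ds

    src = ds.YamlDirectoryDataSource("data/", filename_field="file")
    src.initialize()
    for record in src.records():    # one record per contract_*.yml file
        print(record)
    src.finalize()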
Creates a CSV data target.
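A hedged sketch of copying one CSV file to another while reusing the source metadata (class names as in this section; the write_headers argument name is an assumption):

    import brewery.ds as ds

    src = ds.CSVDataSource("input.csv", read_header=True)
    src.initialize()

    target = ds.CSVDataTarget("output.csv", write_headers=True)
    target.fields = src.fields          # reuse the source field list
    target.initialize()

    for row in src.rows():
        target.append(row)

    target.finalize()
    src.finalize()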
Creates a relational database data target stream.
Closes the stream and flushes buffered data.
Initialize the target stream.
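A hedged sketch, assuming the class is SQLDataTarget and that table creation can be requested with create/replace arguments (these names, and the FieldList import path, are assumptions):

    import brewery.ds as ds
    from brewery.metadata import FieldList   # import path may differ between versions

    target = ds.SQLDataTarget(url="sqlite:///data.sqlite", table="contracts",
                              create=True, replace=True)
    target.fields = FieldList(["name", "amount"])
    target.initialize()
    target.append({"name": "Alice", "amount": 150})
    target.finalize()               # flushes buffered data and closes the stream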
Creates a MongoDB data target stream.
Initialize the Mongo target stream.
Creates a directory data target with YAML files as records.
Target stream for auditing data values from a stream. For more information about the probed value properties, refer to brewery.dq.FieldStatistics.
Probe row or record and update statistics.
Return field statistics as a dictionary: keys are field names, values are brewery.dq.FieldStatistics objects.
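A hedged sketch of auditing a CSV stream; the auditing target class name (StreamAuditor) and the field_statistics accessor are assumptions based on the description above:

    import brewery.ds as ds

    src = ds.CSVDataSource("data.csv", read_header=True)
    src.initialize()

    auditor = ds.StreamAuditor()
    auditor.fields = src.fields
    auditor.initialize()

    for row in src.rows():
        auditor.append(row)         # probe each row and update statistics

    auditor.finalize()
    src.finalize()

    # Keys are field names, values are brewery.dq.FieldStatistics objects
    for name, stats in auditor.field_statistics.items():
        print(name, stats)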
Creates an HTML data target with simple, naive HTML generation. No package that generates a document node tree is used, just plain string concatenation.
Note: No HTML escaping is done. HTML tags in data might break the output.
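A hedged sketch; the class name (SimpleHTMLDataTarget), its constructor and the FieldList import path are assumptions:

    import brewery.ds as ds
    from brewery.metadata import FieldList   # import path may differ between versions

    target = ds.SimpleHTMLDataTarget("report.html")
    target.fields = FieldList(["name", "amount"])
    target.initialize()
    target.append(["Alice & Bob", 150])   # "&" is written as-is - no HTML escaping
    target.append(["Carol", 90])
    target.finalize()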