Command Line Tools¶

brewery¶

Tool for running brewery framework functionality from the command line.

Usage:

brewery command [command_options]

Commands are:

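Command	Description
`run`	Run a stream
`graph`	Generate graphviz structure from stream
`nodes`	List available nodes or show information about a node
`pipe`	Create and run a non-branched pipe stream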

`run`¶

Example:

brewery run stream.json

The JSON file should contain a dictionary with nodes and connections.

`graph`¶

Generate a graphviz graph structure.

Example:

brewery graph stream.json > graph.dot
dot -o graph.png -T png graph.dot

`nodes`¶

List available nodes. If a node name is specified, node information, including the list of node attributes, is displayed.

Example:

brewery nodes
brewery nodes csv_source

`pipe`¶

Create and run a non-branched pipe stream. Each argument is either a node or a node attribute. An attribute has the form attribute_name=value and applies to the node that precedes it. At least one node must be defined. If there is no source node, CSV on standard input is assumed. If there is no target node, CSV on standard output is assumed.

Example - audit a CSV:

cat data.csv | brewery pipe audit

Make output nicer:

cat data.csv | brewery pipe audit pretty_printer

Read CSV from a file and store in newly created SQLite database table:

brewery pipe csv_source resource=data.csv \
             sql_table_target \
                url=sqlite:///data.sqlite \
                table=data  \
                create=1 \
                replace=1

Warning

This command is not fully working. There is no type conversion of values, which might cause problems. There is no way to specify non-scalar values (arrays, dictionaries). Some nodes might not have properly implemented attributes, so you might get a non-existing attribute error even if the attribute is there.
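The argument form described for `pipe` can be illustrated with a minimal Python sketch. This is not brewery's actual parser: the function name parse_pipe_args is hypothetical, and the sketch assumes that any argument containing "=" is an attribute of the most recently named node. Values are kept as plain strings, which mirrors the missing type conversion mentioned in the warning.

```python
def parse_pipe_args(args):
    """Group pipe arguments into (node_name, attributes) pairs.

    Hypothetical helper for illustration only; not part of brewery.
    """
    nodes = []
    for arg in args:
        if "=" in arg:
            # attribute_name=value belongs to the preceding node
            if not nodes:
                raise ValueError("attribute %r given before any node" % arg)
            name, value = arg.split("=", 1)
            nodes[-1][1][name] = value  # stays a string: no type conversion
        else:
            # anything without "=" is treated as a node name
            nodes.append((arg, {}))
    if not nodes:
        raise ValueError("at least one node is required")
    return nodes


if __name__ == "__main__":
    for node, attributes in parse_pipe_args(
        ["csv_source", "resource=data.csv",
         "sql_table_target", "url=sqlite:///data.sqlite", "create=1"]
    ):
        print(node, attributes)
```

Note that create=1 arrives as the string "1" rather than the integer 1, which is the kind of problem the warning refers to.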

mongoaudit¶

Audit MongoDB database collections from a data quality perspective.

Usage:

mongoaudit [-h] [-H HOST] [-p PORT] [-t THRESHOLD] [-f {text,json}] database collection

Arguments:

Argument	Description
`-h, --help`	show this help message and exit
`-H HOST, --host HOST`	host with MongoDB server
`-p PORT, --port PORT`	port where MongoDB server is listening
`-t THRESHOLD, --threshold THRESHOLD`	threshold for number of distinct values (default is 10)
`-f {text,json}, --format {text,json}`	output format (default is text)

The threshold is the number of distinct values to collect. If the number of distinct values is greater than the threshold, no more values are collected and distinct_overflow is set. Set the threshold to 0 to collect all values. The default is 10.
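The threshold behaviour can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: the function name collect_distinct is hypothetical, and the sketch keeps the values gathered before the overflow, whereas the tool may handle them differently.

```python
def collect_distinct(values, threshold=10):
    """Collect distinct values until more than `threshold` are seen.

    Returns (distinct_values, distinct_overflow).
    A threshold of 0 means "collect all values".
    Hypothetical helper for illustration only.
    """
    distinct = []
    overflow = False
    for value in values:
        if value in distinct:
            continue
        if threshold and len(distinct) >= threshold:
            # one more distinct value than the threshold allows:
            # stop collecting and flag the overflow
            overflow = True
            break
        distinct.append(value)
    return distinct, overflow


if __name__ == "__main__":
    print(collect_distinct(["spending", "income", "spending"], threshold=10))
    print(collect_distinct(["a", "b", "c"], threshold=2))
```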

Measured values¶

Probe	Description
field	name of the field whose statistics are presented
record_count	total count of records in the dataset
value_count	number of records in which the field exists. In an RDB table this is equal to record_count; in a document-based database such as MongoDB it is the number of documents that have the key present (null or not)
value_ratio	ratio of value count to record count, 1 for relational databases
null_record_ratio	ratio of null value count to record count
null_value_ratio	ratio of null value count to present value count (same as null_record_ratio for relational databases)
null_count	number of records where field is null
null_value_ratio	ratio of records
unique_storage_type	if there is only one storage type, then this is set to that type
distinct_threshold
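The relationship between the count and ratio probes can be shown with a short Python sketch over a list of dictionaries standing in for documents. The function name field_stats and the sample data are hypothetical. The sketch assumes a field is "present" when its key exists in the record and "null" when the stored value is None.

```python
def field_stats(records, field):
    """Compute the count and ratio probes for one field.

    Hypothetical helper for illustration only; not the tool's code.
    """
    record_count = len(records)
    # a value is present when the key exists, whether null or not
    present = [record[field] for record in records if field in record]
    value_count = len(present)
    null_count = sum(1 for value in present if value is None)
    return {
        "field": field,
        "record_count": record_count,
        "value_count": value_count,
        "value_ratio": value_count / record_count if record_count else 0,
        "null_count": null_count,
        "null_record_ratio": null_count / record_count if record_count else 0,
        "null_value_ratio": null_count / value_count if value_count else 0,
    }


if __name__ == "__main__":
    documents = [{"flow": "income"}, {"flow": None}, {}, {"other": 1}]
    print(field_stats(documents, "flow"))
```

For a relational table every record has every column, so value_count equals record_count and the two null ratios coincide.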

Example output¶

Text output:

flow:
    storage type: unicode
    present values: 1257 (10.09%)
    null: 0 (0.00% of records, 0.00% of values)
    empty strings: 0
    distinct values:
            'spending'
            'income'
pdf_link:
    storage type: unicode
    present values: 22 (95.65%)
    null: 0 (0.00% of records, 0.00% of values)
    empty strings: 0

JSON output:

{ ...
    "pdf_link" : {
       "unique_storage_type" : "unicode",
       "value_ratio" : 0.956521739130435,
       "distinct_overflow" : [
          true
       ],
       "key" : "pdf_link",
       "null_value_ratio" : 0,
       "null_record_ratio" : 0,
       "record_count" : 23,
       "storage_types" : [
          "unicode"
       ],
       "distinct_values" : [],
       "empty_string_count" : 0,
       "null_count" : 0,
       "value_count" : 22
    },
    ...
}
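The JSON format is convenient for further processing. As a minimal sketch, the following Python function lists fields that are missing from some records. It assumes the report is a dictionary keyed by field name, as the excerpt above suggests. The function name sparse_fields is hypothetical.

```python
import json


def sparse_fields(report_json, min_ratio=1.0):
    """Return names of fields whose value_ratio is below min_ratio.

    Assumes the report maps field names to probe dictionaries.
    Hypothetical helper for illustration only.
    """
    report = json.loads(report_json)
    return sorted(
        name for name, probes in report.items()
        if probes["value_ratio"] < min_ratio
    )


if __name__ == "__main__":
    sample = json.dumps({
        "pdf_link": {"value_ratio": 0.956521739130435,
                     "value_count": 22, "record_count": 23},
    })
    print(sparse_fields(sample))
```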

Note

This tool will change into a generic data source auditing tool and will support all datastores that brewery supports, such as relational databases or plain structured files.
