Data Quality

Functions and classes for measuring data quality.

Example of auditing a CSV file:

from brewery import ds
from brewery import dq

# Open a data stream
src = ds.CSVDataSource("data.csv")
src.initialize()

# Prepare field statistics
stats = {}
fields = src.field_names

for field in fields:
    stats[field] = dq.FieldStatistics(field)

record_count = 0

# Probe values
for row in src.rows():
    for i, value in enumerate(row):
        stats[fields[i]].probe(value)

    record_count += 1

# Finalize statistics: pass the total record count to every field's statistics
for stat in stats.values():
    stat.finalize(record_count)

Auditing using brewery.ds.StreamAuditor:

# ... suppose we have initialized source stream as src

# Create auditor stream target and initialize field list
auditor = ds.StreamAuditor()
auditor.fields = src.fields
auditor.initialize()

# Perform audit for each row from source:
for row in src.rows():
    auditor.append(row)

# Finalize results, close files, etc.
auditor.finalize()

# Get the field statistics
stats = auditor.field_statistics

class brewery.dq.FieldStatistics(key=None, distinct_threshold=10)

Data quality statistics for a dataset field

Attributes:

field: name of the field for which statistics are being collected
value_count: number of records in which the field exists. In a relational database table this equals the number of rows; in a document-based database such as MongoDB, it is the number of documents that have the key present (whether null or not)
record_count: total count of records in the dataset. This should be set explicitly on finalisation; see FieldStatistics.finalize(). In a relational database this should be the same as value_count.
value_ratio: ratio of value count to record count, 1 for relational databases
null_count: number of records where the field is null
null_value_ratio: ratio of records with nulls to the total number of probed values = null_count / value_count
null_record_ratio: ratio of records with nulls to the total number of records = null_count / record_count
empty_string_count: number of empty strings
storage_types: list of all encountered storage types (CSV, MongoDB, XLS might have different types within a field)
unique_storage_type: if there is only one storage type, then this is set to that type
distinct_values: list of collected distinct values
distinct_threshold: number of distinct values to collect. If the count of distinct values is greater than the threshold, collection stops and distinct_overflow is set. Set to 0 to collect all values. Default is 10.
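
The three ratio attributes follow directly from the counts described above. The snippet below is a minimal sketch of that arithmetic, not brewery's implementation: `compute_ratios` is a hypothetical helper, and returning 0.0 for a zero denominator is an assumption made here for illustration.

```python
def compute_ratios(value_count, record_count, null_count):
    """Return (value_ratio, null_value_ratio, null_record_ratio).

    Hypothetical helper, not part of brewery. A zero denominator
    yields 0.0 here, which is an assumption of this sketch.
    """
    def ratio(numerator, denominator):
        return float(numerator) / denominator if denominator else 0.0

    value_ratio = ratio(value_count, record_count)       # values per record
    null_value_ratio = ratio(null_count, value_count)    # nulls per probed value
    null_record_ratio = ratio(null_count, record_count)  # nulls per record
    return value_ratio, null_value_ratio, null_record_ratio


# A document store with 200 records; 150 carry the key and 30 of those are null.
value_ratio, null_value_ratio, null_record_ratio = compute_ratios(
    value_count=150, record_count=200, null_count=30)
```

In a relational table, where every row has the field, value_count equals record_count, so value_ratio is 1 and the two null ratios coincide.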

dict(): Return a dictionary representation of the receiver.

finalize(record_count=None)

Compute final statistics.

Parameters: record_count: final number of records in the probed dataset. See `FieldStatistics()` for more information.

probe(value)

Probe the value:

increase the found value count
identify the storage type
probe for null and for an empty string
probe distinct values: the value is collected while the count of distinct values is less than distinct_threshold. If there are more distinct values than distinct_threshold, the distinct_overflow flag is set and the list of distinct values will be empty
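
The probing steps above can be sketched with a small stand-in class. This is an illustration only and not brewery's code: `ProbeSketch` is a hypothetical name, the Python type name stands in for the storage type, and probed values are assumed to be hashable.

```python
class ProbeSketch(object):
    """Hypothetical stand-in that mirrors the probing steps; not brewery's code."""

    def __init__(self, distinct_threshold=10):
        self.value_count = 0
        self.null_count = 0
        self.empty_string_count = 0
        self.storage_types = set()
        self.distinct_values = set()
        self.distinct_threshold = distinct_threshold
        self.distinct_overflow = False

    def probe(self, value):
        # 1. Increase the found value count
        self.value_count += 1
        # 2. Identify the storage type (assumption: the Python type name)
        self.storage_types.add(type(value).__name__)
        # 3. Probe for null and for an empty string
        if value is None:
            self.null_count += 1
        elif value == "":
            self.empty_string_count += 1
        # 4. Probe distinct values until the threshold is exceeded
        if self.distinct_overflow:
            return
        self.distinct_values.add(value)
        # A threshold of 0 means "collect all values"
        if self.distinct_threshold and len(self.distinct_values) > self.distinct_threshold:
            self.distinct_values = set()
            self.distinct_overflow = True


sketch = ProbeSketch(distinct_threshold=2)
for value in ["a", "b", None, "", "a"]:
    sketch.probe(value)
```

With a threshold of 2, the third distinct value triggers the overflow, so the collected set is emptied while the counters keep running.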

class brewery.dq.FieldTypeProbe(field)

Probe for guessing field data type

Attributes:

field: name of the field for which statistics are being presented
storage_types: found storage types
unique_storage_type: if there is only one storage type, then this is set to that type

unique_storage_type: Return the storage type if there is only one. This should always return a type in relational databases, but does not have to in databases such as MongoDB.
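
As a rough illustration of that rule, the function below picks the single storage type out of the encountered ones. It is a sketch under assumptions, not the library's property: the standalone function and the choice to return None when there are zero or several types are made up for this example.

```python
def unique_storage_type(storage_types):
    """Return the only storage type, or None if there are zero or several.

    Hypothetical standalone version of the rule described above.
    """
    types = set(storage_types)
    if len(types) == 1:
        return next(iter(types))
    return None


# A typed relational column versus a schemaless field with mixed content
relational = unique_storage_type(["integer", "integer", "integer"])
schemaless = unique_storage_type(["integer", "string"])
```

Here `relational` holds the single type name, while `schemaless` is None because the field mixes two storage types.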
