Data Quality

Functions and classes for measuring data quality.

Example of auditing a CSV file:

from brewery import ds
from brewery import dq

# Open a data stream
src = ds.CSVDataSource("data.csv")
src.initialize()

# Prepare field statistics
stats = {}
fields = src.field_names

for field in fields:
    stats[field] = dq.FieldStatistics(field)

record_count = 0

# Probe values
for row in src.rows():
    for i, value in enumerate(row):
        stats[fields[i]].probe(value)

    record_count += 1

# Finalize statistics
for stat in stats.values():
    stat.finalize(record_count)

Auditing using brewery.ds.StreamAuditor:

# ... suppose we have initialized source stream as src

# Create auditor stream target and initialize field list
auditor = ds.StreamAuditor()
auditor.fields = src.fields
auditor.initialize()

# Perform audit for each row from source:
for row in src.rows():
    auditor.append(row)

# Finalize results, close files, etc.
auditor.finalize()

# Get the field statistics
stats = auditor.field_statistics
class brewery.dq.FieldStatistics(key=None, distinct_threshold=10)

Data quality statistics for a dataset field

Attributes:
  • field: name of the field for which statistics are being collected
  • value_count: number of records in which the field exists. In a relational database table this is equal to the number of rows; in a document based database, such as MongoDB, it is the number of documents that have the key present (whether null or not)
  • record_count: total count of records in the dataset. This should be set explicitly on finalization; see FieldStatistics.finalize(). In a relational database this is the same as value_count.
  • value_ratio: ratio of value count to record count; 1 for relational databases
  • null_count: number of records where the field is null
  • null_value_ratio: ratio of records with nulls to the total number of probed values = null_count / value_count
  • null_record_ratio: ratio of records with nulls to the total number of records = null_count / record_count
  • empty_string_count: number of empty strings
  • storage_types: list of all encountered storage types (CSV, MongoDB and XLS sources might yield different types within a single field)
  • unique_storage_type: if there is only one storage type, this is set to that type
  • distinct_values: list of collected distinct values
  • distinct_threshold: number of distinct values to collect; if the count of distinct values is greater than the threshold, collection is stopped and distinct_overflow is set. Set to 0 to collect all values. Default is 10.
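The three ratio attributes follow directly from the counters above. A minimal sketch of the arithmetic they describe (illustrative values, not brewery code):

```python
# Illustrative arithmetic for the documented ratios (not brewery source).
# Suppose a dataset of 10 records where the field is present in 8 of
# them, and 2 of the present values are null.
record_count = 10
value_count = 8
null_count = 2

value_ratio = value_count / record_count       # presence of the field
null_value_ratio = null_count / value_count    # nulls among probed values
null_record_ratio = null_count / record_count  # nulls among all records

print(value_ratio, null_value_ratio, null_record_ratio)
```

In a relational table every row has every column, so value_count equals record_count and value_ratio is 1; the two null ratios then coincide.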
dict()

Return a dictionary representation of the receiver.

finalize(record_count=None)

Compute final statistics.

Parameters:
  • record_count: final number of records in the probed dataset.

    See FieldStatistics() for more information.

probe(value)

Probe the value:

  • increase the found value count
  • identify the storage type
  • probe for null and for empty string
  • probe distinct values while their count is less than distinct_threshold. If there are more distinct values than distinct_threshold, the distinct_overflow flag is set and the list of distinct values will be empty
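The probing steps above can be sketched in plain Python. This is a simplified stand-in for brewery's implementation, showing only the counting and distinct-overflow behaviour the list describes (class and attribute names mirror the documentation, but the code is illustrative):

```python
class SimpleFieldProbe:
    """Simplified stand-in for brewery.dq.FieldStatistics probing."""

    def __init__(self, distinct_threshold=10):
        self.distinct_threshold = distinct_threshold
        self.value_count = 0
        self.null_count = 0
        self.empty_string_count = 0
        self.storage_types = set()
        self.distinct_values = set()
        self.distinct_overflow = False

    def probe(self, value):
        self.value_count += 1                         # increase found value count
        self.storage_types.add(type(value).__name__)  # identify storage type
        if value is None:
            self.null_count += 1                      # probe for null
        elif value == "":
            self.empty_string_count += 1              # probe for empty string
        # Collect distinct values until the threshold is exceeded;
        # threshold 0 means "collect everything".
        if not self.distinct_overflow:
            self.distinct_values.add(value)
            if (self.distinct_threshold
                    and len(self.distinct_values) > self.distinct_threshold):
                self.distinct_overflow = True
                self.distinct_values = set()

probe = SimpleFieldProbe(distinct_threshold=3)
for value in ["a", "b", None, "", "a"]:
    probe.probe(value)

print(probe.value_count)          # 5
print(probe.null_count)           # 1
print(probe.empty_string_count)   # 1
print(probe.distinct_overflow)    # True: 4 distinct values > threshold 3
print(probe.distinct_values)      # set() -- emptied on overflow
```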
class brewery.dq.FieldTypeProbe(field)

Probe for guessing a field's data type

Attributes:
  • field: name of the field whose statistics are being presented
  • storage_types: found storage types
  • unique_storage_type: if there is only one storage type, this is set to that type
unique_storage_type

Return the storage type if there is only one. This should always return a type in relational databases, but not necessarily in databases such as MongoDB, where a field may mix types across documents.
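The behaviour can be sketched as follows (a simplified stand-in, not the brewery source):

```python
def unique_storage_type(storage_types):
    """Return the single storage type, or None when the field mixes types.

    Simplified stand-in for FieldTypeProbe.unique_storage_type.
    """
    if len(storage_types) == 1:
        return next(iter(storage_types))
    return None

print(unique_storage_type({"integer"}))            # a relational column
print(unique_storage_type({"integer", "string"}))  # a mixed MongoDB field
```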
