Data Quality
Functions and classes for measuring data quality.
Example of auditing a CSV file:
    from brewery import ds
    from brewery import dq

    # Open a data stream
    src = ds.CSVDataSource("data.csv")
    src.initialize()

    # Prepare field statistics
    stats = {}
    fields = src.field_names

    for field in fields:
        stats[field] = dq.FieldStatistics(field)

    record_count = 0

    # Probe values
    for row in src.rows():
        for i, value in enumerate(row):
            stats[fields[i]].probe(value)
        record_count += 1

    # Finalize statistics
    for stat in stats.values():
        stat.finalize(record_count)
Auditing using brewery.ds.StreamAuditor:
    # ... suppose we have an initialized source stream as src

    # Create the auditor stream target and set up its field list
    auditor = ds.StreamAuditor()
    auditor.fields = src.fields
    auditor.initialize()

    # Perform the audit for each row from the source:
    for row in src.rows():
        auditor.append(row)

    # Finalize results, close files, etc.
    auditor.finalize()

    # Get the field statistics
    stats = auditor.field_statistics
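The collected statistics can then be turned into a simple per-field report. A minimal
sketch, assuming field_statistics is a mapping of field names to FieldStatistics
objects:

    # Print a short data quality summary for each audited field
    # (assumes field_statistics maps field name -> FieldStatistics)
    for name, stat in stats.items():
        print("field: %s" % name)
        print("  values: %d, nulls: %d" % (stat.value_count, stat.null_count))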
class brewery.dq.FieldStatistics(key=None, distinct_threshold=10)
Data quality statistics for a dataset field
Attributes:
- field: name of the field for which statistics are being collected
- value_count: number of records in which the field exists. In a relational database
  table this equals the number of rows; in a document-based database, such as MongoDB,
  it is the number of documents that have the key present (whether null or not)
- record_count: total count of records in the dataset. This should be set explicitly
  on finalization, see FieldStatistics.finalize(). In a relational database this is
  the same as value_count.
- value_ratio: ratio of value count to record count, 1 for relational databases
- null_count: number of records where the field is null
- null_value_ratio: ratio of records with nulls to the total number of probed
  values = null_count / value_count
- null_record_ratio: ratio of records with nulls to the total number of
  records = null_count / record_count
- empty_string_count: number of empty strings
- storage_types: list of all encountered storage types (CSV, MongoDB and XLS sources
  might yield different types within a field)
- unique_storage_type: if there is only one storage type, then this is set to that type
- distinct_values: list of collected distinct values
- distinct_threshold: number of distinct values to collect; if the count of distinct
  values is greater than the threshold, collection is stopped and distinct_overflow
  will be set. Set to 0 to collect all values. Default is 10.
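To make the ratio attributes concrete, here is a minimal sketch that probes a handful
of values directly; the field name and values are made up for illustration, and None
is assumed to count as null:

    from brewery import dq

    stat = dq.FieldStatistics("amount")

    # Probe five values: two nulls and one empty string
    for value in [10, None, 30, None, ""]:
        stat.probe(value)

    # Pass the total number of records explicitly on finalization
    stat.finalize(record_count=5)

    print(stat.null_count)          # expected: 2
    print(stat.null_record_ratio)   # expected: 2 / 5 = 0.4
    print(stat.empty_string_count)  # expected: 1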
dict()
Return dictionary representation of receiver.
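The dictionary form is handy for serializing audit results, for example to JSON. A
sketch, assuming the values returned by dict() are JSON-serializable:

    import json
    from brewery import dq

    stat = dq.FieldStatistics("amount")
    stat.probe(10)
    stat.finalize(record_count=1)

    # Dump the statistics; dict() is assumed here to return only
    # JSON-serializable values
    print(json.dumps(stat.dict()))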
finalize(record_count=None)
Compute final statistics.
Parameters:
- record_count: final number of records in the probed dataset.
  See FieldStatistics() for more information.
probe(value)
Probe the value:
- increase found value count
- identify storage type
- probe for null and for empty string
- probe distinct values: these are collected only while their count is below
  distinct_threshold. If there are more distinct values than distinct_threshold,
  the distinct_overflow flag is set and the list of distinct values will be empty
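The distinct value collection and its overflow can be seen with a low threshold; a
sketch with illustrative values:

    from brewery import dq

    # Collect at most 3 distinct values
    stat = dq.FieldStatistics("city", distinct_threshold=3)

    for value in ["Bratislava", "Prague", "Vienna", "Berlin", "Paris"]:
        stat.probe(value)

    stat.finalize(record_count=5)

    # Five distinct values exceed the threshold of 3, so the collected
    # list is discarded and the overflow flag is set
    print(stat.distinct_overflow)  # expected: set (truthy)
    print(stat.distinct_values)    # expected: empty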
class brewery.dq.FieldTypeProbe(field)
Probe for guessing field data type
Attributes:
- field: name of the field whose statistics are being presented
- storage_types: found storage types
- unique_storage_type: if there is only one storage type, then this is set to that type
unique_storage_type
Return storage type if there is only one. This should always return a type in relational
databases, but does not have to in databases such as MongoDB.
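A short sketch of feeding values to a type probe. Note that the probe() method used
here is an assumption made by analogy with FieldStatistics.probe(); this reference
only lists the probe's attributes:

    from brewery import dq

    probe = dq.FieldTypeProbe("amount")

    # Feed mixed-type values; probe() is assumed, not documented above
    for value in [10, 20.5, "30"]:
        probe.probe(value)

    # With more than one storage type seen, no unique type is reported
    print(probe.storage_types)
    print(probe.unique_storage_type)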