Discovery

Tools to discover patterns in mixed data.

The module provides functions that let you:

analyze the storage and extract high-level information such as field frequency and common structures;
pick the best document class for a dictionary.

dark.discovery.suggest_document_class(data, classes, fit_whole_data=False, require_schema=False)

Returns the best matching document class for the given data dictionary, chosen from the given list of classes. If no class matches the data, returns None.

Parameters:

data – a dictionary.
classes – a list of docu.Document subclasses.
require_schema – If True, classes with empty schemata are discarded. Default is False because a document may have no schema but strict validators.
fit_whole_data – if True, partial structural matches are discarded. Default is False, i.e. it is OK for the document class to only support a subset of fields found in the data dictionary.

The guess is based on: a) how well the declared structure matches the given dictionary, and b) whether the resulting document validates. The classes are sorted by structure similarity, and the first one that validates is picked.
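The "sort by similarity, then take the first valid class" strategy can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the `Candidate` class, its `structure` and `is_valid` members, the `pick_class` function and the similarity score (the number of shared field names) are all hypothetical.

```python
class Candidate(object):
    """Hypothetical stand-in for a document class (not part of docu)."""

    def __init__(self, name, structure, validator=None):
        self.name = name
        self.structure = set(structure)  # declared field names
        self.validator = validator       # optional callable: data -> bool

    def is_valid(self, data):
        return self.validator is None or self.validator(data)


def pick_class(data, candidates, fit_whole_data=False, require_schema=False):
    """Sort candidates by structural similarity, return the first valid one."""
    keys = set(data)
    scored = []
    for cand in candidates:
        if require_schema and not cand.structure:
            continue  # discard classes with an empty schema
        if fit_whole_data and not keys <= cand.structure:
            continue  # discard partial structural matches
        # Assumed similarity measure: number of shared field names.
        scored.append((len(keys & cand.structure), cand))
    # Highest overlap first; list.sort() is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    for _, cand in scored:
        if cand.is_valid(data):
            return cand
    return None


person = Candidate('Person', ['name', 'age'], lambda d: 'name' in d)
note = Candidate('Note', ['text'])
best = pick_class({'name': 'Ann', 'age': 30}, [note, person])
```

Here `best` is the `person` candidate, because it shares two field names with the data and its validator accepts the dictionary.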

dark.discovery.field_frequency(query, having=None, raw=False)

Returns a list of (field name, frequency) pairs for the given query, sorted by frequency (the most frequent field is listed first).

See also print_field_frequency().
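To make the shape of the result concrete, here is a minimal sketch that computes such pairs for a plain list of dictionaries. It is not the library's code: `count_fields` is a hypothetical helper, "frequency" is assumed to mean an absolute count, and the `having` and `raw` arguments are not modelled.

```python
from collections import Counter


def count_fields(documents):
    """Return (field name, count) pairs, most frequent field first."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.keys())
    # Sort by count (descending), then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


docs = [
    {'name': 'Ann', 'age': 30},
    {'name': 'Bob'},
    {'name': 'Eve', 'email': 'eve@example.com'},
]
pairs = count_fields(docs)
# pairs == [('name', 3), ('age', 1), ('email', 1)]
```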

dark.discovery.print_field_frequency(*args, **kwargs)

Prints nicely formatted output of field_frequency().

dark.discovery.suggest_structures(query, having=None)

Analyses all documents in the given database and returns a list of the unique structures found. The usefulness of the result depends on the database: a highly irregular database can produce a very large and almost useless list, while a more or less regular database can be easily inspected with this function.

Parameters:	having – list of field names that must be present in each structure.

Usage example:

from docu import *
from dark import *

db = get_db(backend='docu.ext.tokyo_cabinet', path='foo.tct')
for structure in suggest_structures(db):
    print(structure)
    # build a document class for this structure and count matching records
    doc_cls = document_factory(structure)
    print('  %d entries' % doc_cls.objects(db).count())

See also print_suggest_structures().
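What "unique structures" means can be illustrated without a database. The sketch below treats a structure as the set of field names of a document and assumes that `having` filters out structures lacking the listed fields; `unique_structures` is a hypothetical helper, not the library's implementation.

```python
def unique_structures(documents, having=None):
    """Return the distinct sets of field names found in `documents`."""
    required = frozenset(having or ())
    seen = set()
    for doc in documents:
        structure = frozenset(doc)  # field names only, values are ignored
        if required <= structure:
            seen.add(structure)
    # Sorted lists make the result deterministic and easy to read.
    return sorted(sorted(s) for s in seen)


docs = [
    {'name': 'Ann', 'age': 30},
    {'name': 'Bob', 'age': 41},
    {'title': 'Memo', 'text': '...'},
    {'name': 'Eve'},
]
structures = unique_structures(docs)
# structures == [['age', 'name'], ['name'], ['text', 'title']]
```

With `having=['name']` the same call keeps only the structures that contain a `name` field.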

dark.discovery.print_suggest_structures(*args, **kwargs)

Prints nicely formatted output of suggest_structures().

dark.discovery.document_factory(structure, all_required=True)

Returns a docu.document_base.Document subclass for the given structure, including validators. Please note that each field will get the unicode data type. If you need a more precise

Parameters:

structure – a list of keys. Any iterable will do. If it is a dictionary, only its keys will be used.
all_required – if True (default), all fields in the structure are considered mandatory and the Required validator is added for each of them.

Here’s a use case. Say we have a dictionary data and we need to find all documents with the same fields:

# make a document class with validators
CustomDocument = document_factory(data)
# the validators will generate a query
similar_docs = CustomDocument.objects(storage)

Note that this does not yield documents with exactly the same structure: it is only guaranteed that all fields present in data are also present in these records. Neither does this method guarantee that the data types would match.
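The matching rule described in this note amounts to a subset test on field names. The following sketch shows it on plain dictionaries; `has_all_fields` is a hypothetical helper used only for illustration and does not involve docu or a real storage.

```python
def has_all_fields(record, data):
    """True if every field name in `data` is also present in `record`."""
    return set(data) <= set(record)


data = {'name': 'Ann', 'age': 30}
records = [
    {'name': 'Bob', 'age': 41},                        # same fields
    {'name': 'Eve', 'age': 'unknown', 'city': 'Oslo'},  # extra field, other type
    {'name': 'Dan'},                                   # 'age' is missing
]
similar = [r for r in records if has_all_fields(r, data)]
```

The first two records are kept: the second one has an extra field and a string where data has an integer, yet it still matches because only the presence of field names is checked.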

Dark v0.6.0 documentation
