:mod:`metadata` --- Information about data structure
====================================================

.. module:: metadata
   :synopsis: data structure, field type details.

When working with structured data it helps to know what the structure looks
like: which fields it has and what their types are.

Field types
-----------

There are two kinds of field types: storage type and analytical type.

The storage type specifies how the value is stored in the source; the type is
normalized. The analytical type is used in data mining: it defines whether the
field can be used by a particular algorithm and how mining algorithms treat
the field.

**Storage types**

.. list-table::
    :header-rows: 1
    :widths: 15 80

    * - Storage Type
      - Description
    * - `string`
      - names, labels, short descriptions; mostly implemented as ``VARCHAR``
        type in a database, or found as CSV file fields
    * - `text`
      - longer texts, long descriptions, articles
    * - `integer`
      - discrete values
    * - `float`
      - numerical value with floating point
    * - `boolean`
      - binary value, mostly implemented as small integer
    * - `date`
      - calendar date representation

**Analytical types**

.. list-table::
    :header-rows: 1
    :widths: 15 80

    * - Analytical Type
      - Description
    * - `set`
      - Values represent categories, like colors or contract types. Fields of
        this type might be numbers that represent, for example, group
        numbers, but have no mathematical interpretation. For example,
        addition of group numbers 1+2 has no meaning.
    * - `ordered_set`
      - Similar to the `set` field type, but values can be put in a
        meaningful order.
    * - `discrete`
      - Set of integers: values can be ordered and one can perform arithmetic
        operations on them, such as 1 contract + 2 contracts = 3 contracts.
    * - `flag`
      - Special case of the `set` type where a value can be one of two
        options, such as 1 or 0, 'yes' or 'no', 'true' or 'false'.
    * - `range`
      - Numerical value, such as a financial amount or a temperature.
    * - `default`
      - The analytical type is not explicitly set and the default type for
        the field's storage type is used. Refer to the list of default
        analytical types below.
    * - `typeless`
      - Field has no analytical relevance.

Default analytical types:

* `integer` is `discrete`
* `float` is `range`
* `unknown`, `string`, `text`, `date` are `typeless`

Fields and Field Lists
----------------------

The main metadata class is ``Field``, which gives information about the name,
types and other useful data attributes. A field might represent a database
column in a SQL database, a key in a dictionary-like record...

.. autoclass:: brewery.metadata.Field

Fields can be compared using the ``==`` and ``!=`` operators. They are equal
if all attributes are equal. Getting the string representation
``str(field)`` of a field returns the field name.

.. code-block:: python

    from brewery.metadata import Field
    str(Field("name"))  # the field name: "name"

In most cases we are dealing with structured data, so we work with multiple
fields and values at once. For that purpose there is ``FieldList``, an ordered
list of field descriptions:

.. autoclass:: brewery.metadata.FieldList

In addition, ``FieldList`` behaves as a list: it implements ``len()``,
``del``, ``[]`` with a field index, ``+=`` for appending fields, and ``+`` for
creating a new field list by concatenating two other lists.

Field lists are used in data sources, data targets, processing streams,
nodes, ... They are mostly present in the form of a ``fields`` attribute (in a
class) or a function parameter with the same name.

To quickly construct a list of fields with all necessary metadata you can do:

.. code-block:: python

    import brewery.metadata as metadata

    fields = metadata.FieldList(["organisation", "address", "type", "amount"])

If you are implementing a function that changes the data structure, do not
change the fields you have received from the source. Make a copy and modify
the copy:

.. code-block:: python

    import brewery.streams as streams
    from brewery.metadata import Field

    class AppendTimestampNode(streams.Node):
        def initialize(self):
            # Create a copy so the source's field list stays untouched
            fields = self.input.fields.copy()

            # Append custom field(s)
            timestamp_field = Field("timestamp", storage_type="date")
            fields.append(timestamp_field)

            self.output_fields = fields

Concrete storage type
---------------------

Each field can specify a `concrete storage type`, the type definition closest
to the real storage. The value of this attribute depends on the backend that
provides field information about a data source or data target. For example,
an SQL backend can use a type class or a type class instance.

The first reason for storing the concrete storage type is to preserve the
type in a homogeneous environment. The second reason is to allow custom
mappings between backend data types. Brewery does not currently perform any
mapping. If the backends are not compatible, the concrete storage type is
simply ignored and the default type derived from the normalized plain
``storage_type`` is used.

Field mapping
-------------

A common operation is renaming fields and dropping unused ones, for example
those that were already transformed. This is sometimes also called field
filtering.

.. autoclass:: brewery.metadata.FieldMap

For example, suppose we need the following field mapping/filtering:

.. image:: field_map.png
    :align: center
    :width: 400px

.. code-block:: python

    import brewery.metadata as metadata

    fields = metadata.FieldList(["d_org", "d_addr", "type", "amount"])

    # Named field_map to avoid shadowing the built-in map()
    field_map = metadata.FieldMap(rename={"d_org": "organisation", "d_addr": "address"},
                                  drop=["type"])
    mapped_fields = field_map.map(fields)

    print(mapped_fields.names())
    # mapped_fields now holds the fields: ['organisation', 'address', 'amount']

To apply a field mapping to a row (list, tuple), there is ``RowFieldFilter``.
The following example shows how to filter fields from a list of rows:

.. code-block:: python

    # Assume that ``rows`` have the structure specified by ``fields``
    # in the previous example
    row_filter = field_map.row_filter(fields)

    output = []

    for row in rows:
        output.append(row_filter.filter(row))

    # Output rows contain only the fields listed in ``mapped_fields``
    # from the previous example

.. autoclass:: brewery.metadata.RowFieldFilter
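
The renaming, dropping and row filtering described above can be illustrated
without Brewery. The following is a minimal plain-Python sketch of the idea
only: the helper ``build_row_filter`` and the sample row are hypothetical and
are not part of the Brewery API, and nothing here claims to mirror how
``FieldMap`` or ``RowFieldFilter`` are implemented internally.

.. code-block:: python

    def build_row_filter(names, rename=None, drop=None):
        """Return (mapped_names, filter_row) for the given field names.

        Hypothetical helper for illustration; not part of Brewery.
        """
        rename = rename or {}
        drop = set(drop or [])

        # Indexes of the fields that survive the mapping, in original order
        keep = [i for i, name in enumerate(names) if name not in drop]

        # Renamed fields get their new name, all others keep the old one
        mapped_names = [rename.get(names[i], names[i]) for i in keep]

        def filter_row(row):
            # Pick only the kept positions from a list or tuple row
            return [row[i] for i in keep]

        return mapped_names, filter_row


    names = ["d_org", "d_addr", "type", "amount"]
    mapped_names, filter_row = build_row_filter(
        names,
        rename={"d_org": "organisation", "d_addr": "address"},
        drop=["type"],
    )

    # Sample data invented for this sketch
    rows = [("ACME", "Main Street 1", "grant", 1000)]
    output = [filter_row(row) for row in rows]

    print(mapped_names)  # ['organisation', 'address', 'amount']
    print(output)        # [['ACME', 'Main Street 1', 1000]]

Computing the kept indexes once and reusing them for every row keeps the
per-row work to a simple positional lookup, which matters when the same
mapping is applied to many rows.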