.. _api-index:

=================
API Documentation
=================

:Release: |version|
:Date: |today|

This is the API documentation for wormtable. The documentation currently
concentrates on the read API, since the initial release is intended primarily
for use with VCF data. For details on how to build a wormtable from a VCF file
see the :ref:`tutorial `.

In the :ref:`api-examples` section we take an informal tour of the API using a
small example table. The :ref:`api-reference` section provides concrete API
documentation for the :mod:`wormtable` module.

.. _api-examples:

---------
Examples
---------

.. py:currentmodule:: wormtable

To illustrate the :mod:`wormtable` API, we use a wormtable `pythons.wt` which
contains the following data:

============== ==== ====== ===== ======== ========
name           born writer actor director producer
============== ==== ====== ===== ======== ========
John Cleese    1939 60     127   0        43
Terry Gilliam  1940 25     24    18       8
Eric Idle      1943 38     74    7        5
Terry Jones    1942 50     49    16       1
Michael Palin  1943 58     56    0        1
Graham Chapman 1941 46     24    0        2
============== ==== ====== ===== ======== ========

This table consists of six columns: the name of the Python, the year they were
born and the number of entries in `IMDB `_ they have under these headings as
of 2013.

######
Tables
######

The :func:`open_table` function returns a :class:`Table` object opened for
reading, and is analogous to the :func:`open` function from the Python
standard library. So, to open our `pythons.wt` table for reading, we might do
the following::

    >>> import wormtable as wt
    >>> t = wt.open_table("pythons.wt")
    >>> len(t)
    6

Tables that are opened should be closed when they are no longer needed. This
is done using the :meth:`Table.close` method, again analogous to Python file
handling.
Trying to access a closed table results in an error::

    >>> t.close()
    >>> len(t)
    Traceback (most recent call last):
      File "", line 1, in
      File "./wormtable.py", line 637, in __len__
        self.verify_open()
      File "./wormtable.py", line 447, in verify_open
        raise ValueError("Database must be opened")
    ValueError: Database must be opened

The :class:`Table` class also supports the `context manager `_ protocol, so we
can automatically close a table that has been opened::

    with wt.open_table("pythons.wt") as t:
        print(len(t))
    # t is now closed and cannot be accessed

The :class:`Table` class supports the read-only Python sequence protocol, and
so tables can be treated like a two-dimensional list in many ways. For
example::

    >>> t = wt.open_table("pythons.wt")
    >>> t[0]
    (0, b'John Cleese', 1939, 60, 127, 0, 43)
    >>> t[-1]
    (5, b'Graham Chapman', 1941, 46, 24, 0, 2)
    >>> t[4:]
    [(4, b'Michael Palin', 1943, 58, 56, 0, 1), (5, b'Graham Chapman', 1941, 46, 24, 0, 2)]

Rows are returned as tuples, with values for each column occupying the
corresponding position.

Each table consists of a fixed number of columns, which describe the size and
type of the data in the column. The :class:`Column` class has some methods to
query these types and sizes, and these are accessed either via the
:meth:`Table.columns` method, or the :meth:`Table.get_column` method::

    >>> [c.get_name() for c in t.columns()]
    ['row_id', 'name', 'born', 'writer', 'actor', 'director', 'producer']
    >>> c = t.get_column("born")
    >>> (c.get_type_name(), c.get_element_size())
    ('uint', 2)

This tells us that the ``born`` column holds unsigned integer data with an
element size of 2, and so it can store values from 0 to 65534. See
:ref:`data-types-index` for details on the various data types and sizes
supported by wormtable.

The first column in every wormtable is an unsigned integer column, called
``row_id``. This is the column used to index rows, and the size of this column
determines the number of rows that can be stored in the table.
As a result, we always have ``t[j][0] == j``::

    >>> [t[j][0] for j in range(len(t))]
    [0, 1, 2, 3, 4, 5]

#######
Cursors
#######

Suppose we are only interested in the name and the birth year of the pythons.
We could do something like::

    >>> t = wt.open_table("pythons.wt")
    >>> [(r[1], r[2]) for r in t]
    [(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]

This is very inefficient if we have a large number of columns, because
:mod:`wormtable` must build a tuple containing all of the columns in each row,
even though most of this will not be used. It is also inconvenient: we must
remember that the `name` column is in position 1, and the `born` column is in
position 2.

A much more efficient and convenient approach is to use a *cursor*. Cursors
provide a simple means of iterating over rows, and retrieving values for a
given set of columns. Repeating the example above::

    >>> [r for r in t.cursor(["name", "born"])]
    [(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]

The :meth:`Table.cursor` method returns an iterator over the rows in a table
for a list of columns. (Cursors are intended to be used over very large
datasets, and so we would not usually construct a list of the rows.)

The :meth:`Table.cursor` method also provides a way to restrict the rows
retrieved from the table using the *start* and *stop* arguments (this is
analogous to the built-in :func:`range` function). For example, to only
retrieve rows 1 to 3, we would do the following (rows are zero-indexed in
wormtable)::

    >>> [r for r in t.cursor(["name", "born"], start=1, stop=4)]
    [(b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942)]

Note that *start* is **inclusive** and *stop* is **exclusive**.
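
The column selection and half-open *start*/*stop* semantics described above can be modelled in plain Python. The following is only an illustrative sketch that does not use :mod:`wormtable` at all: the names ``ROWS``, ``COLUMNS`` and ``model_cursor`` are hypothetical, and the real cursor reads from disk rather than from an in-memory list.

```python
from itertools import islice

# In-memory stand-in for the example table (row_id column omitted).
COLUMNS = ["name", "born", "writer", "actor", "director", "producer"]
ROWS = [
    (b"John Cleese", 1939, 60, 127, 0, 43),
    (b"Terry Gilliam", 1940, 25, 24, 18, 8),
    (b"Eric Idle", 1943, 38, 74, 7, 5),
    (b"Terry Jones", 1942, 50, 49, 16, 1),
    (b"Michael Palin", 1943, 58, 56, 0, 1),
    (b"Graham Chapman", 1941, 46, 24, 0, 2),
]


def model_cursor(columns, start=0, stop=None):
    """Hypothetical model: yield the requested columns for rows in [start, stop)."""
    positions = [COLUMNS.index(name) for name in columns]
    # islice is half-open, like range(): start is included, stop is not.
    for row in islice(ROWS, start, stop):
        yield tuple(row[p] for p in positions)


result = list(model_cursor(["name", "born"], start=1, stop=4))
print(result)
```

Because the generator builds only the requested values for each row, it mirrors the reason cursors are preferred over indexing into full row tuples.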

##############
Simple Indexes
##############

Suppose we wished to rank the Monty Python team in terms of writing credits
from IMDB. We could simply retrieve the columns that we are interested in and
sort them in terms of the ``writer`` column using the built-in :func:`sorted`
function. This does not work very well, however, if we have millions of rows
in our table. It is very slow, and may not even be possible if there are too
many rows to fit in memory.

An *index* in wormtable is a persistent sorting of a table with respect to a
given column (or list of columns, as we see in the `Compound Indexes`_
section). Indexes are extremely useful, and can be used to make many different
operations more efficient. Each index has a *name*, which is its unique
identifier. Indexes are created using the ``wtadmin`` command line tool.

To open an index on a table, we use the :meth:`Table.open_index` method. For
example, to open an index called ``writer``, we might use::

    >>> i = t.open_index("writer")

The :meth:`Table.open_index` method is directly analogous to the
:func:`open_table` function used to open tables. Indexes should be closed
after use, like tables, and also support the `context manager `_ protocol to
automatically close indexes::

    with t.open_index("writer") as i:
        print(i.max_key())
    # Index i is now closed and cannot be accessed

Indexes sort the *keys* in the columns of interest, and map these keys to the
rows of the table where they are found. To get the minimum and maximum keys
from the index, we use the :meth:`Index.min_key` and :meth:`Index.max_key`
methods::

    >>> i = t.open_index("writer")
    >>> (i.min_key(), i.max_key())
    (25, 60)

This tells us that the least productive Python has 25 writing credits on IMDB,
and the most has 60. This does not tell us *who* they are though. To get
information about other columns, we must use a *cursor*::

    >>> for r in i.cursor(["name", "writer"]):
    ...     print(r)
    ...
    (b'Terry Gilliam', 25)
    (b'Eric Idle', 38)
    (b'Graham Chapman', 46)
    (b'Terry Jones', 50)
    (b'Michael Palin', 58)
    (b'John Cleese', 60)

Just like the :meth:`Table.cursor` method, :meth:`Index.cursor` iterates over
rows in the table for a selection of columns. The difference between the two
is that the *order* in which the rows are returned is the order defined by the
index. The *start* and *stop* arguments to the function are also now in terms
of index keys, and not row positions. This gives us a very flexible method of
obtaining rows from the table based on the *values* that they contain. For
example, if we are only interested in the Pythons who have between 30
(inclusive) and 50 (exclusive) writing credits, we can write::

    >>> for r in i.cursor(["name", "writer"], start=30, stop=50):
    ...     print(r)
    ...
    (b'Eric Idle', 38)
    (b'Graham Chapman', 46)

##############
Index Counters
##############

To find out the number of rows in a table that correspond to a given index
key, we use a counter object. This is closely modelled on the
:class:`collections.Counter` class; it is a mapping from keys to the number of
rows in the table containing this key. For example, if we make an index on the
``director`` column::

    >>> i = t.open_index("director")
    >>> c = i.counter()
    >>> for k, v in c.items():
    ...     print(k, "->", v)
    ...
    0 -> 3
    7 -> 1
    16 -> 1
    18 -> 1

This shows that there are 3 Pythons who have directed 0 films, and the three
others have directed 7, 16 and 18 respectively. Counters implement the
read-only Python mapping protocol, and so can be treated very much like a
dictionary::

    >>> c[0]
    3
    >>> c[1]
    0
    >>> len(c)
    4

################
Compound Indexes
################

Wormtable also supports indexes over more than one column. These differ from
simple indexes in that the keys for each index are constructed by
concatenating the values from the constituent columns, in the order that they
are specified.
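
The ordering that results from concatenating column values can be illustrated with ordinary Python tuples, which also compare column by column. This is a rough sketch only: tuples stand in for wormtable's concatenated keys, and ``CREDITS`` and ``compound_order`` are hypothetical names that are not part of the :mod:`wormtable` API.

```python
# (name, director, producer) values taken from the example table.
CREDITS = [
    (b"John Cleese", 0, 43),
    (b"Terry Gilliam", 18, 8),
    (b"Eric Idle", 7, 5),
    (b"Terry Jones", 16, 1),
    (b"Michael Palin", 0, 1),
    (b"Graham Chapman", 0, 2),
]


def compound_order(rows):
    """Sort rows by a (director, producer) key, first column first."""
    # The key tuple plays the role of the concatenated index key.
    return sorted(rows, key=lambda row: (row[1], row[2]))


ordered = compound_order(CREDITS)
for name, director, producer in ordered:
    print(name, director, producer)
```

Rows that tie on ``director`` are ordered by ``producer``, which is the lexicographic behaviour a compound index over these two columns would be expected to show.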
For example, we can make an index on the columns ``director`` and
``producer``, which we call ``director+producer``::

    >>> i = t.open_index("director+producer")
    >>> for r in i.cursor(["name", "director", "producer"]):
    ...     print(r)
    ...
    (b'Michael Palin', 0, 1)
    (b'Graham Chapman', 0, 2)
    (b'John Cleese', 0, 43)
    (b'Eric Idle', 7, 5)
    (b'Terry Jones', 16, 1)
    (b'Terry Gilliam', 18, 8)

This lists the rows in the order defined by the index. Keys are sorted
lexicographically, so that we sort on the first column first, and if there are
duplicate values for the first column we then sort on the second column. Here,
for example, Michael Palin, Graham Chapman and John Cleese have all directed 0
films. But since this is a compound index, we then sort on the producer
column, giving the ordering that we see.

Since keys now contain values from multiple columns, the :meth:`Index.min_key`
and :meth:`Index.max_key` methods now return tuples::

    >>> i.min_key()
    (0, 1)
    >>> i.max_key()
    (18, 8)

These are also more flexible now, however, as we can get the minimum and
maximum keys with a given prefix::

    >>> i.min_key(7)
    (7, 5)
    >>> i.max_key(0)
    (0, 43)

The *start* and *stop* arguments to the :meth:`Index.cursor` method also
support this flexible key prefixing. Suppose we wish to find all the Pythons
with at least 7 directorial credits::

    >>> [r for r in i.cursor(["name", "director", "producer"], start=7)]
    [(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]

We get the same answer if we specify 5 or less for the ``producer`` column::

    >>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 0))]
    [(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]

But we lose Eric Idle if we require 6 or more production credits::

    >>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 6))]
    [(b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]

.. _api-reference:

----------------
Module reference
----------------

.. module:: wormtable
    :platform: Unix
    :synopsis: Write-once read-many table for large datasets.

.. autofunction:: open_table

####################
:class:`Table` class
####################

.. class:: Table

    The table class represents a wormtable located in a *home directory*. The
    home directory for a table stores the Berkeley DB databases used to store
    the rows and indexes, along with some metadata to describe these tables.
    The files within a home directory are not intended to be accessed
    directly; modifications to a table should be made through this API only.

    .. automethod:: cursor
    .. automethod:: open_index
    .. automethod:: open
    .. automethod:: close
    .. automethod:: columns
    .. automethod:: get_column

####################
:class:`Index` class
####################

.. class:: Index

    Indexes define a *sorting order* of the rows in a table. An index over a
    set of columns creates a sorting order over the table by concatenating the
    values from the columns in question together (the *keys*) and storing the
    mapping of these keys to the rows in the table in which the key occurs.

    .. automethod:: cursor
    .. automethod:: Index.open
    .. automethod:: Index.close
    .. automethod:: Index.min_key
    .. automethod:: Index.max_key
    .. automethod:: Index.keys
    .. automethod:: Index.counter

#####################
:class:`Column` class
#####################

.. class:: Column

    Columns define the storage types for values within a table.

    .. automethod:: get_name
    .. automethod:: get_description
    .. automethod:: get_type
    .. automethod:: get_type_name
    .. automethod:: get_element_size
    .. automethod:: get_num_elements