API Documentation

Release: 0.1
Date: April 08, 2015

This is the API documentation for wormtable. The documentation currently concentrates on the read API, since the initial release is intended primarily for use with VCF data. For details on how to build a wormtable from a VCF file, see the tutorial.

In the Examples section we take an informal tour of the API using a small example table. The Module reference section provides concrete API documentation for the wormtable module.

Examples

To illustrate the wormtable API, we use a wormtable pythons.wt which contains the following data:

name            born  writer  actor  director  producer
John Cleese     1939      60    127         0        43
Terry Gilliam   1940      25     24        18         8
Eric Idle       1943      38     74         7         5
Terry Jones     1942      50     49        16         1
Michael Palin   1943      58     56         0         1
Graham Chapman  1941      46     24         0         2

This table consists of six columns: the name of the Python, the year they were born, and the number of IMDB entries they have as writer, actor, director and producer as of 2013.

Tables

The open_table() function returns a Table object opened for reading, and is analogous to the open() function from the Python standard library. So, to open our pythons.wt table for reading, we might do the following:

>>> import wormtable as wt
>>> t = wt.open_table("pythons.wt")
>>> len(t)
6

Tables that are opened should be closed when they are no longer needed. This is done using the Table.close() method, again analogous to Python file handling. Trying to access a closed table results in an error:

>>> t.close()
>>> len(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./wormtable.py", line 637, in __len__
    self.verify_open()
  File "./wormtable.py", line 447, in verify_open
    raise ValueError("Database must be opened")
ValueError: Database must be opened

The Table class also supports the context manager protocol, so we can automatically close a table that has been opened:

with wt.open_table("pythons.wt") as t:
    print(len(t))
# t is now closed and cannot be accessed
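The close-on-exit behaviour described above can be illustrated without wormtable. The following is a minimal sketch of the context manager protocol using a hypothetical stand-in class, FakeTable, which is not part of wormtable; only the error message is taken from the traceback shown earlier.

```python
class FakeTable:
    """Hypothetical stand-in for a table; not part of wormtable."""

    def __init__(self, rows):
        self._rows = rows
        self._open = True

    def close(self):
        self._open = False

    def __len__(self):
        if not self._open:
            raise ValueError("Database must be opened")
        return len(self._rows)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()   # runs even if the body of the with block raised
        return False   # do not swallow exceptions


with FakeTable([("row",)] * 6) as t:
    n = len(t)         # the table is open inside the block

try:
    len(t)             # the table was closed on leaving the block
    closed_error = None
except ValueError as e:
    closed_error = str(e)
```

The with statement is equivalent to calling close() in a try/finally block, which is why it is the safer choice when the body may raise.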

The Table class supports the read-only Python sequence protocol, and so tables can be treated like a two-dimensional list in many ways. For example:

>>> t = wt.open_table("pythons.wt")
>>> t[0]
(0, b'John Cleese', 1939, 60, 127, 0, 43)
>>> t[-1]
(5, b'Graham Chapman', 1941, 46, 24, 0, 2)
>>> t[4:]
[(4, b'Michael Palin', 1943, 58, 56, 0, 1), (5, b'Graham Chapman', 1941, 46, 24, 0, 2)]

Rows are returned as tuples, with values for each column occupying the corresponding position. Each table consists of a fixed number of columns, which describe the size and type of the data in the column. The Column class has some methods to query these types and sizes, and these are accessed either via the Table.columns() method, or the Table.get_column() method:

>>> [c.get_name() for c in t.columns()]
['row_id', 'name', 'born', 'writer', 'actor', 'director', 'producer']
>>> c = t.get_column("born")
>>> (c.get_type_name(), c.get_element_size())
('uint', 2)

This tells us that the born column holds unsigned integer data with an element size of 2 bytes, and so it can store values from 0 to 65534. See Column types for details on the various data types and sizes supported by wormtable.
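The arithmetic behind the 0 to 65534 range can be sketched as follows. The helper uint_range is hypothetical, and the assumption that exactly one bit pattern is unavailable for data is inferred only from the figure quoted above; the Column types section remains the authoritative source.

```python
def uint_range(element_size):
    """Return (min, max) for an unsigned column of element_size bytes.

    Assumption: one of the 2**(8 * element_size) bit patterns is not
    available for data, which is what the 0..65534 range for a 2-byte
    column implies.
    """
    return 0, 2 ** (8 * element_size) - 2


born_range = uint_range(2)  # (0, 65534)
```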

The first column in every wormtable is an unsigned integer column, called row_id. This is the column used to index rows, and the size of this column determines the number of rows that can be stored in the table. As a result, we always have t[j][0] == j:

>>> [t[j][0] for j in range(len(t))]
[0, 1, 2, 3, 4, 5]

Cursors

Suppose we are only interested in the name and the birth year of the pythons. We could do something like:

>>> t = wt.open_table("pythons.wt")
>>> [(r[1], r[2]) for r in t]
[(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]

This is very inefficient if we have a large number of columns, because wormtable must build a tuple containing all of the columns in each row, even though most of this will not be used. It is also inconvenient: we must remember that the name column is in position 1, and the born column is in position 2.

A much more efficient and convenient approach is to use a cursor. Cursors provide a simple means of iterating over rows, and retrieving values for a given set of columns. Repeating the example above:

>>> [r for r in t.cursor(["name", "born"])]
[(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]

The Table.cursor() method returns an iterator over the rows in a table for a list of columns. (Cursors are intended to be used over very large datasets, and so we would not usually construct a list of the rows.) The Table.cursor() method also provides a way to restrict the rows retrieved from the table using the start and stop arguments (this is analogous to the built in range() function). For example, to only retrieve rows 1 to 3, we would do the following (rows are zero-indexed in wormtable):

>>> [r for r in t.cursor(["name", "born"], start=1, stop=4)]
[(b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942)]

Note that start is inclusive and stop is exclusive.
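The start and stop semantics can be modelled in plain Python. This sketch does not use wormtable: model_cursor is a hypothetical function operating on an in-memory copy of the example data, and it only mirrors the range()-like behaviour described above.

```python
from itertools import islice

COLUMN_NAMES = ["row_id", "name", "born", "writer", "actor", "director", "producer"]
ROWS = [
    (0, b"John Cleese", 1939, 60, 127, 0, 43),
    (1, b"Terry Gilliam", 1940, 25, 24, 18, 8),
    (2, b"Eric Idle", 1943, 38, 74, 7, 5),
    (3, b"Terry Jones", 1942, 50, 49, 16, 1),
    (4, b"Michael Palin", 1943, 58, 56, 0, 1),
    (5, b"Graham Chapman", 1941, 46, 24, 0, 2),
]


def model_cursor(rows, columns, start=0, stop=None):
    """Yield only the requested columns for rows with start <= row_id < stop."""
    positions = [COLUMN_NAMES.index(c) for c in columns]
    for row in islice(rows, start, stop):
        yield tuple(row[p] for p in positions)


selected = list(model_cursor(ROWS, ["name", "born"], start=1, stop=4))
# [(b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942)]
```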

Simple Indexes

Suppose we wished to rank the Monty Python team in terms of writing credits from IMDB. We could simply retrieve the columns that we are interested in and sort them in terms of the writer column using the built in sorted() function. This does not work very well, however, if we have millions of rows in our table. It is very slow, and may not even be possible if there are too many rows to fit in memory.
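For a table as small as our example, the in-memory approach just described might look like the following sketch. The rows are written out literally here so that the block is self-contained; in practice they would come from t.cursor(["name", "writer"]).

```python
from operator import itemgetter

# (name, writer) pairs, as a cursor over those two columns would yield them.
rows = [
    (b"John Cleese", 60),
    (b"Terry Gilliam", 25),
    (b"Eric Idle", 38),
    (b"Terry Jones", 50),
    (b"Michael Palin", 58),
    (b"Graham Chapman", 46),
]

# sorted() materialises every row in memory before sorting, which is
# exactly what stops this approach from scaling to very large tables.
ranked = sorted(rows, key=itemgetter(1), reverse=True)
```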

An index in wormtable is a persistent sorting of a table with respect to a given column (or list of columns, as we see in the Compound Indexes section). Indexes are extremely useful, and can be used to make many different operations more efficient. Each index has a name, which is its unique identifier. Indexes are created using the wtadmin command line tool.

To open an index on a table, we use the Table.open_index() method. For example, to open an index called writer, we might use:

>>> i = t.open_index("writer")

The Table.open_index() method is directly analogous to the open_table() function used to open tables. Indexes should be closed after use, like tables, and also support the context manager protocol to automatically close indexes:

with t.open_index("writer") as i:
    print(i.max_key())
# Index i is now closed and cannot be accessed

Indexes sort the keys in the columns of interest, and map these keys to the rows of the table where they are found. To get the minimum and maximum keys from the index, we use the Index.min_key() and Index.max_key() methods:

>>> i = t.open_index("writer")
>>> (i.min_key(), i.max_key())
(25, 60)

This tells us that the least productive Python has 25 writing credits on IMDB, and the most has 60. This does not tell us who they are though. To get information about other columns, we must use a cursor:

>>> for r in i.cursor(["name", "writer"]):
...     print(r)
...
(b'Terry Gilliam', 25)
(b'Eric Idle', 38)
(b'Graham Chapman', 46)
(b'Terry Jones', 50)
(b'Michael Palin', 58)
(b'John Cleese', 60)

Just like the Table.cursor() method, Index.cursor() iterates over rows in the table for a selection of columns. The difference between the two is that the order in which the rows are returned is the order defined by the index. The start and stop arguments to the function are also now in terms of index keys, and not row positions. This gives us a very flexible method of obtaining rows from the table based on the values that they contain. For example, if we are only interested in the Pythons who have between 30 (inclusive) and 50 (exclusive) writing credits, we can write:

>>> for r in i.cursor(["name", "writer"], start=30, stop=50):
...     print(r)
...
(b'Eric Idle', 38)
(b'Graham Chapman', 46)
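One way to picture a key-range query is as a binary search over a sorted list of keys. The sketch below is a pure-Python model under that assumption, not a description of how wormtable stores its indexes; model_index_cursor is a hypothetical name.

```python
from bisect import bisect_left

# (writer, name) pairs kept in key order, standing in for the "writer" index.
entries = sorted([
    (60, b"John Cleese"),
    (25, b"Terry Gilliam"),
    (38, b"Eric Idle"),
    (50, b"Terry Jones"),
    (58, b"Michael Palin"),
    (46, b"Graham Chapman"),
])
keys = [key for key, _ in entries]


def model_index_cursor(start=None, stop=None):
    """Return (name, writer) rows with start <= key < stop, in key order."""
    lo = 0 if start is None else bisect_left(keys, start)
    hi = len(keys) if stop is None else bisect_left(keys, stop)
    return [(name, key) for key, name in entries[lo:hi]]


in_range = model_index_cursor(start=30, stop=50)
# [(b'Eric Idle', 38), (b'Graham Chapman', 46)]
```

Because the keys are already sorted, locating the two ends of the range needs only a logarithmic number of comparisons, and no row outside the range is visited.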

Index Counters

To find out the number of rows in a table that correspond to a given index key, we use a Counter object. This is closely modelled on the collections.Counter class; it is a mapping from keys to the number of rows in the table containing this key. For example, if we make an index on the director column:

>>> i = t.open_index("director")
>>> c = i.counter()
>>> for k, v in c.items():
...     print(k, "->",  v)
...
0 -> 3
7 -> 1
16 -> 1
18 -> 1

This shows that there are 3 Pythons who have directed 0 films, and the three others have directed 7, 16 and 18 respectively. Counters implement the read-only Python mapping protocol, and so can be treated very much like a dictionary:

>>> c[0]
3
>>> c[1]
0
>>> len(c)
4
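Since the counter is modelled on collections.Counter, the same results can be reproduced with the standard library class on the values of the director column. This is only a comparison sketch: collections.Counter does not keep its keys in sorted order, so the items are sorted explicitly here.

```python
from collections import Counter

# Values of the director column, in row order.
director = [0, 18, 7, 16, 0, 0]

c = Counter(director)
items = sorted(c.items())   # [(0, 3), (7, 1), (16, 1), (18, 1)]
missing = c[1]              # 0: absent keys count as zero and are not inserted
distinct = len(c)           # 4 distinct keys
```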

Compound Indexes

Wormtable also supports indexes over more than one column. These differ from simple indexes in that the keys for each index are constructed by concatenating the values from the constituent columns, in the order that they are specified. For example, we can make an index on the columns director and producer, which we call director+producer:

>>> i = t.open_index("director+producer")
>>> for r in i.cursor(["name", "director", "producer"]):
...     print(r)
...
(b'Michael Palin', 0, 1)
(b'Graham Chapman', 0, 2)
(b'John Cleese', 0, 43)
(b'Eric Idle', 7, 5)
(b'Terry Jones', 16, 1)
(b'Terry Gilliam', 18, 8)

This lists the rows in the order defined by the index. Keys are sorted lexicographically, so that we sort on the first column first, and if there are duplicate values for the first column we then sort on the second column. Here, for example, Michael Palin, Graham Chapman and John Cleese have all directed 0 films. But since this is a compound index, we then sort on the producer column, giving the ordering that we see.
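Python tuples compare lexicographically in the same way, so the ordering above can be reproduced with sorted() and a tuple key. This sketch works on an in-memory copy of the three columns and does not use wormtable.

```python
# (name, director, producer) rows in table order.
rows = [
    (b"John Cleese", 0, 43),
    (b"Terry Gilliam", 18, 8),
    (b"Eric Idle", 7, 5),
    (b"Terry Jones", 16, 1),
    (b"Michael Palin", 0, 1),
    (b"Graham Chapman", 0, 2),
]

# Sort on director first; ties are broken by producer.
ordered = sorted(rows, key=lambda r: (r[1], r[2]))
names = [r[0] for r in ordered]
```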

Since keys now contain values from multiple columns, the Index.min_key() and Index.max_key() methods now return tuples:

>>> i.min_key()
(0, 1)
>>> i.max_key()
(18, 8)

These are also more flexible now, however, as we can get the minimum and maximum keys with a given prefix:

>>> i.min_key(7)
(7, 5)
>>> i.max_key(0)
(0, 43)
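The four results above are consistent with taking the minimum or maximum over the keys that begin with the given prefix. The sketch below models exactly that and nothing more: model_min_key and model_max_key are hypothetical names, and the behaviour for a prefix that matches no key is deliberately not modelled (these functions would raise ValueError).

```python
# The (director, producer) keys of the example table.
keys = [(0, 43), (18, 8), (7, 5), (16, 1), (0, 1), (0, 2)]


def with_prefix(prefix):
    """Return the keys whose leading values equal the prefix tuple."""
    n = len(prefix)
    return [k for k in keys if k[:n] == prefix]


def model_min_key(*prefix):
    return min(with_prefix(prefix))


def model_max_key(*prefix):
    return max(with_prefix(prefix))


results = (model_min_key(), model_max_key(), model_min_key(7), model_max_key(0))
# ((0, 1), (18, 8), (7, 5), (0, 43))
```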

The start and stop arguments to the Index.cursor() method also support this flexible key prefixing. Suppose we wish to find all the Pythons with at least 7 directorial credits:

>>> [r for r in i.cursor(["name", "director", "producer"], start=7)]
[(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]

We get the same answer if we also specify a value of 5 or less for the producer column, for example 0:

>>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 0))]
[(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]

But we lose Eric Idle if we require 6 or more production credits:

>>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 6))]
[(b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]
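Python's tuple comparison gives a convenient model of these three queries, because a shorter tuple sorts before any longer tuple that it prefixes: (7,) < (7, 5). The sketch below is a pure-Python illustration under that assumption; rows_from is a hypothetical helper, not part of wormtable.

```python
from bisect import bisect_left

# ((director, producer), name) pairs in key order.
entries = sorted([
    ((0, 43), b"John Cleese"),
    ((18, 8), b"Terry Gilliam"),
    ((7, 5), b"Eric Idle"),
    ((16, 1), b"Terry Jones"),
    ((0, 1), b"Michael Palin"),
    ((0, 2), b"Graham Chapman"),
])
keys = [key for key, _ in entries]


def rows_from(start):
    """Return (name, director, producer) rows whose key is >= start."""
    if not isinstance(start, tuple):
        start = (start,)          # a single value acts as a singleton tuple
    lo = bisect_left(keys, start)
    return [(name,) + key for key, name in entries[lo:]]


a = rows_from(7)        # Eric Idle, Terry Jones, Terry Gilliam
b = rows_from((7, 0))   # same rows: (7, 0) <= (7, 5)
c = rows_from((7, 6))   # Eric Idle dropped: (7, 5) < (7, 6)
```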

Module reference

wormtable.open_table(homedir, db_cache_size='16M')

Returns a table opened in read mode with cache size set to the specified value. This is the recommended interface when opening tables for reading.

See Cache tuning for details on setting cache sizes. The cache size may be either an integer specifying the size in bytes or a string with the optional suffixes K, M or G.

Parameters:
  • homedir (str) – the filesystem path for the wormtable home directory
  • db_cache_size (str or int) – the Berkeley DB cache size for the table
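To make the accepted cache size forms concrete, here is a sketch of how such a value could be converted to bytes. The function parse_cache_size is hypothetical and is not wormtable's own parser; in particular, treating K, M and G as multiples of 1024 and accepting lower-case suffixes are assumptions. See Cache tuning for the actual rules.

```python
def parse_cache_size(value):
    """Convert an int (bytes) or a string such as '16M' to a byte count.

    Assumptions: K, M and G are binary multiples (1024, 1024**2, 1024**3)
    and the suffix is case-insensitive.
    """
    if isinstance(value, int):
        return value
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = value[-1].upper()
    if suffix in multipliers:
        return int(value[:-1]) * multipliers[suffix]
    return int(value)   # plain digits: already a size in bytes


default_bytes = parse_cache_size("16M")  # 16777216 under the assumptions above
```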

Table class

class wormtable.Table

The table class represents a wormtable located in a home directory. The home directory for a table stores the Berkeley DB databases used to store the rows and indexes, along with some metadata to describe these tables. The files within a home directory are not intended to be accessed directly; modifications to a table should be made through this API only.

cursor(columns, start=0, stop=None)

Returns a cursor over the rows in this table, retrieving only the specified columns. Rows are returned as Tuple objects, with the value for each column in the same position as the corresponding column in the list of columns provided.

The columns specified may be either Column instances, integers or strings. If an integer is provided, the column at the specified position is used and if a string is provided, the column with the specified identifier is used. These may be mixed arbitrarily.

The start and stop arguments are directly analogous to the built in range() function. The cursor will iterate over all rows such that start <= row_id < stop. Note that start is inclusive, and stop is exclusive.

Parameters:
  • columns (sequence of column identifiers) – columns to retrieve from the table
  • start (int) – the row id of the first row returned
  • stop (int) – the row id of the last row returned, plus 1

open_index(index_name, db_cache_size='16M')

Returns an index with the specified name opened in read mode with the specified db_cache_size.

See Cache tuning for details on setting cache sizes. The cache size may be either an integer specifying the size in bytes or a string with the optional suffixes K, M or G.

Parameters:
  • index_name (str) – the name of the index to open
  • db_cache_size (str or int) – the size of the cache on the index

open(mode)

Opens this table in the specified mode. Mode must be one of ‘r’ or ‘w’.

Parameters:
  • mode (str) – the mode to open the table in

close()

Closes this table, freeing all underlying resources.

columns()

Returns the list of columns in this table.

get_column(col_id)

Returns the Column corresponding to the specified id. If this is an integer, we return the column at this position; if it is a string we return the column with the specified name.

Parameters:
  • col_id (str or int) – the column identifier
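The lookup rule for column identifiers (an integer selects by position, a string selects by name) can be sketched as follows. The helper resolve_position is hypothetical and works on a plain list of the example table's column names; it does not handle Column instances.

```python
COLUMN_NAMES = ["row_id", "name", "born", "writer", "actor", "director", "producer"]


def resolve_position(col_id):
    """Map an int (position) or str (name) identifier to a column position."""
    if isinstance(col_id, int):
        if not 0 <= col_id < len(COLUMN_NAMES):
            raise IndexError("column position out of range")
        return col_id
    return COLUMN_NAMES.index(col_id)   # raises ValueError for unknown names


# Identifiers of both kinds may be mixed in one list.
positions = [resolve_position(c) for c in ["name", 2, "producer"]]
```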

Index class

class wormtable.Index

Indexes define a sorting order of the rows in a table. An index over a set of columns creates a sorting order over the table by concatenating the values from the columns in question together (the keys) and storing the mapping of these keys to the rows in the table in which the key occurs.

cursor(columns, start='KEY_UNSET', stop='KEY_UNSET')

Returns a cursor over the rows in the table in the order defined by this index, retrieving only the specified columns. Rows are returned as Tuple objects, with the value for each column in the same position as the corresponding column in the list of columns provided.

The columns specified may be either Column instances, integers or strings. If an integer is provided, the column at the specified position is used and if a string is provided, the column with the specified identifier is used. These may be mixed arbitrarily.

The start and stop arguments are analogous to the built in range() function. The cursor will iterate over all rows such that start <= key < stop. Note that start is inclusive, and stop is exclusive. These parameters may specify values for up to n columns, for an n column index. For multiple values, a tuple must be provided; a single value of the relevant type is considered to be the same as a singleton tuple consisting of this value.

Parameters:
  • columns (sequence of column identifiers) – columns to retrieve from the table
  • start – the key prefix that is less than or equal to all keys in returned rows.
  • stop – the key prefix that is greater than all keys in returned rows

open(mode)

Opens this index in the specified mode. Mode must be one of ‘r’ or ‘w’.

Parameters:
  • mode (str) – the mode to open the index in

close()

Closes this Index.

min_key(*k)

Returns the smallest key greater than or equal to the specified prefix.

max_key(*k)

Returns the largest index key less than the specified prefix.

keys()

Returns an iterator over all the keys in this Index in sorted order.

counter()

Returns an IndexCounter object for this index. This provides an efficient method of iterating over the keys in the index.

Column class

class wormtable.Column

Columns define the storage types for values within a table.

get_name()

Returns the name of this column. This is the unique identifier for a column.

get_description()

Returns the description of this column. This is an optional string describing the purpose of a column.

get_type()

Returns the type code for this column. This is one of WT_INT, WT_UINT, WT_FLOAT or WT_CHAR.

get_type_name()

Returns the string representation of the type of this Column.

get_element_size()

Returns the size of each element in the column in bytes.

get_num_elements()

Returns the number of elements in this column. This is either a positive integer >= 1 or WT_VAR1. If the number of elements is WT_VAR1, the number of elements in the column is variable, from 0 to 255.