API Documentation¶
Release: | 0.1 |
---|---|
Date: | April 08, 2015 |
This is the API documentation for wormtable. The documentation currently concentrates on the read API, since the initial release is intended primarily for use with VCF data. For details on how to build a wormtable from a VCF file, see the tutorial.
In the Examples section we take an informal tour of the API using a small example table. The Module reference section provides concrete API documentation for the wormtable module.
Examples¶
To illustrate the wormtable API, we use a wormtable pythons.wt which contains the following data:
name | born | writer | actor | director | producer |
---|---|---|---|---|---|
John Cleese | 1939 | 60 | 127 | 0 | 43 |
Terry Gilliam | 1940 | 25 | 24 | 18 | 8 |
Eric Idle | 1943 | 38 | 74 | 7 | 5 |
Terry Jones | 1942 | 50 | 49 | 16 | 1 |
Michael Palin | 1943 | 58 | 56 | 0 | 1 |
Graham Chapman | 1941 | 46 | 24 | 0 | 2 |
This table consists of six columns: the name of the Python, the year they were born, and the number of entries they have in IMDB as writer, actor, director and producer, as of 2013.
Tables¶
The open_table() function returns a Table object opened for reading, and is analogous to the open() function from the Python standard library. So, to open our pythons.wt table for reading, we might do the following:
>>> import wormtable as wt
>>> t = wt.open_table("pythons.wt")
>>> len(t)
6
Tables that are opened should be closed when they are no longer needed. This is done using the Table.close() method, again analogous to Python file handling. Trying to access a closed table results in an error:
>>> t.close()
>>> len(t)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "./wormtable.py", line 637, in __len__
    self.verify_open()
  File "./wormtable.py", line 447, in verify_open
    raise ValueError("Database must be opened")
ValueError: Database must be opened
The Table class also supports the context manager protocol, so we can automatically close a table that has been opened:
with wt.open_table("pythons.wt") as t:
    print(len(t))
# t is now closed and cannot be accessed
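The open and close life cycle described here can be tried out without wormtable installed. The FakeTable class below is a hypothetical stand-in and not part of the wormtable API; it only mimics the behaviour shown in this section, namely that len() works while the table is open and raises ValueError once the table has been closed, whether explicitly or by leaving a with block.

```python
class FakeTable:
    """Hypothetical stand-in for a wormtable Table (illustration only)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._open = True

    def verify_open(self):
        # Mirrors the check visible in the traceback shown earlier.
        if not self._open:
            raise ValueError("Database must be opened")

    def __len__(self):
        self.verify_open()
        return len(self._rows)

    def close(self):
        self._open = False

    # Context manager protocol: close the table when the block exits,
    # even if the block raises an exception.
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not swallow exceptions


with FakeTable(["row"] * 6) as t:
    size_inside = len(t)  # the table is open here

try:
    len(t)  # the table was closed on leaving the with block
    closed_error = None
except ValueError as e:
    closed_error = str(e)

print(size_inside, closed_error)
```

The with form is usually preferable to calling close() by hand, because the table is closed even when the body of the block fails part way through.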
The Table class supports the read-only Python sequence protocol, and so tables can be treated like a two-dimensional list in many ways. For example:
>>> t = wt.open_table("pythons.wt")
>>> t[0]
(0, b'John Cleese', 1939, 60, 127, 0, 43)
>>> t[-1]
(5, b'Graham Chapman', 1941, 46, 24, 0, 2)
>>> t[4:]
[(4, b'Michael Palin', 1943, 58, 56, 0, 1), (5, b'Graham Chapman', 1941, 46, 24, 0, 2)]
Rows are returned as tuples, with values for each column occupying the corresponding position. Each table consists of a fixed number of columns, each of which describes the size and type of the data it holds. The Column class has some methods to query these types and sizes, and these are accessed either via the Table.columns() method, or the Table.get_column() method:
>>> [c.get_name() for c in t.columns()]
['row_id', 'name', 'born', 'writer', 'actor', 'director', 'producer']
>>> c = t.get_column("born")
>>> (c.get_type_name(), c.get_element_size())
('uint', 2)
This tells us that the born column holds unsigned integer data with an element size of 2, and so it can store values from 0 to 65534. See Column types for details on the various data types and sizes supported by wormtable.
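The 65534 limit follows from simple arithmetic on the element size. A 2-byte field has 65536 distinct bit patterns, so the quoted range of 0 to 65534 implies that the largest pattern is not available for data. The sketch below makes that assumption explicit; the helper name uint_range is ours and not part of wormtable, and Column types remains the authority on the actual limits.

```python
def uint_range(element_size):
    """Smallest and largest storable value for an unsigned integer column.

    Assumption: the largest bit pattern is held back, which is what the
    0 to 65534 range quoted for a 2-byte column implies.
    """
    num_patterns = 2 ** (8 * element_size)  # distinct values in this many bytes
    return 0, num_patterns - 2


print(uint_range(2))
```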
The first column in every wormtable is an unsigned integer column, called row_id. This is the column used to index rows, and the size of this column determines the number of rows that can be stored in the table. As a result, we always have t[j][0] == j:
>>> [t[j][0] for j in range(len(t))]
[0, 1, 2, 3, 4, 5]
Cursors¶
Suppose we are only interested in the name and the birth year of the pythons. We could do something like:
>>> t = wt.open_table("pythons.wt")
>>> [(r[1], r[2]) for r in t]
[(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]
This is very inefficient if we have a large number of columns, because wormtable
must build a tuple containing all of the columns in each row, even though most of this will
not be used. It is also inconvenient: we must remember that the name column is in position
1, and the born column is in position 2.
A much more efficient and convenient approach is to use a cursor. Cursors provide a simple means of iterating over rows, and retrieving values for a given set of columns. Repeating the example above:
>>> [r for r in t.cursor(["name", "born"])]
[(b'John Cleese', 1939), (b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942), (b'Michael Palin', 1943), (b'Graham Chapman', 1941)]
The Table.cursor() method returns an iterator over the rows in a table for a list of columns. (Cursors are intended to be used over very large datasets, and so we would not usually construct a list of the rows.)
The Table.cursor() method also provides a way to restrict the rows retrieved from the table using the start and stop arguments (this is analogous to the built-in range() function). For example, to only retrieve rows 1 to 3, we would do the following (rows are zero-indexed in wormtable):
>>> [r for r in t.cursor(["name", "born"], start=1, stop=4)]
[(b'Terry Gilliam', 1940), (b'Eric Idle', 1943), (b'Terry Jones', 1942)]
Note that start is inclusive and stop is exclusive.
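To make the cursor semantics concrete without needing a wormtable on disk, here is a minimal pure-Python sketch. The names COLUMNS, ROWS and sketch_cursor are hypothetical stand-ins for the first three columns of the example table; the function only imitates the behaviour described in this section (column selection, start inclusive, stop exclusive) and says nothing about how wormtable implements it.

```python
from itertools import islice

# Stand-in data: the first three columns of the example table.
COLUMNS = ["row_id", "name", "born"]
ROWS = [
    (0, b"John Cleese", 1939),
    (1, b"Terry Gilliam", 1940),
    (2, b"Eric Idle", 1943),
    (3, b"Terry Jones", 1942),
    (4, b"Michael Palin", 1943),
    (5, b"Graham Chapman", 1941),
]


def sketch_cursor(columns, start=0, stop=None):
    """Yield only the requested columns for rows start <= row_id < stop."""
    positions = [COLUMNS.index(name) for name in columns]
    for row in islice(ROWS, start, stop):
        yield tuple(row[p] for p in positions)


selected = list(sketch_cursor(["name", "born"], start=1, stop=4))
print(selected)

# Because the cursor is a generator, an aggregate can be computed
# one row at a time, without building a list of all rows first.
earliest = min(born for (born,) in sketch_cursor(["born"]))
print(earliest)
```

The second use is closer to how a cursor would normally be consumed on a large table: each row is examined once and then discarded.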
Simple Indexes¶
Suppose we wished to rank the Monty Python team in terms of writing credits from IMDB. We could simply retrieve the columns that we are interested in and sort them in terms of the writer column using the built-in sorted() function. This does not work very well, however, if we have millions of rows in our table. It is very slow, and may not even be possible if there are too many rows to fit in memory.
An index in wormtable is a persistent sorting of a table with respect to a given column (or list of columns, as we see in the Compound Indexes section). Indexes are extremely useful, and can be used to make many different operations more efficient. Each index has a name, which is its unique identifier. Indexes are created using the wtadmin command line tool.
To open an index on a table, we use the Table.open_index() method. For example, to open an index called writer, we might use:
>>> i = t.open_index("writer")
The Table.open_index() method is directly analogous to the open_table() function used to open tables. Indexes should be closed after use, like tables, and also support the context manager protocol to automatically close indexes:
with t.open_index("writer") as i:
    print(i.max_key())
# Index i is now closed and cannot be accessed
Indexes sort the keys in the columns of interest, and map these keys to the rows of the table where they are found. To get the minimum and maximum keys from the index, we use the Index.min_key() and Index.max_key() methods:
>>> i = t.open_index("writer")
>>> (i.min_key(), i.max_key())
(25, 60)
This tells us that the least productive Python has 25 writing credits on IMDB, and the most has 60. This does not tell us who they are though. To get information about other columns, we must use a cursor:
>>> for r in i.cursor(["name", "writer"]):
...     print(r)
...
(b'Terry Gilliam', 25)
(b'Eric Idle', 38)
(b'Graham Chapman', 46)
(b'Terry Jones', 50)
(b'Michael Palin', 58)
(b'John Cleese', 60)
Just like the Table.cursor() method, Index.cursor() iterates over rows in the table for a selection of columns. The difference between the two is that the order in which the rows are returned is the order defined by the index. The start and stop arguments to the function are also now in terms of index keys, and not row positions. This gives us a very flexible method of obtaining rows from the table based on the values that they contain. For example, if we are only interested in the Pythons who have between 30 (inclusive) and 50 (exclusive) writing credits, we can write:
>>> for r in i.cursor(["name", "writer"], start=30, stop=50):
...     print(r)
...
(b'Eric Idle', 38)
(b'Graham Chapman', 46)
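The idea of an index as a sorted mapping from keys to rows can be sketched in a few lines of plain Python. This is an in-memory illustration only: the names ROWS, writer_index and index_cursor are hypothetical, and the binary search via the standard bisect module is our choice for the sketch, not a description of how wormtable stores or searches its indexes.

```python
from bisect import bisect_left

# (row_id, name, writer) for the example table.
ROWS = [
    (0, b"John Cleese", 60),
    (1, b"Terry Gilliam", 25),
    (2, b"Eric Idle", 38),
    (3, b"Terry Jones", 50),
    (4, b"Michael Palin", 58),
    (5, b"Graham Chapman", 46),
]

# The "index": (key, row_id) pairs kept in key order.
writer_index = sorted((row[2], row[0]) for row in ROWS)
keys = [key for key, _ in writer_index]


def index_cursor(start=None, stop=None):
    """Yield (name, writer) for rows with start <= key < stop, in key order."""
    lo = 0 if start is None else bisect_left(keys, start)
    hi = len(keys) if stop is None else bisect_left(keys, stop)
    for key, row_id in writer_index[lo:hi]:
        yield ROWS[row_id][1], key


print(keys[0], keys[-1])  # minimum and maximum keys
print(list(index_cursor(30, 50)))
```

Because the keys are already sorted, locating the start and stop positions takes a binary search and does not require scanning the table, which is the property that makes range queries on an index cheap.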
Index Counters¶
To find out the number of rows in a table that correspond to a given index key, we use a Counter object. This is closely modelled on the collections.Counter class; it is a mapping from keys to the number of rows in the table containing this key. For example, if we make an index on the director column:
>>> i = t.open_index("director")
>>> c = i.counter()
>>> for k, v in c.items():
...     print(k, "->", v)
...
0 -> 3
7 -> 1
16 -> 1
18 -> 1
This shows that there are 3 Pythons who have directed 0 films, and the three others have directed 7, 16 and 18 respectively. Counters implement the read-only Python mapping protocol, and so can be treated very much like a dictionary:
>>> c[0]
3
>>> c[1]
0
>>> len(c)
4
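Since the index counter is modelled on collections.Counter, the same results can be reproduced with the standard library class on the director column. This is only a comparison sketch: collections.Counter is built in memory and is mutable, whereas the text describes the wormtable counter as read-only.

```python
from collections import Counter

# The director column of the example table, in row order.
director = [0, 18, 7, 16, 0, 0]

c = Counter(director)
for k in sorted(c):
    print(k, "->", c[k])

# A key that never occurs has a count of zero and is not added to the mapping.
print(c[0], c[1], len(c))
```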
Compound Indexes¶
Wormtable also supports indexes over more than one column. These differ from simple indexes in that the keys for each index are constructed by concatenating the values from the constituent columns, in the order that they are specified. For example, we can make an index on the columns director and producer, which we call director+producer:
>>> i = t.open_index("director+producer")
>>> for r in i.cursor(["name", "director", "producer"]):
...     print(r)
...
(b'Michael Palin', 0, 1)
(b'Graham Chapman', 0, 2)
(b'John Cleese', 0, 43)
(b'Eric Idle', 7, 5)
(b'Terry Jones', 16, 1)
(b'Terry Gilliam', 18, 8)
This lists the rows in the order defined by the index. Keys are sorted lexicographically, so that we sort on the first column first, and if there are duplicate values for the first column we then sort on the second column. Here, for example, Michael Palin, Graham Chapman and John Cleese have all directed 0 films. But since this is a compound index, we then sort on the producer column, giving the ordering that we see.
Since keys now contain values from multiple columns, the Index.min_key() and Index.max_key() methods now return tuples:
>>> i.min_key()
(0, 1)
>>> i.max_key()
(18, 8)
These are also more flexible now, however, as we can get the minimum and maximum keys with a given prefix:
>>> i.min_key(7)
(7, 5)
>>> i.max_key(0)
(0, 43)
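The prefix behaviour shown in these examples can be modelled with ordinary tuples. The sketch below is a deliberately simple linear scan over the six keys of the example index; the function names mirror the methods for readability but are hypothetical stand-ins, and the sketch covers only the case where at least one key has the requested prefix.

```python
# Keys of the director+producer index, in sorted order.
KEYS = sorted([(0, 43), (18, 8), (7, 5), (16, 1), (0, 1), (0, 2)])


def keys_with_prefix(*prefix):
    """Return the sorted keys whose leading values equal the given prefix."""
    n = len(prefix)
    return [key for key in KEYS if key[:n] == prefix]


def min_key(*prefix):
    return keys_with_prefix(*prefix)[0]


def max_key(*prefix):
    return keys_with_prefix(*prefix)[-1]


print(min_key(), max_key())    # no prefix: smallest and largest key overall
print(min_key(7), max_key(0))  # restricted to keys starting with 7 and with 0
```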
The start and stop arguments to the Index.cursor() method also support this flexible key prefixing. Suppose we wish to find all the Pythons with at least 7 directorial credits:
>>> [r for r in i.cursor(["name", "director", "producer"], start=7)]
[(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]
We get the same answer if we specify 5 or less for the producer column:
>>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 0))]
[(b'Eric Idle', 7, 5), (b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]
But we lose Eric Idle if we specify 6 for the producer column, because his key (7, 5) sorts before the start key (7, 6):
>>> [r for r in i.cursor(["name", "director", "producer"], start=(7, 6))]
[(b'Terry Jones', 16, 1), (b'Terry Gilliam', 18, 8)]
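Python's own tuple comparison is lexicographic, so it gives a convenient way to see why these three start keys behave as they do. In the sketch below, ROWS, KEYS and names_from are hypothetical names, and the use of bisect is our choice for illustration. Note that in Python a shorter tuple sorts before any longer tuple that extends it, so (7,) comes before (7, 5), which matches the prefix behaviour of the examples.

```python
from bisect import bisect_left

# (name, director, producer), sorted by the compound key (director, producer).
ROWS = sorted(
    [
        (b"John Cleese", 0, 43),
        (b"Terry Gilliam", 18, 8),
        (b"Eric Idle", 7, 5),
        (b"Terry Jones", 16, 1),
        (b"Michael Palin", 0, 1),
        (b"Graham Chapman", 0, 2),
    ],
    key=lambda row: (row[1], row[2]),
)
KEYS = [(row[1], row[2]) for row in ROWS]


def names_from(start):
    """Names of rows whose key is >= start; start may be a value or a tuple."""
    if not isinstance(start, tuple):
        start = (start,)  # a single value acts as a singleton tuple
    return [row[0] for row in ROWS[bisect_left(KEYS, start):]]


print(names_from(7))
print(names_from((7, 0)))
print(names_from((7, 6)))
```

The start key is a lower bound on the whole compound key, not a separate filter on each column, which is why Terry Jones, with a single production credit, still appears in the last result.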
Module reference¶
wormtable.open_table(homedir, db_cache_size='16M')¶
Returns a table opened in read mode with cache size set to the specified value. This is the recommended interface when opening tables for reading.
See Cache tuning for details on setting cache sizes. The cache size may be either an integer specifying the size in bytes or a string with the optional suffixes K, M or G.
Parameters:
- homedir (str) – the filesystem path for the wormtable home directory
- db_cache_size (str or int) – the Berkeley DB cache size for the table
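As an illustration of the two accepted forms of cache size, the following sketch converts either an integer or a suffixed string into a number of bytes. The function parse_cache_size is hypothetical and not part of wormtable, and it assumes that K, M and G are binary multiples (1024, 1024 squared, 1024 cubed); the text does not state which multiples wormtable uses.

```python
def parse_cache_size(value):
    """Convert a cache size such as 4096, '512K', '16M' or '1G' to bytes.

    Assumption: suffixes are binary multiples (K = 1024 bytes).
    """
    if isinstance(value, int):
        return value
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = value.strip()
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)  # no suffix: a plain number of bytes


print(parse_cache_size("16M"))
```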
Table class¶
class wormtable.Table¶
The Table class represents a wormtable located in a home directory. The home directory for a table stores the Berkeley DB databases used to store the rows and indexes, along with some metadata to describe these tables. The files within a home directory are not intended to be accessed directly; modifications to a table should be made through this API only.
cursor(columns, start=0, stop=None)¶
Returns a cursor over the rows in this table, retrieving only the specified columns. Rows are returned as tuples, with the value for each column in the same position as the corresponding column in the list of columns provided.
The columns specified may be Column instances, integers or strings. If an integer is provided, the column at the specified position is used, and if a string is provided, the column with the specified identifier is used. These may be mixed arbitrarily.
The start and stop arguments are directly analogous to the built-in range() function. The cursor will iterate over all rows such that start <= row_id < stop. Note that start is inclusive, and stop is exclusive.
open_index(index_name, db_cache_size='16M')¶
Returns an index with the specified name opened in read mode with the specified db_cache_size.
See Cache tuning for details on setting cache sizes. The cache size may be either an integer specifying the size in bytes or a string with the optional suffixes K, M or G.
Parameters:
- index_name (str) – the name of the index to open
- db_cache_size (str or int) – the size of the cache on the index
open(mode)¶
Opens this table in the specified mode. Mode must be one of ‘r’ or ‘w’.
Parameters:
- mode (str) – the mode to open the table in
close()¶
Closes this table, freeing all underlying resources.
columns()¶
Returns the list of columns in this table.
Index class¶
class wormtable.Index¶
Indexes define a sorting order of the rows in a table. An index over a set of columns creates a sorting order over the table by concatenating the values from the columns in question together (the keys) and storing the mapping of these keys to the rows in the table in which the key occurs.
cursor(columns, start='KEY_UNSET', stop='KEY_UNSET')¶
Returns a cursor over the rows in the table in the order defined by this index, retrieving only the specified columns. Rows are returned as tuples, with the value for each column in the same position as the corresponding column in the list of columns provided.
The columns specified may be Column instances, integers or strings. If an integer is provided, the column at the specified position is used, and if a string is provided, the column with the specified identifier is used. These may be mixed arbitrarily.
The start and stop arguments are analogous to the built-in range() function. The cursor will iterate over all rows such that start <= key < stop. Note that start is inclusive, and stop is exclusive. These parameters may specify values for up to n columns, for an n column index. For multiple values, a tuple must be provided; a single value of the relevant type is considered to be the same as a singleton tuple consisting of this value.
Parameters:
- columns (sequence of column identifiers) – columns to retrieve from the table
- start – the key prefix that is less than or equal to all keys in returned rows.
- stop – the key prefix that is greater than all keys in returned rows.
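The rule that column identifiers may be positions or names, mixed arbitrarily, can be sketched as a small resolution step. The helper resolve_positions and the COLUMN_NAMES list are hypothetical and use the example table from the Examples section; Column instances are left out to keep the sketch self-contained, and the errors raised for unknown identifiers are our assumption, not documented wormtable behaviour.

```python
# Column names of the example table, in position order.
COLUMN_NAMES = ["row_id", "name", "born", "writer", "actor", "director", "producer"]


def resolve_positions(columns):
    """Map a mixed list of positions (int) and names (str) to positions."""
    positions = []
    for c in columns:
        if isinstance(c, int):
            if not 0 <= c < len(COLUMN_NAMES):
                raise IndexError("no column at position %d" % c)
            positions.append(c)
        else:
            positions.append(COLUMN_NAMES.index(c))  # ValueError if unknown
    return positions


row = (0, b"John Cleese", 1939, 60, 127, 0, 43)
picked = tuple(row[p] for p in resolve_positions(["name", 2, "producer"]))
print(picked)
```

The values come back in the order the identifiers were given, matching the statement that each value occupies the same position as its column in the list provided.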
open(mode)¶
Opens this index in the specified mode. Mode must be one of ‘r’ or ‘w’.
Parameters:
- mode (str) – the mode to open the index in
close()¶
Closes this Index.
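min_key(*k)¶
Returns the smallest key greater than or equal to the specified prefix.
max_key(*k)¶
Returns the largest index key less than or equal to the specified prefix, where keys that begin with the prefix count as matching it. For example, max_key(0) returns (0, 43) for the director+producer index in the Examples section.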
keys()¶
Returns an iterator over all the keys in this Index in sorted order.
counter()¶
Returns an IndexCounter object for this index. This provides an efficient method of iterating over the keys in the index.
Column class¶
class wormtable.Column¶
Columns define the storage types for values within a table.
get_name()¶
Returns the name of this column. This is the unique identifier for a column.
get_description()¶
Returns the description of this column. This is an optional string describing the purpose of a column.
get_type()¶
Returns the type code for this column. This is one of WT_INT, WT_UINT, WT_FLOAT or WT_CHAR.
get_type_name()¶
Returns the string representation of the type of this Column.
get_element_size()¶
Returns the size of each element in the column in bytes.
get_num_elements()¶
Returns the number of elements in this column. This is either a positive integer >= 1 or WT_VAR1. If the number of elements is WT_VAR1, the number of elements in the column is variable, from 0 to 255.