Tutorial

Release:0.1
Date:April 08, 2015

Wormtable is a method for storing and interacting with large scale tabular datasets. It provides those familiar with python the necessary tools to efficiently store, process and search datasets of essentially unlimited size. It is designed for tabular data with many rows each with a fixed number of columns. Wormtable can generate indexes of the information in these rows which allows the user to quickly access the information. Since the indexes are stored on the disk they can be used repeatedly.

In this tutorial we will be taking you through the steps to convert a file to a wormtable, index columns and perform some basic operations on the data by using a few examples.

Before you get started make sure you have installed Berkeley DB and then Wormtable. For more details see the installation instructions.

Throughout this tutorial, code lines beginning $ imply a bash shell and >>> imply a python shell.

The VCF format

Throughout this tutorial we will be using a Variant Call Format (VCF) table. This is a common format for storing DNA sequence and polymorphism data from high throughput genome sequencing projects. In this format rows are individual positions of a genome and are identified by the chromosome they occur on and the position on that chromosome. A variety of other pieces of information are stored as metadata in the proceeding columns. We will explain the relevant columns as they arise. For more information please consult the full specifications of a VCF file on the 1000 genomes website.

In the following examples we will be working with the following example VCF file with only 5 genomic positions taken from the 1000 genomes website.

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

Convert a VCF to Wormtable format

To convert a standard VCF file to Wormtable format use the provided utility vcf2wt. Like other utilities provided with Wormtable the description of command line options and arguments can be displayed using “–help” flag

$ vcf2wt --help

Building a wormtable from a vcf file is easy:

$ vcf2wt sample.vcf sample.wt

In this command the VCF file (sample.vcf) converted into a wormtable stored in the directory sample.wt. If the directory already exists you will have to use the “–force” (or -f) argument to tell vcf2wt to overwrite the old wormtable:

$ vcf2wt -f sample.vcf sample.wt

Warning

Wormtable does not currently support very long strings, so it may be necessary to truncate the ALT and REF columns when converting a VCF. Use the --truncate option to vcf2wt to do this.

Using a cursor

Now that we have built our wormtable we can use the Python wormtable module (within a python shell) to interact with it:

>>> import wormtable
>>> table = wormtable.open_table('sample.wt') # open the wormtable

A convenient feature of Wormtable is the Table.cursor() method which allows us to retrieve information from any column in the table. In our case, we use a cursor to return the genome position column “CHROM” and “POS”. The cursor allows us to walk through the wormtable row by row

>>> for row in table.cursor(['CHROM', 'POS']):
...     print(row)
...
(b'20', 14370)
(b'20', 17330)
(b'20', 1110696)
(b'20', 1230237)
(b'20', 1234567)

Note that since we can retrieve information from multiple columns, the names of the columns we want to retrieve are passed to the cursor as a list.

Warning

All character data in wormtable is returned as bytes values. For Python 3 users, this means they are not the same as strings, but must be decoded. For Python 2 users, there is no distinction between bytes and strings.

Building an index

To fully exploit a wormtable, it is necessary to index the columns that you are interested in. Indexes provide a way to quickly and efficiently access information from the wormtable based on the values in the indexed column.

In the following example, we’ll demonstrate how it is possible to access the DNA sequence of the reference genome (which is stored in the “REF” column) for any position in the genome by creating an index on genomic position. Adding an index for a column can be accomplished with the wtadmin utility. In this example, to index the position column called “POS” we use:

$ wtadmin add sample.wt POS

Here, sample.wt is the “home directory” which contains our wormtable and POS is the name of the column to be indexed. This utility also allows us to remove indexes (wtadmin rm) or list the columns already indexed (wtadmin ls). If you want to list the columns that are available to index use

$ wtadmin show sample.wt
==============================================================
       name         type     size   n        |   description
==============================================================
   0   row_id       uint        5   1        |   Primary key column
   1   CHROM        char        1   var(1)   |   chromosome: an identifier from the reference genome or an angle-bracketed ID String ("<ID>") pointing to a contig in the assembly file
   2   POS          uint        5   1        |   position: The reference position, with the 1st base having position 1
   3   ID           char        1   var(1)   |   semi-colon separated list of unique identifiers where available
   4   REF          char        1   var(1)   |   reference base(s): Each base must be one of A,C,G,T,N (case insensitive)
   5   ALT          char        1   var(1)   |   comma separated list of alternate non-reference allelescalled on at least one of the samples
   6   QUAL         float       4   1        |   phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong).
   7   FILTER       char        1   var(1)   |   PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail.
   8   INFO.NS      int         4   1        |   Number of Samples With Data
   9   INFO.DP      int         4   1        |   Total Depth
  10   INFO.AF      float       4   var(1)   |   Allele Frequency
  11   INFO.AA      char        1   var(1)   |   Ancestral Allele
  12   INFO.DB      uint        1   1        |   dbSNP membership, build 129
  13   INFO.H2      uint        1   1        |   HapMap2 membership
  14   NA00001.GT   char        1   var(1)   |   Genotype
  15   NA00001.GQ   int         4   1        |   Genotype Quality
  16   NA00001.DP   int         4   1        |   Read Depth
  17   NA00001.HQ   int         4   2        |   Haplotype Quality
  18   NA00002.GT   char        1   var(1)   |   Genotype
  19   NA00002.GQ   int         4   1        |   Genotype Quality
  20   NA00002.DP   int         4   1        |   Read Depth
  21   NA00002.HQ   int         4   2        |   Haplotype Quality
  22   NA00003.GT   char        1   var(1)   |   Genotype
  23   NA00003.GQ   int         4   1        |   Genotype Quality
  24   NA00003.DP   int         4   1        |   Read Depth
  25   NA00003.HQ   int         4   2        |   Haplotype Quality

Note that fields within the INFO column and the columns corresponding for individual samples have been represented as separate columns and named as [COLUMN].[FIELD]. This allows the user to create indexes on individual fields from these compound columns.

When building an index over a large table it can be useful to set the cache size to improve performance. A large cache size allows more of the index to fit into memory, therefore making the process more efficient.

$ wtadmin add --cache-size=4G sample.wt POS

Using an index

Now that we have built our wormtable and indexed on POS we can retrieve information from any position in the genome

>>> import wormtable
>>> table = wormtable.open_table('sample.wt') # open the wormtable
>>> position_index = table.open_index('POS')  # open the index on POS

Note that if you have not already added the index using wtadmin add you will not be able to open the index in python. Also, worth noting is that, like cache sizes when building tables or adding indexes, we can assign memory to both the table and index when we open them by including the db_cache_size as a second argument in open_table() or Table.open_index(). For more details see the sections on performance tuning. The wormtable module offers a number of methods to interact with an Index

>>> # Print the minimum and maximum keys in an index
>>> position_index.min_key()
14370
>>> position_index.max_key()
1234567
>>> # Use keys() to iterate through sorted value in the index
>>> for k in position_index.keys():
...     print(k)
...
14370
17330
1110696
1230237
1234567

The Index class also provides a Index.cursor() method to iterate over rows in a table. In the case of an index, however, we visit the rows in the order defined by the index and can access rows based on the index keys using the start and stop parameters. For example, to retrieve the reference nucleotides we can use a cursor to return the REF column for genomic positions in the range 1–1150000 using:

>>> for p in position_index.cursor(["REF"], start=1, stop=1150000):
...     print(p[0])
...
b'G'
b'T'
b'A'

Note that the cursor always returns a tuple and we just print the first element here. It is also worth noting that like other ranges in Python, the maximum value is not included. For example, 1 to 100 would return 1 to 99 and not include 100.

Creating compound indexes

With multiple chromosomes, the example above could give multiple values for each position because the POS column is not normally a unique identifier of genomic position and our cursor will iterate over positions matching the range specified from multiple chromosomes. To deal with this we can can make compound indexes. Compound indexes allow the user identify all combinations of multiple columns from the wormtable. For example, we can make a compound index of chromosome (CHROM) and position (POS) to retrieve unique genomic positions. To add a compound column we can again use the wtadmin utility

$ wtadmin add sample.wt CHROM+POS

The names of multiple columns in a compound index are joined using “+” which indicates to wtadmin to make a compound index. It is important to realise that the order that the columns are listed matters (CHROM+POS does not equal POS+CHROM). With this new compound index we can specify a region of the genome (chromosome 20, positions 1 to 1150000) unambiguously and iterate through rows in this region, printing CHROM, POS and REF for each:

>>> import wormtable
>>> table = wormtable.open_table('sample.wt')
>>> chrompos_index = table.open_index('CHROM+POS')
>>> cols = ["CHROM", "POS", "REF"]
>>> for c, p, r in chrompos_index.cursor(cols, start=("20", 1), stop=("20", 1150000)):
...     print(c, p, r)
...
b'20' 14370 b'G'
b'20' 17330 b'T'
b'20' 1110696 b'A'

Since we need to specify values for several columns, the start and stop arguments are tuples.

Using a counter

Another useful feature of Wormtable is the ability to count the number of items matching unique keys in an index. A Counter is a dictionary-like object where the keys are index values which refer to the number of times that key occurs in the table. For example, we can quickly and efficiently calculate the fraction of reference sites that are G or C (the GC content) by first creating an index on the REF column:

$ wtadmin add sample.wt REF

Then in python:

>>> import wormtable
>>> table = wormtable.open_table('sample.wt')
>>> ref_index = table.open_index('REF')
>>> ref_counts = ref_index.counter()
>>> gc = ref_counts[b'G'] + ref_counts[b'C']
>>> tot = gc + ref_counts[b'T'] + ref_counts[b'A']
>>> gc / tot
0.25

Using binned indexes

Some columns in a VCF contain floats and can therefore have a huge number of distinct values. In these cases it is useful to condense similar values into ‘binned’ indexes. For example, in a VCF the column which records the quality of a row (QUAL column) is a float which may range from 0 to 10,000 (or more). For the purposes of filtering on this column (i.e. creating an index) it may not be necessary to discern between sites with quality of 50.1 from sites with quality of 50.2. Using wtadmin you can index a column binning indexes into equal sized bins of size n like this

$ wtadmin add sample.wt QUAL[n]

where n is an integer or float. This will make a new index on QUAL where all the QUAL values are grouped into bins of size n. We can then use this binned index to interact with our wormtable and print the number of rows matching QUAL scores in bins between 0 and 70 using the Index.counter() function. For example, to create an index with bin size 5, we use:

$ wtadmin add sample.wt QUAL[5]

Then, we can quickly count the number of rows falling into each bin:

>>> qual_5_index = table.open_index('QUAL[5]')
>>> qual_5_counter = qual_5_index.counter()
>>> for q in range(0, 70, 5):
...     print(q, "\t", qual_5_counter[q])
...
0    1
5    0
10   0
15   0
20   0
25   1
30   0
35   0
40   0
45   1
50   1
55   0
60   0
65   1

Examples

Along with the main program we have included a number of example scripts which will help you get started with using Wormtable. The full scripts are available should you want to use or modify the example scripts for your own purposes.

Counting index keys

In this example we use an index counter to get an iterator over the keys and their counts in an index. We use the context manager protocol (the with statement) to ensure that the table and index are closed when we finish.

import wormtable as wt

def count_distinct(homedir, index):
    with wt.open_table(homedir) as t, t.open_index(index) as i:
        for k,v in i.counter().items():
            yield k, v

Using this function we can easily print out all of the values in the REF column and their counts:

>>> for k, v in count_distinct("sample.wt", "REF"):
...     print(k, "\t", v)
...
b'A'     1
b'G'     1
b'GTC'   1
b'T'     2

This functionality is also provided by the wtadmin hist command:

$ wtadmin hist sample.wt REF
# n REF
1    A
1    G
1    GTC
2    T

Transition-Transversion ratio

This example uses a compound index of the reference nucleotide (REF) and the alternate nucleotide (ALT) to count the number of transitions (changes A <-> G or C <-> T) and transversions (A or G <-> C or T). Using the counter feature this task can be very fast with Wormtable. First we use Python’s itertools to generate a list of all possible single bases changes (ie all pairs of A,C,G and T). We then count the number of instances of each change in our data

import wormtable
from itertools import permutations
def count_Ts_Tv(homedir):
    """
    Count number of of transitions and transversions using an index on REF+ALT
    """
    subs = [p for p in permutations([b'A',b'C',b'G',b'T'], 2)]
    bases = {b'A':'purine', b'G':'purine', b'C':'pyrimidine', b'T':'pyrimidine'}
    t = wormtable.open_table(homedir)
    i = t.open_index("REF+ALT")
    Ts, Tv = 0, 0
    c = i.counter()
    for s in subs:
        if bases[s[0]] == bases[s[1]]:
            Ts += c[s]
        else:
            Tv += c[s]
    i.close()
    t.close()
    return Ts, Tv

we can then use this function to very quickly count the number of transitions and transversions:

>>> count_Ts_Tv('sample.wt')
(1, 1)

High Quality SNPs

In this example we wish to examine the sites in a VCF that have a quality score over a particular minimum threshold. The function uses a QUAL index where QUAL scores have been grouped into bins of width 1 (QUAL[1]), and returns an iterator over all of the rows that fulfil the given quality requirements.

import wormtable as wt
def hq_snps(homedir, minq, cols):
    with wt.open_table(homedir) as t, t.open_index("QUAL[1]") as i:
        for row in i.cursor(cols, start=minq):
            yield row

First we must create the required index:

$ wtadmin add sample.wt QUAL[1]

We can then use this function in to iterate over the rows of interest:

>>> for row in hq_snps('sample.wt', 30, ['CHROM', 'POS', 'REF', 'ALT', 'QUAL']):
...     print(row)
...
(b'20', 1230237, b'T', b'', 47.0)
(b'20', 1234567, b'GTC', b'G,GTCT', 50.0)
(b'20', 1110696, b'A', b'G,T', 67.0)

VCF-Utilities

We have also provided three utilities (in the directory examples/vcf-utils) which will allow a user to use wormtable with VCF format files immediately. These scripts demonstrate the efficiency of using Wormtable with VCF files and are described briefly below.

snp-filter.py

This script runs through a VCF file (using a CHROM+POS compound index) and allows the user to extract (a semicolon separated list of) specific VCF fields using an arbitrary set of filters on numeric or text columns. For example, to find variants with a QUAL score > 500, depth of coverage (stored as DP in the INFO column) > 20, a genotype in sample “S1” of “0/1” and print out CHROM and POS for variants in a wormtable stored in sample.wt, the user can use the following call

snp-filter.py -f 'QUAL>500;INFO.DP>20;S1.GT==0/1' CHROM,POS sample.wt

The user can also optionally specify a particular region of the VCF using the CHROM:START-END syntax and either exclude, include or find indels.

sliding-mean.py

This script takes a comma separated list of numeric columns and the home directory containing the wormtable and will then calculate the mean of these numeric columns within non-overlapping windows (using an optionally specified window size and list of chromosomes). The output is in tab separated column format allowing the results to be easily plotted. For example, to calculate the mean of QUAL and depth of coverage (INFO.DP) in window sizes of 1Mb for chromosomes 1,2 and 3 from a wormtable stored in sample.wt, run

sliding-mean.py QUAL,INFO.DP 20 -w 1000000 sample.wt

hq-snps-bygt.py

This script takes a sample name and a specific genotype code, then builds a compound index on the sample genotype columns and quality score allowing the user to find, for example, high quality heterozygotes for the first sample. For example, to very efficiently obtain high quality heterozygotes (QUAL>10000) from sample NA00001, run

hq-snps-bygt.py -s NA00001 -g '0/1' -q 50 sample.wt