Tutorial¶
Release: | 0.1 |
---|---|
Date: | April 08, 2015 |
Wormtable is a method for storing and interacting with large scale tabular datasets. It provides those familiar with python the necessary tools to efficiently store, process and search datasets of essentially unlimited size. It is designed for tabular data with many rows each with a fixed number of columns. Wormtable can generate indexes of the information in these rows which allows the user to quickly access the information. Since the indexes are stored on the disk they can be used repeatedly.
In this tutorial we will be taking you through the steps to convert a file to a wormtable, index columns and perform some basic operations on the data by using a few examples.
Before you get started make sure you have installed Berkeley DB and then Wormtable. For more details see the installation instructions.
Throughout this tutorial, code lines beginning $
imply a bash shell and
>>>
imply a python shell.
The VCF format¶
Throughout this tutorial we will be using a Variant Call Format (VCF) table. This is a common format for storing DNA sequence and polymorphism data from high throughput genome sequencing projects. In this format rows are individual positions of a genome and are identified by the chromosome they occur on and the position on that chromosome. A variety of other pieces of information are stored as metadata in the proceeding columns. We will explain the relevant columns as they arise. For more information please consult the full specifications of a VCF file on the 1000 genomes website.
In the following examples we will be working with the following example VCF file with only 5 genomic positions taken from the 1000 genomes website.
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
Convert a VCF to Wormtable format¶
To convert a standard VCF file to Wormtable format use the provided utility
vcf2wt
. Like other utilities provided with Wormtable the description of
command line options and arguments can be displayed using “–help” flag
$ vcf2wt --help
Building a wormtable from a vcf file is easy:
$ vcf2wt sample.vcf sample.wt
In this command the VCF file (sample.vcf) converted into a wormtable stored in the directory sample.wt. If the directory already exists you will have to use the “–force” (or -f) argument to tell vcf2wt to overwrite the old wormtable:
$ vcf2wt -f sample.vcf sample.wt
Warning
Wormtable does not currently support very long strings, so it
may be necessary to truncate the ALT
and REF
columns when converting
a VCF. Use the --truncate
option to vcf2wt
to do this.
Using a cursor¶
Now that we have built our wormtable we can use the Python wormtable
module
(within a python shell) to interact with it:
>>> import wormtable
>>> table = wormtable.open_table('sample.wt') # open the wormtable
A convenient feature of Wormtable is the Table.cursor()
method
which allows us to retrieve information from any column in the table. In
our case, we use a cursor to return the genome position column “CHROM”
and “POS”. The cursor allows us to walk through the wormtable row by row
>>> for row in table.cursor(['CHROM', 'POS']):
... print(row)
...
(b'20', 14370)
(b'20', 17330)
(b'20', 1110696)
(b'20', 1230237)
(b'20', 1234567)
Note that since we can retrieve information from multiple columns, the names of the columns we want to retrieve are passed to the cursor as a list.
Warning
All character data in wormtable is returned as bytes values. For Python 3 users, this means they are not the same as strings, but must be decoded. For Python 2 users, there is no distinction between bytes and strings.
Building an index¶
To fully exploit a wormtable, it is necessary to index the columns that you are interested in. Indexes provide a way to quickly and efficiently access information from the wormtable based on the values in the indexed column.
In the following example, we’ll demonstrate how it is possible to access the
DNA sequence of the reference genome (which is stored in the “REF” column)
for any position in the genome by creating an index on genomic position. Adding
an index for a column can be accomplished with the wtadmin
utility. In this
example, to index the position column called “POS” we use:
$ wtadmin add sample.wt POS
Here, sample.wt
is the “home directory” which contains our wormtable and POS
is the name of the column to be indexed. This utility also allows us to remove
indexes (wtadmin rm
) or list the columns already indexed (wtadmin ls
).
If you want to list the columns that are available to index use
$ wtadmin show sample.wt
==============================================================
name type size n | description
==============================================================
0 row_id uint 5 1 | Primary key column
1 CHROM char 1 var(1) | chromosome: an identifier from the reference genome or an angle-bracketed ID String ("<ID>") pointing to a contig in the assembly file
2 POS uint 5 1 | position: The reference position, with the 1st base having position 1
3 ID char 1 var(1) | semi-colon separated list of unique identifiers where available
4 REF char 1 var(1) | reference base(s): Each base must be one of A,C,G,T,N (case insensitive)
5 ALT char 1 var(1) | comma separated list of alternate non-reference allelescalled on at least one of the samples
6 QUAL float 4 1 | phred-scaled quality score for the assertion made in ALT. i.e. -10log_10 prob(call in ALT is wrong).
7 FILTER char 1 var(1) | PASS if this position has passed all filters, i.e. a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail.
8 INFO.NS int 4 1 | Number of Samples With Data
9 INFO.DP int 4 1 | Total Depth
10 INFO.AF float 4 var(1) | Allele Frequency
11 INFO.AA char 1 var(1) | Ancestral Allele
12 INFO.DB uint 1 1 | dbSNP membership, build 129
13 INFO.H2 uint 1 1 | HapMap2 membership
14 NA00001.GT char 1 var(1) | Genotype
15 NA00001.GQ int 4 1 | Genotype Quality
16 NA00001.DP int 4 1 | Read Depth
17 NA00001.HQ int 4 2 | Haplotype Quality
18 NA00002.GT char 1 var(1) | Genotype
19 NA00002.GQ int 4 1 | Genotype Quality
20 NA00002.DP int 4 1 | Read Depth
21 NA00002.HQ int 4 2 | Haplotype Quality
22 NA00003.GT char 1 var(1) | Genotype
23 NA00003.GQ int 4 1 | Genotype Quality
24 NA00003.DP int 4 1 | Read Depth
25 NA00003.HQ int 4 2 | Haplotype Quality
Note that fields within the INFO column and the columns corresponding for individual samples have been represented as separate columns and named as [COLUMN].[FIELD]. This allows the user to create indexes on individual fields from these compound columns.
When building an index over a large table it can be useful to set the cache size to improve performance. A large cache size allows more of the index to fit into memory, therefore making the process more efficient.
$ wtadmin add --cache-size=4G sample.wt POS
Using an index¶
Now that we have built our wormtable and indexed on POS we can retrieve information from any position in the genome
>>> import wormtable
>>> table = wormtable.open_table('sample.wt') # open the wormtable
>>> position_index = table.open_index('POS') # open the index on POS
Note that if you have not already added the index using wtadmin add
you will not
be able to open the index in python. Also, worth noting is that, like cache sizes
when building tables or adding indexes, we can assign memory to both the table
and index when we open them by including the db_cache_size
as a second argument in
open_table()
or Table.open_index()
. For more details see
the sections on performance tuning.
The wormtable module offers a number of methods to interact with an
Index
>>> # Print the minimum and maximum keys in an index
>>> position_index.min_key()
14370
>>> position_index.max_key()
1234567
>>> # Use keys() to iterate through sorted value in the index
>>> for k in position_index.keys():
... print(k)
...
14370
17330
1110696
1230237
1234567
The Index
class also provides a Index.cursor()
method
to iterate over rows in a table. In the case of an index, however,
we visit the rows in the order defined by the index and can access
rows based on the index keys using the start and stop parameters.
For example, to retrieve the reference nucleotides we can use a cursor to return the REF
column for genomic positions in the range 1–1150000 using:
>>> for p in position_index.cursor(["REF"], start=1, stop=1150000):
... print(p[0])
...
b'G'
b'T'
b'A'
Note that the cursor always returns a tuple and we just print the first element here. It is also worth noting that like other ranges in Python, the maximum value is not included. For example, 1 to 100 would return 1 to 99 and not include 100.
Creating compound indexes¶
With multiple chromosomes, the example above could give multiple values for each position
because the POS column is not normally a unique identifier of genomic position and our
cursor will iterate over positions matching the range specified from multiple
chromosomes. To deal with this we can can make compound indexes. Compound
indexes allow the user identify all combinations of multiple columns from the
wormtable. For example, we can make a compound index of chromosome (CHROM) and
position (POS) to retrieve unique genomic positions. To add a compound column
we can again use the wtadmin
utility
$ wtadmin add sample.wt CHROM+POS
The names of multiple columns in a compound index are joined using “+” which
indicates to wtadmin
to make a compound index. It is important to realise that
the order that the columns are listed matters (CHROM+POS does not equal
POS+CHROM). With this new compound index we can specify a region of the genome
(chromosome 20, positions 1 to 1150000) unambiguously and iterate
through rows in this region, printing CHROM, POS and REF for each:
>>> import wormtable
>>> table = wormtable.open_table('sample.wt')
>>> chrompos_index = table.open_index('CHROM+POS')
>>> cols = ["CHROM", "POS", "REF"]
>>> for c, p, r in chrompos_index.cursor(cols, start=("20", 1), stop=("20", 1150000)):
... print(c, p, r)
...
b'20' 14370 b'G'
b'20' 17330 b'T'
b'20' 1110696 b'A'
Since we need to specify values for several columns, the start and stop arguments are tuples.
Using a counter¶
Another useful feature of Wormtable is the ability to count the number of items
matching unique keys in an index. A Counter
is a dictionary-like
object where the keys are index values which refer to the number of times that
key occurs in the table. For example, we can quickly and efficiently calculate the
fraction of reference sites that are G or C (the GC content) by first creating
an index on the REF column:
$ wtadmin add sample.wt REF
Then in python:
>>> import wormtable
>>> table = wormtable.open_table('sample.wt')
>>> ref_index = table.open_index('REF')
>>> ref_counts = ref_index.counter()
>>> gc = ref_counts[b'G'] + ref_counts[b'C']
>>> tot = gc + ref_counts[b'T'] + ref_counts[b'A']
>>> gc / tot
0.25
Using binned indexes¶
Some columns in a VCF contain floats and can therefore have a huge number of
distinct values. In these cases it is useful to condense similar values into
‘binned’ indexes. For example, in a VCF the column which records the quality of
a row (QUAL column) is a float which may range from 0 to 10,000 (or more). For the
purposes of filtering on this column (i.e. creating an index) it may not be
necessary to discern between sites with quality of 50.1 from sites with quality
of 50.2. Using wtadmin
you can index a column binning indexes into equal sized
bins of size n
like this
$ wtadmin add sample.wt QUAL[n]
where n is an integer or float. This will make a new index on QUAL where all the QUAL
values are grouped into bins of size n. We can then use this binned index
to interact with our wormtable and print the number of rows matching QUAL scores
in bins between 0 and 70 using the Index.counter()
function.
For example, to create an index with bin size 5, we use:
$ wtadmin add sample.wt QUAL[5]
Then, we can quickly count the number of rows falling into each bin:
>>> qual_5_index = table.open_index('QUAL[5]')
>>> qual_5_counter = qual_5_index.counter()
>>> for q in range(0, 70, 5):
... print(q, "\t", qual_5_counter[q])
...
0 1
5 0
10 0
15 0
20 0
25 1
30 0
35 0
40 0
45 1
50 1
55 0
60 0
65 1
Examples¶
Along with the main program we have included a number of example scripts which will help you get started with using Wormtable. The full scripts are available should you want to use or modify the example scripts for your own purposes.
Counting index keys¶
In this example we use an index counter to get an iterator over the keys
and their counts in an index. We use the
context manager protocol
(the with
statement) to ensure that the table and index
are closed when we finish.
import wormtable as wt
def count_distinct(homedir, index):
with wt.open_table(homedir) as t, t.open_index(index) as i:
for k,v in i.counter().items():
yield k, v
Using this function we can easily print out all of the values in the REF column and their counts:
>>> for k, v in count_distinct("sample.wt", "REF"):
... print(k, "\t", v)
...
b'A' 1
b'G' 1
b'GTC' 1
b'T' 2
This functionality is also provided by the wtadmin hist
command:
$ wtadmin hist sample.wt REF
# n REF
1 A
1 G
1 GTC
2 T
Transition-Transversion ratio¶
This example uses a compound index of the reference nucleotide (REF) and the alternate
nucleotide (ALT) to count the number of transitions (changes A <-> G or C <-> T) and
transversions (A or G <-> C or T). Using the counter feature this task can be very fast
with Wormtable. First we use Python’s itertools
to generate a list of all possible
single bases changes (ie all pairs of A,C,G and T). We then count the number of
instances of each change in our data
import wormtable
from itertools import permutations
def count_Ts_Tv(homedir):
"""
Count number of of transitions and transversions using an index on REF+ALT
"""
subs = [p for p in permutations([b'A',b'C',b'G',b'T'], 2)]
bases = {b'A':'purine', b'G':'purine', b'C':'pyrimidine', b'T':'pyrimidine'}
t = wormtable.open_table(homedir)
i = t.open_index("REF+ALT")
Ts, Tv = 0, 0
c = i.counter()
for s in subs:
if bases[s[0]] == bases[s[1]]:
Ts += c[s]
else:
Tv += c[s]
i.close()
t.close()
return Ts, Tv
we can then use this function to very quickly count the number of transitions and transversions:
>>> count_Ts_Tv('sample.wt')
(1, 1)
High Quality SNPs¶
In this example we wish to examine the sites in a VCF that have a quality score over a particular minimum threshold. The function uses a QUAL index where QUAL scores have been grouped into bins of width 1 (QUAL[1]), and returns an iterator over all of the rows that fulfil the given quality requirements.
import wormtable as wt
def hq_snps(homedir, minq, cols):
with wt.open_table(homedir) as t, t.open_index("QUAL[1]") as i:
for row in i.cursor(cols, start=minq):
yield row
First we must create the required index:
$ wtadmin add sample.wt QUAL[1]
We can then use this function in to iterate over the rows of interest:
>>> for row in hq_snps('sample.wt', 30, ['CHROM', 'POS', 'REF', 'ALT', 'QUAL']):
... print(row)
...
(b'20', 1230237, b'T', b'', 47.0)
(b'20', 1234567, b'GTC', b'G,GTCT', 50.0)
(b'20', 1110696, b'A', b'G,T', 67.0)
VCF-Utilities¶
We have also provided three utilities (in the directory examples/vcf-utils) which will allow a user to use wormtable with VCF format files immediately. These scripts demonstrate the efficiency of using Wormtable with VCF files and are described briefly below.
snp-filter.py¶
This script runs through a VCF file (using a CHROM+POS compound index) and allows the user to extract (a semicolon separated list of) specific VCF fields using an arbitrary set of filters on numeric or text columns. For example, to find variants with a QUAL score > 500, depth of coverage (stored as DP in the INFO column) > 20, a genotype in sample “S1” of “0/1” and print out CHROM and POS for variants in a wormtable stored in sample.wt, the user can use the following call
snp-filter.py -f 'QUAL>500;INFO.DP>20;S1.GT==0/1' CHROM,POS sample.wt
The user can also optionally specify a particular region of the VCF using the CHROM:START-END syntax and either exclude, include or find indels.
sliding-mean.py¶
This script takes a comma separated list of numeric columns and the home directory containing the wormtable and will then calculate the mean of these numeric columns within non-overlapping windows (using an optionally specified window size and list of chromosomes). The output is in tab separated column format allowing the results to be easily plotted. For example, to calculate the mean of QUAL and depth of coverage (INFO.DP) in window sizes of 1Mb for chromosomes 1,2 and 3 from a wormtable stored in sample.wt, run
sliding-mean.py QUAL,INFO.DP 20 -w 1000000 sample.wt
hq-snps-bygt.py¶
This script takes a sample name and a specific genotype code, then builds a compound index on the sample genotype columns and quality score allowing the user to find, for example, high quality heterozygotes for the first sample. For example, to very efficiently obtain high quality heterozygotes (QUAL>10000) from sample NA00001, run
hq-snps-bygt.py -s NA00001 -g '0/1' -q 50 sample.wt