Performance tuning¶
Release: 0.1
Date: April 08, 2015
One of the main goals of wormtable is to provide a high performance platform for processing tables that is also easy to use. While wormtable should be quite efficient in most cases, there are a few things that can improve performance considerably when working with very large tables.
Schema tuning¶
To get the best performance possible from wormtable, it is important that the rows are as small as possible. This is done either by using the smallest possible type for each column, or by discarding a column entirely if it is not needed. Schemas generated by vcf2wt for VCF files are conservative, and can often be improved on considerably by some hand tuning. By shaving a few tens of bytes off each row we can sometimes make the table gigabytes smaller: saving 20 bytes per row over 100 million rows, for example, shrinks the table by 2GB. Since the main thing that slows wormtable down is reading data off disk, the smaller the table, the faster it is.
To tune the schema of a VCF file, we must first use the --generate option of vcf2wt to generate a candidate schema:
$ vcf2wt -g data.vcf schema.xml
The schema.xml file then contains the automatically generated (conservative) schema. We can edit this schema, changing and deleting columns as we see fit. Once we're happy with it, we build the table from the new schema using the --schema option:
$ vcf2wt -s schema.xml data.vcf data.wt
There are many ways in which we can make a more compact schema. In the following, we'll work with a VCF file from the Drosophila Genetic Reference Panel (available here), improving on the schema automatically generated by vcf2wt. Consider, for example, the following segment of the generated schema:
<column description="Genotype" element_size="1" element_type="char" name="DGRP-021.GT" num_elements="var(1)"/>
<column description="number of supporting reads" element_size="4" element_type="int" name="DGRP-021.SR" num_elements="1"/>
<column description="number of opposing reads" element_size="4" element_type="int" name="DGRP-021.OR" num_elements="1"/>
<column description="Genotype quality" element_size="4" element_type="int" name="DGRP-021.GQ" num_elements="1"/>
There are several ways in which this schema is less than optimal. Firstly, we know that this VCF is for diploids, and so the genotype columns (e.g. DGRP-021.GT above) are always exactly three characters long. Yet in this schema the number of elements is var(1), allowing variable sized strings to be stored in the column. Variable sized columns have an overhead of three bytes above the actual values stored, so in this case we are using twice as many bytes as we need. To rectify this, we change num_elements to 3:
<column description="Genotype" element_size="1" element_type="char" name="DGRP-021.GT" num_elements="3"/>
The next two columns hold the number of supporting reads and the number of opposing reads at a variant site, and these are stored using four byte integers. The columns DGRP-021.SR and DGRP-021.OR can therefore store integers in the range -2147483647 to 2147483647 (see Integer columns). This range is far too large. A more suitable type for these columns is the 2 byte unsigned integer, giving a range of 0 to 65534 (see Unsigned integer columns). The new lines look like:
<column description="number of supporting reads" element_size="2" element_type="uint" name="DGRP-021.SR" num_elements="1"/>
<column description="number of opposing reads" element_size="2" element_type="uint" name="DGRP-021.OR" num_elements="1"/>
Finally, we see that the genotype quality column is also a four byte signed integer, when a 1 byte unsigned integer suffices to store the values in this VCF:
<column description="Genotype quality" element_size="1" element_type="int" name="DGRP-021.GQ" num_elements="1"/>
We have saved a total of 10 bytes over the default schema by making these changes: 3 bytes on the genotype column, 2 bytes on each of the two read count columns, and 3 bytes on the genotype quality column. This hardly seems worth the effort, but it is in fact quite significant for two reasons. Firstly, every byte we save per row really does count when we are storing millions of rows. Secondly, there are 205 genotypes in this VCF, so applying these changes to all of the relevant columns saves 2050 bytes per row.
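Before shrinking a column, it is worth checking that the values actually observed fit in the smaller type. The sketch below is our own helper, not part of wormtable: it scans a plain-text VCF and reports the largest value seen for a given FORMAT field, so we can confirm that, for example, SR and OR fit in a 2 byte unsigned integer and GQ fits in a single byte.
import sys

def max_format_value(vcf_path, field):
    # Return the largest integer seen for a FORMAT field across all samples.
    largest = None
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines
            cols = line.rstrip("\n").split("\t")
            fmt = cols[8].split(":")  # the FORMAT column
            if field not in fmt:
                continue
            k = fmt.index(field)
            for sample in cols[9:]:
                parts = sample.split(":")
                if k < len(parts) and parts[k] not in (".", ""):
                    value = int(parts[k])
                    if largest is None or value > largest:
                        largest = value
    return largest

if __name__ == "__main__":
    for field in ["SR", "OR", "GQ"]:
        print(field, max_format_value(sys.argv[1], field))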
Another way in which we can save space is to delete columns that we are not interested in or that don't contain any information. For example, in the Drosophila VCF above, the ID and QUAL columns contain only missing values, and the FILTER column contains only PASS. We can simply delete these columns from the schema, saving another 14 bytes per row.
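One way to find such columns is to build a table with the conservative schema first and inspect it with wormtable itself. The sketch below uses the documented open_table and cursor interface to collect the distinct values seen in the ID, QUAL and FILTER columns; a column with a single distinct value (or only missing values) carries no information and can be deleted from the schema before rebuilding.
import wormtable as wt

table = wt.open_table("data.wt")
distinct = {"ID": set(), "QUAL": set(), "FILTER": set()}
# Scan the whole table, accumulating the values seen in each column.
for ident, qual, filt in table.cursor(["ID", "QUAL", "FILTER"]):
    distinct["ID"].add(ident)
    distinct["QUAL"].add(qual)
    distinct["FILTER"].add(filt)
table.close()
for name, values in distinct.items():
    print(name, values)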
This tweaking makes a considerable difference. The source VCF file is 2.8GB when gzip compressed, and 15GB uncompressed. When we use the automatic schema from vcf2wt, the resulting wormtable data file is 21.4GB. When we make the changes mentioned above, however, the data file requires only 9.7GB.
Half precision floats¶
Half precision floats provide a useful means of saving space when we have a lot of floating point data. A good example is the set of VCF files from the 1000 Genomes project. These VCF files have a very large number of samples, and use floating point columns for each sample. For example, for this VCF, vcf2wt generates the following schema fragment:
<column description="Genotype" element_size="1" element_type="char" name="HG00096.GT" num_elements="var(1)"/>
<column description="Genotype dosage from MaCH/Thunder" element_size="4" element_type="float" name="HG00096.DS" num_elements="1"/>
<column description="Genotype Likelihoods" element_size="4" element_type="float" name="HG00096.GL" num_elements="var(1)"/>
Each of the .DS and .GL columns uses 4 byte floating point values, even though the input values are small and have very low precision. In this case, half precision floats are perfect, and save a great deal of space. Changing the variable length columns to fixed length columns again and using 2 byte floats, we get the following schema fragment:
<column description="Genotype" element_size="1" element_type="char" name="HG00096.GT" num_elements="3"/>
<column description="Genotype dosage from MaCH/Thunder" element_size="2" element_type="float" name="HG00096.DS" num_elements="1"/>
<column description="Genotype Likelihoods" element_size="2" element_type="float" name="HG00096.GL" num_elements="3"/>
Applying these changes to all samples makes a considerable difference: using the default schema, the wormtable data file is 77GB, but using the modified schema gives us a data file of 34GB. It should be emphasised that there is no loss of information in this case: all the floating point values in the input VCF have at most three decimal places of precision, and these survive the round trip through half precision intact.
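This is easy to verify for a given set of values. The following sketch uses numpy's float16 type (the sample values here are illustrative, not taken from the VCF): it converts the values to half precision and back, and checks that rounding to three decimal places recovers the originals.
import numpy as np

values = np.array([0.0, 0.05, 0.123, 1.999], dtype=np.float64)
# Convert to half precision and back to double precision.
halves = values.astype(np.float16).astype(np.float64)
# Rounding to three decimal places recovers the original values.
print(np.allclose(np.round(halves, 3), values))  # prints True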
Cache tuning¶
Wormtable uses Berkeley DB databases to store the locations of rows in the data file and to create indexes on columns. An important performance tuning factor is the db_cache_size for the Table and Index classes. The db_cache_size essentially determines how much of these databases is held in memory, and for performance purposes we would typically like to have the entire database in memory if possible.
In many cases, such as a sequential full table scan, a large cache size will make very little difference, so it is not a good idea to have a large cache size by default. There are certain situations, however, when a large db cache is definitely a good idea.
When we are building an index, performance can suffer quite badly if sufficient cache is not provided, since Berkeley DB will need to write pages to disk and subsequently read them back. It is therefore a good idea to provide a large cache size when creating an index (several gigabytes is usually a good choice). There is no harm in specifying a cache size larger than is required, since the db_cache_size is an upper limit on the amount of memory used: Berkeley DB will only use as much memory as is needed to keep the database in memory.
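As a concrete illustration, here is a minimal sketch of opening a table and one of its indexes with a large cache. It assumes a table in data.wt with an index on POS already built, and that db_cache_size accepts size suffixes such as "M" and "G".
import wormtable as wt

# The cache size is an upper limit, so requesting more than is
# needed does no harm.
table = wt.open_table("data.wt", db_cache_size="4G")
index = table.open_index("POS", db_cache_size="4G")
for chrom, pos in index.cursor(["CHROM", "POS"]):
    pass  # process rows in POS order
index.close()
table.close()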
For further information, see the discussion on setting cache sizes for Berkeley DB.