Performance tuning

Release: 0.1
Date: April 08, 2015

One of the main goals of wormtable is to provide a high performance platform for processing tables that is also easy to use. While wormtable should be quite efficient in most cases, there are a few things that can improve performance considerably when working with very large tables.

Schema tuning

To get the best performance possible from wormtable, it is important that the rows are as small as possible. This is done either by using the smallest possible type for each column, or by discarding the column entirely if it is not needed. Schemas generated by vcf2wt for VCF files are conservative, and can often be improved considerably by some hand tuning. By shaving a few tens of bytes off each row we can sometimes make the table gigabytes smaller. Since the main thing that slows wormtable down is reading data off disk, the smaller the table the faster it is.

To tune the schema of a VCF file, we must first use the --generate option in vcf2wt to generate a candidate schema:

$ vcf2wt -g data.vcf schema.xml

The schema.xml file then contains the automatically generated (conservative) schema, which we can edit, changing and deleting columns as we see fit. Once we’re happy with the schema, we build the table with it using the --schema option:

$ vcf2wt -s schema.xml data.vcf data.wt

There are many ways in which we can make a more compact schema. In the following, we’ll work with a VCF file from the Drosophila Genetic Reference Panel (available here), improving on the schema automatically generated by vcf2wt.

Consider, for example, the following segment of the schema generated by vcf2wt:

<column description="Genotype" element_size="1" element_type="char" name="DGRP-021.GT" num_elements="var(1)"/>
<column description="number of supporting reads" element_size="4" element_type="int" name="DGRP-021.SR" num_elements="1"/>
<column description="number of opposing reads" element_size="4" element_type="int" name="DGRP-021.OR" num_elements="1"/>
<column description="Genotype quality" element_size="4" element_type="int" name="DGRP-021.GQ" num_elements="1"/>

There are several ways in which this schema is less than optimal. Firstly, we know that this VCF is for diploids, and so the genotype columns (e.g. DGRP-021.GT above) are always exactly three characters long. Yet in this schema the number of elements is var(1), allowing variable sized strings to be stored in this column. Variable sized columns have an overhead of three bytes above the actual values stored, and so in this case we are using twice as many bytes as we should be. To rectify this, we change num_elements to 3:

<column description="Genotype" element_size="1" element_type="char" name="DGRP-021.GT" num_elements="3"/>
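The arithmetic behind this saving is easy to check. The following snippet is purely illustrative (the helper functions are ours, not part of wormtable), using the three-byte overhead for variable sized columns described above:

```python
# Illustrative arithmetic only: the three-byte overhead for variable
# sized columns is taken from the text above; these helpers are not
# part of wormtable.

VAR_OVERHEAD = 3  # bytes of overhead per variable sized column

def var_column_bytes(num_values, element_size=1):
    """Bytes used to store num_values elements in a var() column."""
    return VAR_OVERHEAD + num_values * element_size

def fixed_column_bytes(num_values, element_size=1):
    """Bytes used by a fixed length column of num_values elements."""
    return num_values * element_size

# A diploid genotype such as "0/1" is always three characters, so the
# var(1) column costs exactly twice as much as a fixed column with
# num_elements="3".
print(var_column_bytes(3), fixed_column_bytes(3))  # 6 3
```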

The next two columns record the number of supporting reads and the number of opposing reads at a variant site, and these are stored using four byte integers. Therefore, these columns DGRP-021.SR and DGRP-021.OR can store integers in the range -2147483647 to 2147483647 (see Integer columns). This range is far too large. A more suitable type for these columns is a 2 byte unsigned integer, giving a range of 0 to 65534 (see Unsigned integer columns). The new lines look like:

<column description="number of supporting reads" element_size="2" element_type="uint" name="DGRP-021.SR" num_elements="1"/>
<column description="number of opposing reads" element_size="2" element_type="uint" name="DGRP-021.OR" num_elements="1"/>
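The ranges quoted above follow directly from the element size. Note that the limits are one short of the usual two's-complement bounds, since one bit pattern per column is reserved (presumably for missing values). This derivation is ours, based on the ranges stated in the text:

```python
# Sketch of where the quoted column ranges come from, assuming one bit
# pattern per column is reserved (e.g. for missing values), which is
# why the limits are one short of the usual integer bounds.

def int_range(element_size):
    """(min, max) for a signed integer column of the given byte size."""
    bits = 8 * element_size
    return -(2 ** (bits - 1) - 1), 2 ** (bits - 1) - 1

def uint_range(element_size):
    """(min, max) for an unsigned integer column of the given byte size."""
    bits = 8 * element_size
    return 0, 2 ** bits - 2

print(int_range(4))   # (-2147483647, 2147483647), the 4 byte int range above
print(uint_range(2))  # (0, 65534), the 2 byte uint range above
```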

Finally, we see that the genotype quality column is also a four byte signed integer, when a 1 byte unsigned integer suffices to store the values in this VCF:

<column description="Genotype quality" element_size="1" element_type="uint" name="DGRP-021.GQ" num_elements="1"/>

We have saved a total of 8 bytes over the default schema by making these changes. This hardly seems worth the effort, but is in fact quite significant for two reasons. Firstly, every byte we save per row really does count when we are storing millions of rows. Secondly, there are 205 genotypes in this VCF so we can make a saving of 1640 bytes by applying these changes to all of the relevant columns.

Another way in which we can save space is to delete columns that we are not interested in or that don’t contain any information. For example, in the Drosophila VCF above, the ID and QUAL columns contain only missing values, and the FILTER column contains only PASS. We can simply delete these columns from the schema, saving another 14 bytes per row.

This tweaking makes a considerable difference. The source VCF file is 2.8GB when gzip compressed, and 15GB uncompressed. When we use the automatic schema from vcf2wt the resulting wormtable data file is 21.4GB. When we make the changes mentioned above, however, the data file requires only 9.7GB.

Half precision floats

Half precision floats provide a useful means of saving space when we have a lot of floating point data. A good example is the VCF files from the 1000 Genomes project, which have a very large number of samples and use floating point columns for each sample. For example, for this VCF vcf2wt generates the following schema fragment:

<column description="Genotype" element_size="1" element_type="char" name="HG00096.GT" num_elements="var(1)"/>
<column description="Genotype dosage from MaCH/Thunder" element_size="4" element_type="float" name="HG00096.DS" num_elements="1"/>
<column description="Genotype Likelihoods" element_size="4" element_type="float" name="HG00096.GL" num_elements="var(1)"/>

Each of the .DS and .GL columns uses 4 byte floating point values, even though the input values are small with very low precision. In this case, half precision floats are perfect, and save a great deal of space. Changing the variable length columns to fixed length columns again and using 2 byte floats, we get the following schema fragment:

<column description="Genotype" element_size="1" element_type="char" name="HG00096.GT" num_elements="3"/>
<column description="Genotype dosage from MaCH/Thunder" element_size="2" element_type="float" name="HG00096.DS" num_elements="1"/>
<column description="Genotype Likelihoods" element_size="2" element_type="float" name="HG00096.GL" num_elements="3"/>

Applying these changes to all samples makes a considerable difference: using the default schema, the wormtable datafile is 77GB, but using the modified schema gives us a data file of 34GB. It should be emphasised here that there is no loss of information in this case. All the floating point values in the input VCF have at most three decimal places of precision, which half precision floats preserve for values of this magnitude.
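This round-trip behaviour can be checked from Python, which supports IEEE 754 half precision via the struct module's 'e' format code. The sample values below are invented for illustration; they are not taken from the 1000 Genomes file:

```python
# Check that low-precision values like genotype dosages and likelihoods
# survive a round trip through IEEE 754 half precision. Python's struct
# module supports half floats via the 'e' format code. The sample
# values below are invented for illustration; note that for magnitudes
# well above 2 the half float spacing grows and three decimal places
# would no longer be preserved.

import struct

def roundtrip_half(x):
    """Pack x into a 2 byte half precision float and unpack it again."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

for value in [0.0, 0.05, 0.48, 1.234, 2.0]:
    recovered = roundtrip_half(value)
    # Rounding back to three decimal places recovers the original.
    assert round(recovered, 3) == value
```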

Cache tuning

Wormtable uses Berkeley DB databases to store the locations of rows in the datafile and to create indexes on columns. An important performance tuning factor is the db_cache_size for the Table and Index classes. The db_cache_size essentially determines how much of these databases is held in memory, and typically, for performance purposes we would like to have the entire database in memory if possible.

In many cases, such as a sequential full table scan, a large cache size will make very little difference, so it is not a good idea to have a large cache size by default. There are certain situations, however, when a large db cache is definitely a good idea.

When we are building an index, performance can suffer quite badly if sufficient cache is not provided, since Berkeley DB will need to write pages to disk and subsequently read them back. It is therefore a good idea to provide a large cache size when creating an index (several gigabytes is usually a good choice). There is no harm in specifying a cache size larger than is required, since the db_cache_size is an upper limit on the amount of memory used. Berkeley DB will only use as much memory as is needed to keep the database in memory.
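As a concrete sketch, assuming the Python API implied above (Table and Index classes taking a db_cache_size parameter — check your installed version for the exact signatures), a large cache can be requested when opening a table or building an index. The parse_cache_size helper below is our own, simply to illustrate how human readable size strings of the kind such parameters typically accept map to bytes:

```python
# Illustrative sketch only. The wormtable calls in the trailing comment
# follow the Table/Index API described in this document and should be
# checked against your installed version; the helper below is ours.

def parse_cache_size(spec):
    """Convert a size such as '4G' or '512M' into a number of bytes."""
    suffixes = {'K': 1 << 10, 'M': 1 << 20, 'G': 1 << 30, 'T': 1 << 40}
    spec = str(spec).strip()
    if spec and spec[-1].upper() in suffixes:
        return int(spec[:-1]) * suffixes[spec[-1].upper()]
    return int(spec)

print(parse_cache_size('4G'))  # 4294967296

# Hypothetical usage (requires wormtable and an existing table on disk):
#   import wormtable as wt
#   t = wt.open_table('data.wt', db_cache_size='4G')
#   i = t.open_index('CHROM', db_cache_size='4G')
```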

For further information, see the discussion on setting cache sizes for Berkeley DB.