.. _performance-index:

.. py:currentmodule:: wormtable

==========================
Performance tuning
==========================

:Release: |version|
:Date: |today|

One of the main goals of wormtable is to provide a high performance platform
for processing tables that is easy to use. While wormtable should be quite
efficient in most cases, there are a few things which can improve performance
considerably when working with very large tables.

.. _performance-schema:

-------------
Schema tuning
-------------

To get the best performance possible from wormtable, it is important that the
rows are as small as possible. This is done either by using the smallest
possible type for each column, or by discarding the column entirely if it is
not needed. Schemas generated by ``vcf2wt`` for VCF files are conservative,
and can often be improved considerably by some hand tuning. By shaving a few
tens of bytes off each row we can sometimes make the table gigabytes smaller.
Since the main thing that slows wormtable down is reading data off disk, the
smaller the table, the faster it is.

To tune the schema of a VCF file, we first use the ``--generate`` option of
``vcf2wt`` to generate a candidate schema::

    $ vcf2wt -g data.vcf schema.xml

The ``schema.xml`` file then contains the automatically generated
(conservative) schema. We can edit this schema, changing and deleting columns
as we see fit. Once we're happy with the schema, we build the table using the
new schema via the ``--schema`` option::

    $ vcf2wt -s schema.xml data.vcf data.wt

There are many ways in which we can make a more compact schema. In the
following, we'll work with a VCF file from the Drosophila Genetic Reference
Panel (available `here `_), improving on the schema automatically generated
by ``vcf2wt``. Consider, for example, a segment of the generated schema along
the following lines:

.. code-block:: xml

    <!-- illustrative fragment; exact attributes and descriptions may differ -->
    <column name="DGRP-021.GT" element_type="char" element_size="1"
        num_elements="var(1)" description="Genotype"/>
    <column name="DGRP-021.SR" element_type="int" element_size="4"
        num_elements="1" description="Number of supporting reads"/>
    <column name="DGRP-021.OR" element_type="int" element_size="4"
        num_elements="1" description="Number of opposing reads"/>
    <column name="DGRP-021.GQ" element_type="int" element_size="4"
        num_elements="1" description="Genotype quality"/>

There are several ways in which this schema is less than optimal. Firstly, we
know that this VCF is for diploids, and so the genotype columns (e.g.
``DGRP-021.GT`` above) are always exactly three characters long. Yet, in this
schema the number of elements is ``var(1)``, allowing variable sized strings
to be stored in this column. Variable sized columns have an overhead of three
bytes above the actual values stored, and so in this case we are using twice
as many bytes as we should be. To rectify this, we change the
``num_elements`` to ``3``:

.. code-block:: xml

    <column name="DGRP-021.GT" element_type="char" element_size="1"
        num_elements="3" description="Genotype"/>

The next two columns hold the number of supporting reads and the number of
opposing reads at a variant site, and these are stored using four byte
integers. These columns, ``DGRP-021.SR`` and ``DGRP-021.OR``, can therefore
store integers in the range -2147483647 to 2147483647 (see
:ref:`int-types-index`). This range is far larger than needed. A more
suitable type for these columns is a 2 byte unsigned integer, giving a range
of 0 to 65534 (see :ref:`uint-types-index`). The new lines look something
like this:

.. code-block:: xml

    <column name="DGRP-021.SR" element_type="uint" element_size="2"
        num_elements="1" description="Number of supporting reads"/>
    <column name="DGRP-021.OR" element_type="uint" element_size="2"
        num_elements="1" description="Number of opposing reads"/>

Finally, we see that the genotype quality column is also a four byte signed
integer, whereas a 1 byte unsigned integer suffices to store the values in
this VCF:

.. code-block:: xml

    <column name="DGRP-021.GQ" element_type="uint" element_size="1"
        num_elements="1" description="Genotype quality"/>

We have saved a total of 8 bytes over the default schema by making these
changes. This hardly seems worth the effort, but is in fact quite significant
for two reasons. Firstly, every byte we save per row really does count when
we are storing millions of rows. Secondly, there are 205 genotypes in this
VCF, so applying these changes to all of the relevant columns saves 1640
bytes per row.
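
When hand tuning a schema with hundreds of columns, it is easy to lose track
of how large the rows will actually be. The short sketch below totals up the
fixed-size portion of a candidate schema using the standard library XML
parser; it only assumes the ``element_size`` and ``num_elements`` attributes
used in the fragments above, and it simply counts variable-length columns
rather than costing them, since their size depends on the data.

.. code-block:: python

    # estimate_schema_size.py -- rough per-row size estimate for a candidate
    # schema. Assumes <column> elements carry the element_size and
    # num_elements attributes shown in the fragments above; treat the result
    # as an estimate only.
    import sys
    import xml.etree.ElementTree as ET

    def estimate_row_size(schema_file):
        fixed_bytes = 0
        variable_columns = 0
        for column in ET.parse(schema_file).iter("column"):
            size = int(column.get("element_size"))
            num = column.get("num_elements")
            if num.startswith("var"):
                # Variable-length columns add per-row overhead that depends
                # on the stored values, so just count them here.
                variable_columns += 1
            else:
                fixed_bytes += size * int(num)
        return fixed_bytes, variable_columns

    if __name__ == "__main__":
        fixed, variable = estimate_row_size(sys.argv[1])
        print("fixed columns: {} bytes per row".format(fixed))
        print("variable length columns: {}".format(variable))

Running this against the generated and hand-tuned versions of ``schema.xml``
gives a quick before-and-after comparison without rebuilding the table.
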
Another way in which we can save space is to delete columns that we are not
interested in or that don't contain any information. For example, in the
Drosophila VCF above, the ``ID`` and ``QUAL`` columns contain only missing
values, and the ``FILTER`` column contains only ``PASS``. We can simply
delete these columns from the schema to save another 14 bytes per row.

This tweaking makes a considerable difference. The source VCF file is 2.8GB
when gzip compressed, and 15GB uncompressed. When we use the automatic schema
from ``vcf2wt``, the resulting wormtable data file is 21.4GB. When we make
the changes mentioned above, however, the data file requires only 9.7GB.

*********************
Half precision floats
*********************

Half precision floats provide a useful means of saving space when we have a
lot of floating point data. A good example of this is the set of VCF files
from the `1000 Genomes `_ project. These VCF files have a very large number
of samples, and use floating point columns for each sample. For example, for
`this VCF `_, ``vcf2wt`` generates a schema fragment along the following
lines:

.. code-block:: xml

    <!-- illustrative fragment; sample names and descriptions may differ -->
    <column name="HG00096.DS" element_type="float" element_size="4"
        num_elements="1" description="Genotype dosage"/>
    <column name="HG00096.GL" element_type="float" element_size="4"
        num_elements="var(1)" description="Genotype likelihoods"/>

Each of the ``.DS`` and ``.GL`` columns uses 4 byte floating point values,
even though the input values are small and have very low precision. In this
case, half precision floats are perfect, and save a great deal of space.
Changing the variable length columns to fixed length columns again and using
2 byte floats, we get a fragment like this:

.. code-block:: xml

    <column name="HG00096.DS" element_type="float" element_size="2"
        num_elements="1" description="Genotype dosage"/>
    <column name="HG00096.GL" element_type="float" element_size="2"
        num_elements="3" description="Genotype likelihoods"/>

Applying these changes to all samples makes a considerable difference: using
the default schema, the wormtable data file is 77GB, but using the modified
schema gives us a data file of 34GB. It should be emphasised here that there
is no loss of information in this case. All the floating point values in the
input VCF have at most three decimal places of precision, which half
precision floats are able to represent.

.. _performance-cache:

------------
Cache tuning
------------

Wormtable uses Berkeley DB databases to store the locations of rows in the
data file and to create indexes on columns. An important performance tuning
factor is the ``db_cache_size`` for the :class:`Table` and :class:`Index`
classes. The ``db_cache_size`` essentially determines how much of these
databases is held in memory; for best performance we would like the entire
database to be held in memory if possible. In many cases, such as a
sequential full table scan, a large cache size will make very little
difference, so it is not a good idea to have a large cache size by default.

There are certain situations, however, when a large db cache is definitely a
good idea. When we are building an index, performance can suffer quite badly
if sufficient cache is not provided, since Berkeley DB will need to write
pages to disk and subsequently read them back. It is therefore a good idea
to provide a large cache size when creating an index (several gigabytes is
usually a good choice). There is no harm in specifying a cache size larger
than is required, since the ``db_cache_size`` is an upper limit on the amount
of memory used: Berkeley DB will only use as much memory as is needed to keep
the database in memory. For further information, see the discussion on
setting cache sizes for `Berkeley DB `_.
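
As a concrete illustration, the sketch below opens a table and one of its
indexes with multi-gigabyte caches before scanning the index. The
``open_table`` and ``open_index`` calls, the ``"CHROM+POS"`` index name and
the ``"4G"``-style size strings are assumptions made for illustration; the
point is simply that a generous ``db_cache_size`` is safe, because it acts as
an upper limit rather than a fixed allocation.

.. code-block:: python

    # A minimal sketch of cache tuning. The function and index names used
    # here are illustrative assumptions; the key idea is passing a generous
    # db_cache_size, which Berkeley DB treats as an upper limit.
    import wormtable as wt

    # Keep up to 4G of the row-location database in memory.
    table = wt.open_table("data.wt", db_cache_size="4G")

    # Give the index its own large cache for the scan below.
    index = table.open_index("CHROM+POS", db_cache_size="8G")

    cursor = index.cursor(["CHROM", "POS"])
    for chrom, pos in cursor:
        pass  # process each row here

    index.close()
    table.close()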