.. currentmodule:: hdf5storage

.. _Compression:

===========
Compression
===========

The HDF5 libraries and the :py:mod:`h5py` module support transparent
compression of data in HDF5 files. Compression can sometimes drastically
reduce file size, often makes it faster to read the data from the file,
and sometimes makes it faster to write the data. However, not all data
compresses well, and some data can occasionally end up larger after
compression than it was uncompressed. Compression costs CPU time both
when compressing the data and when decompressing it. It can still lead
to faster reads and writes because disks are slow, and the space savings
can save enough disk access time to make up for the CPU time.

All versions of this package can read compressed data, but not all
versions can write compressed data.

.. versionadded:: 0.1.9

   HDF5 write compression features added along with several options to
   control it in :py:class:`Options`.

.. versionadded:: 0.1.7

   :py:class:`Options` will take the compression options but ignores
   them.

.. warning::

   Passing the compression options for versions earlier than ``0.1.7``
   will result in an error.

Enabling Compression
====================

Compression is enabled by default. It is controlled by setting
:py:attr:`Options.compress` or passing ``compress=X`` to
:py:func:`write` and :py:func:`savemat`, where ``X`` is ``True`` or
``False``.

.. note::

   Not all Python objects written to the HDF5 file will be compressed,
   or even support compression. For one, :py:mod:`numpy` scalars, and
   any type that is stored as one, do not support compression due to
   limitations of the HDF5 library. Compressing them would be a waste
   in any case, hence the lack of support.

Setting The Minimum Data Size for Compression
=============================================

Compressing small pieces of data often wastes space (the compressed
size is larger than the uncompressed size) and CPU time. Because of
this, Python objects have to be larger than a particular size before
this package will compress them. The threshold, in bytes, is controlled
by setting :py:attr:`Options.compress_size_threshold` or passing
``compress_size_threshold=X`` to :py:func:`write` and
:py:func:`savemat`, where ``X`` is a non-negative integer. The default
value is 16 KB.

Controlling The Compression Algorithm And Level
===============================================

Many compression algorithms can be used with HDF5 files, though only
three are common. The Deflate algorithm (sometimes known as the GZIP
algorithm), the LZF algorithm, and the SZIP algorithm are the ones that
the HDF5 library is explicitly set up to support. The library also has
a mechanism for adding additional algorithms. Popular ones include the
BZIP2 and BLOSC algorithms.

The compression algorithm used is controlled by setting
:py:attr:`Options.compression_algorithm` or passing
``compression_algorithm=X`` to :py:func:`write` and :py:func:`savemat`.
``X`` is the ``str`` name of the algorithm. The default is ``'gzip'``,
corresponding to the Deflate/GZIP algorithm.

.. note::

   As of version ``0.2``, only the Deflate (``X = 'gzip'``), LZF
   (``X = 'lzf'``), and SZIP (``X = 'szip'``) algorithms are supported.

.. note::

   If doing MATLAB compatibility (:py:attr:`Options.matlab_compatible`
   is ``True``), only the Deflate algorithm is supported.

The algorithms, in more detail:

GZIP / Deflate (``'gzip'``)
    The common Deflate algorithm seen in the Unix and Linux ``gzip``
    utility and the most common compression algorithm used in ZIP
    files. It is the most compatible algorithm. It achieves good
    compression and is reasonably fast. It has no patent or license
    restrictions.

LZF (``'lzf'``)
    A very fast algorithm, but with inferior compression to
    GZIP/Deflate. It is less commonly used than GZIP/Deflate, but
    similarly has no patent or license restrictions.

SZIP (``'szip'``)
    This compression algorithm isn't always available and has patent
    and license restrictions. See the SZIP License (listed under
    Further Reading).

If GZIP/Deflate compression is being used, the compression level can be
adjusted by setting :py:attr:`Options.gzip_compression_level` or
passing ``gzip_compression_level=X`` to :py:func:`write` and
:py:func:`savemat`, where ``X`` is an integer between ``0`` and ``9``
inclusive. ``0`` is the lowest compression, but is the fastest. ``9``
gives the best compression, but is the slowest. The default is ``7``.

For all compression algorithms, there is an additional filter, the
shuffle filter, which can help achieve better compression at relatively
low cost in CPU time. It is controlled by setting
:py:attr:`Options.shuffle_filter` or passing ``shuffle_filter=X`` to
:py:func:`write` and :py:func:`savemat`, where ``X`` is ``True`` or
``False``. The default is ``True``.

Using Checksums
===============

Fletcher32 checksums can be calculated and stored for most types of
stored data in an HDF5 file. They are checked when the data is read in
order to catch file corruption. If corruption is detected, reading the
data raises an error informing the user that the data is corrupt. The
filter can be enabled or disabled separately for data that is
compressed and data that is not compressed (e.g. compression is
disabled, the Python object can't be compressed, or the Python object's
data size is smaller than the compression threshold).

For compressed data, it is controlled by setting
:py:attr:`Options.compressed_fletcher32_filter` or passing
``compressed_fletcher32_filter=X`` to :py:func:`write` and
:py:func:`savemat`, where ``X`` is ``True`` or ``False``. The default is
``True``.

For uncompressed data, it is controlled by setting
:py:attr:`Options.uncompressed_fletcher32_filter` or passing
``uncompressed_fletcher32_filter=X`` to :py:func:`write` and
:py:func:`savemat`, where ``X`` is ``True`` or ``False``. The default is
``False``.

.. note::

   Fletcher32 checksums are not computed for anything that is stored as
   a :py:mod:`numpy` scalar.

Chunking
========

When no filters are used (neither compression nor Fletcher32), this
package stores data in HDF5 files in a contiguous manner. The use of
any filter requires that the data use chunked storage. Chunk sizes are
determined automatically using the autochunk feature of
:py:mod:`h5py`. The HDF5 libraries make reading contiguous and chunked
data transparent, though access speeds can differ and the chunk size
affects the compression ratio.

Further Reading
===============

.. seealso::

    HDF5 Datasets Filter pipeline
        Description of the Dataset filter pipeline in the
        :py:mod:`h5py` documentation.

    Using Compression in HDF5
        FAQ on compression from the HDF Group.

    HDF5 Tutorial: Learning The Basics: Dataset Storage Layout
        Information on Dataset storage format from the HDF Group.

    SZIP License
        The license for using the SZIP compression algorithm.

    SZIP COMPRESSION IN HDF PRODUCTS
        Information on using SZIP compression from the HDF Group.

    3rd Party Compression Algorithms for HDF5
        List of common additional compression algorithms.
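As a rough illustration of two points made earlier (tiny objects such as
scalars are not worth compressing, and a shuffle step can help the
compressor), the sketch below uses only Python's standard ``zlib``
module, which implements the same Deflate algorithm as ``'gzip'``. It
does not use this package or HDF5. The helper ``byte_shuffle`` is a
hypothetical stand-in written for this example; it only imitates the
byte regrouping idea behind the shuffle filter and is not the HDF5
implementation. Level ``7`` is chosen to mirror the default
``gzip_compression_level`` described above.

```python
import struct
import zlib


def byte_shuffle(raw, itemsize):
    """Regroup bytes so byte 0 of every element comes first, then byte 1, etc.

    Hypothetical helper for illustration only; it mimics the idea of a
    shuffle filter and is not taken from HDF5 or h5py.
    """
    return b"".join(raw[i::itemsize] for i in range(itemsize))


# A single float64 scalar: 8 bytes of payload.
scalar = struct.pack("<d", 3.141592653589793)
scalar_compressed = zlib.compress(scalar, 7)

# 10000 slowly increasing 32-bit integers: 40000 bytes of payload.
values = struct.pack("<10000i", *range(10000))
plain_size = len(zlib.compress(values, 7))
shuffled_size = len(zlib.compress(byte_shuffle(values, 4), 7))

print("scalar:", len(scalar), "->", len(scalar_compressed), "bytes")
print("array, no shuffle:", len(values), "->", plain_size, "bytes")
print("array, shuffled:  ", len(values), "->", shuffled_size, "bytes")
```

The scalar grows when compressed because the Deflate stream carries
fixed overhead, which is the motivation for a size threshold. The
integer array compresses better after shuffling because the high-order
bytes, which rarely change, end up next to each other. Real data may
behave differently, so treat the sizes printed here as specific to this
synthetic input.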