Compression

The HDF5 libraries and the h5py module support transparent compression of data in HDF5 files.

The use of compression can sometimes drastically reduce file size, often makes it faster to read the data from the file, and sometimes makes it faster to write the data. Not all data compresses well, however, and data can occasionally end up larger after compression than it was uncompressed. Compression also costs CPU time, both when compressing the data and when decompressing it. It can nevertheless lead to faster read and write times because disks are very slow compared to the CPU, so the space savings can save enough disk access time to make up for the added CPU time.

All versions of this package can read compressed data, but not all versions can write compressed data.

New in version 0.1.9: Support for writing compressed HDF5 data was added, along with several options in Options to control it.

New in version 0.1.7: Options accepts the compression options but ignores them.

Warning

Passing the compression options to versions earlier than 0.1.7 will result in an error.

Enabling Compression

Compression, which is enabled by default, is controlled by setting Options.compress or by passing compress=X to write() and savemat(), where X is True or False.
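For example, a minimal sketch of turning compression on and off (assuming this package is imported as hdf5storage and that write() takes the object, a target path within the file, and a filename, as in its basic usage; the file names here are hypothetical):

    import numpy as np
    import hdf5storage

    data = np.random.rand(1000, 1000)

    # Compression is on by default; shown explicitly here for clarity.
    hdf5storage.write(data, path='/data', filename='compressed.h5',
                      compress=True)

    # The same write with compression disabled.
    hdf5storage.write(data, path='/data', filename='uncompressed.h5',
                      compress=False)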

Note

Not all python objects written to the HDF5 file will be compressed, or even support compression. In particular, numpy scalars, and any type that is stored as one, do not support compression due to limitations of the HDF5 library. Compressing them would be a waste anyway, hence the lack of support.

Setting The Minimum Data Size for Compression

Compressing small pieces of data often wastes space (compressed size is larger than uncompressed size) and CPU time. Due to this, python objects have to be larger than a particular size before this package will compress them. The threshold, in bytes, is controlled by setting Options.compress_size_threshold or passing compress_size_threshold=X to write() and savemat() where X is a non-negative integer. The default value is 16 KB.
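A short sketch of raising the threshold through savemat() (savemat() is assumed to take a filename and a dict of variables; the file and variable names are hypothetical):

    import numpy as np
    import hdf5storage

    # Only compress objects of at least 64 KB instead of the 16 KB default.
    hdf5storage.savemat('data.mat', {'a': np.zeros((512, 512))},
                        compress=True,
                        compress_size_threshold=64 * 1024)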

Controlling The Compression Algorithm And Level

Many compression algorithms can be used with HDF5 files, though only three are common. The HDF5 library is explicitly set up to support the Deflate algorithm (sometimes known as the GZIP algorithm), the LZF algorithm, and the SZIP algorithm. The library also has a mechanism for adding additional algorithms; popular ones include the BZIP2 and BLOSC algorithms.

The compression algorithm used is controlled by setting Options.compression_algorithm or passing compression_algorithm=X to write() and savemat(). X is the str name of the algorithm. The default is 'gzip' corresponding to the Deflate/GZIP algorithm.

Note

As of version 0.2, only the Deflate (X = 'gzip'), LZF (X = 'lzf'), and SZIP (X = 'szip') algorithms are supported.

Note

When writing MATLAB compatible files (Options.matlab_compatible is True), only the Deflate algorithm is supported.
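As a sketch of selecting a different algorithm through an Options object (assuming Options accepts these settings as keyword arguments and that write() accepts an options argument; MATLAB compatibility is turned off here since, per the note above, it restricts the algorithm to Deflate):

    import numpy as np
    import hdf5storage

    # LZF is not MATLAB compatible, so MATLAB compatibility is disabled.
    options = hdf5storage.Options(matlab_compatible=False,
                                  compress=True,
                                  compression_algorithm='lzf')
    hdf5storage.write(np.arange(10**6), path='/data',
                      filename='data_lzf.h5', options=options)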

The algorithms, in more detail

GZIP / Deflate ('gzip')
The common Deflate algorithm seen in the Unix and Linux gzip utility and the most common compression algorithm used in ZIP files. It is the most compatible algorithm. It achieves good compression and is reasonably fast. It has no patent or license restrictions.
LZF ('lzf')
A very fast algorithm but with inferior compression to GZIP/Deflate. It is less commonly used than GZIP/Deflate, but similarly has no patent or license restrictions.
SZIP ('szip')
This compression algorithm isn’t always available and has patent and license restrictions. See SZIP License.

If GZIP/Deflate compression is being used, the compression level can be adjusted by setting Options.gzip_compression_level or passing gzip_compression_level=X to write() and savemat(), where X is an integer between 0 and 9 inclusive. 0 gives the lowest compression but is the fastest, while 9 gives the best compression but is the slowest. The default is 7.

For all compression algorithms, there is an additional filter, the shuffle filter, which can help achieve better compression at relatively low cost in CPU time. It reorders the bytes of the data so that bytes of the same significance from successive elements are grouped together, which typically makes the data more compressible. It is controlled by setting Options.shuffle_filter or passing shuffle_filter=X to write() and savemat(), where X is True or False. The default is True.
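A sketch combining the two settings (the keyword names are those given above; the file and variable names are hypothetical):

    import numpy as np
    import hdf5storage

    # Favor compression ratio over speed: maximum Deflate level with the
    # shuffle filter left at its default of True. A smoothly varying
    # array like this one benefits noticeably from the shuffle filter.
    data = np.arange(2048 * 2048, dtype='float64').reshape(2048, 2048)
    hdf5storage.savemat('tuned.mat', {'a': data},
                        compress=True,
                        compression_algorithm='gzip',
                        gzip_compression_level=9,
                        shuffle_filter=True)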

Using Checksums

Fletcher32 checksums can be calculated and stored for most types of data stored in an HDF5 file. They are then checked when the data is read in order to catch file corruption; corrupted data causes an error when read, informing the user of the corruption. The filter can be enabled or disabled separately for data that is compressed and data that is not compressed (whether because compression is disabled, the python object can’t be compressed, or the python object’s data size is smaller than the compression threshold).

For compressed data, it is controlled by setting Options.compressed_fletcher32_filter or passing compressed_fletcher32_filter=X to write() and savemat() where X is True or False. The default is True.

For uncompressed data, it is controlled by setting Options.uncompressed_fletcher32_filter or passing uncompressed_fletcher32_filter=X to write() and savemat() where X is True or False. The default is False.
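A sketch enabling checksums for both cases (hypothetical file and variable names):

    import numpy as np
    import hdf5storage

    # Store checksums for compressed data (the default) and also for any
    # data that ends up stored uncompressed.
    hdf5storage.savemat('checked.mat', {'a': np.ones(10**6)},
                        compressed_fletcher32_filter=True,
                        uncompressed_fletcher32_filter=True)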

Note

Fletcher32 checksums are not computed for anything that is stored as a numpy scalar.

Chunking

When no filters (compression or Fletcher32) are used, this package stores data in HDF5 files in a contiguous manner. The use of any filter requires that the data use chunked storage. Chunk sizes are determined automatically using the autochunk feature of h5py. The HDF5 library makes reading contiguous and chunked data transparent, though access speeds can differ and the chunk size affects the compression ratio.
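The resulting layout can be inspected directly with h5py, for example (assuming a file data.h5 containing a dataset at /data, as written in the earlier sketches):

    import h5py

    # Inspect the storage layout of a previously written dataset.
    with h5py.File('data.h5', 'r') as f:
        dset = f['/data']
        print(dset.chunks)       # None for contiguous storage, else a tuple
        print(dset.compression)  # e.g. 'gzip', 'lzf', or None
        print(dset.fletcher32)   # True if checksums are stored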

Further Reading

See also

HDF5 Datasets Filter pipeline
Description of the Dataset filter pipeline in the h5py documentation.
Using Compression in HDF5
FAQ on compression from the HDF Group.
HDF5 Tutorial: Learning The Basics: Dataset Storage Layout
Information on Dataset storage layout from the HDF Group.
SZIP License
The license for using the SZIP compression algorithm.
SZIP COMPRESSION IN HDF PRODUCTS
Information on using SZIP compression from the HDF Group.
3rd Party Compression Algorithms for HDF5
List of common additional compression algorithms.