Note
This requires the h5py package to be installed (h5py in turn requires the HDF5 library).
The XSQ format for short-read data was proposed by Life Technologies and is based on the HDF5 format. Being a binary format, it is generally more efficient than the likes of FASTQ and CSFASTA + QUAL. In practice, it is only found in the output of SOLiD sequencers of the 5500 series. However, to this author’s knowledge, close to no tool makes use of the XSQ format, so XSQ files have to be converted to another format before operations such as read alignment or de novo assembly can be performed.
We provide here both a library and executable scripts to handle sequencing data in the XSQ format.
An early step after generating sequencing data is to assess their quality.
We provide a script that can generate HTML reports with graphics to quickly give an impression of the read quality.
xsq-report --help
Warning
This is experimental and has not been worked on for a while. It may or may not work well. As the stream / iterator based tools develop (see Parsing), the whole QC should move to a data-format-agnostic module in the package.
The package contains a script that can be run as an executable and is installed at the same time as the package. The executable can convert XSQ files into FASTA-like formats (FASTA, or CSFASTA+QUAL), or FASTQ-like formats (FASTQ, should the sequencing run be ECC, or CSFASTQ).
Try:
xsq-convert --help
The script itself is written using this module, and its code should show how other tools handling the XSQ format could be built.
The command-line executables are built from the library ngs_plumbing.xsq.
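To make the FASTQ-like target format mentioned above concrete, here is a minimal sketch of one FASTQ record written by hand. It does not call xsq-convert or ngs_plumbing; the helper name `format_fastq_record`, the read name, and the Phred+33 quality offset are assumptions made for illustration.

```python
def format_fastq_record(name, sequence, qualities, offset=33):
    """Return one FASTQ record as a string (hypothetical helper).

    qualities is a sequence of integer Phred scores, one per base;
    offset=33 assumes the Sanger / Phred+33 encoding.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    # Each quality score becomes one printable ASCII character.
    qual_string = "".join(chr(q + offset) for q in qualities)
    return "@{0}\n{1}\n+\n{2}\n".format(name, sequence, qual_string)


# Made-up read, for illustration only.
record = format_fastq_record("read_1", "ACGT", [30, 30, 20, 10])
print(record, end="")
```

A record is four lines: the name prefixed with `@`, the sequence, a `+` separator, and the encoded qualities.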
Note
Here again, the aim is to move to a format-agnostic set of streams / iterators (see Parsing), so that the format in which the data are stored should not matter at all.
The XSQFile class, as the name suggests, models XSQ files. For convenience, and for running unit tests, a very small XSQ file is included with the package.
>>> from ngs_plumbing import xsq
>>> test_files = xsq.list_xsq(xsq.testdata_dir)
Creating an XSQ object from an existing file is easy:
>>> xf = xsq.XSQFile(next(test_files))
At the root of the XSQ, data are split into libraries. If no multiplexing (barcode tagging) of samples was performed there will be only one library (called ‘DefaultLibrary’ by default).
>>> print(xf.library_names)
(u'DefaultLibrary',)
A library can be accessed with the method XSQFile.library():
>>> lb = xf.library('DefaultLibrary')
>>> print(lb.name)
A number of operations can be performed:
>>> print(lb.readcount())
Under the hood, an XSQLibrary inherits from h5py.Group, and the underlying methods remain accessible.
When facing an XSQ file with so-called ECC data (so far direct reads only), data can be extracted into the FASTAQ format. The sequencing software linked to the machine can be told about the presence of barcodes when the run is started, and the assignment of reads to buckets corresponding to barcodes is done by the software that extracts data from the images taken of the flowcell. An easy entry point is the function ngs_plumbing.xsq.automagic_fasta(), which extracts data and splits them into different FASTAQ files if barcodes were used.
ColourQual(colour, qual)
colour: Alias for field number 0
qual: Alias for field number 1
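The two "alias" entries above are the field descriptions of a named tuple. The sketch below rebuilds an equivalent type with the standard library to show how the fields are reached by name or by position. It is a stand-in, not the class defined in ngs_plumbing.xsq, and the values are made up.

```python
from collections import namedtuple

# Stand-in with the same name and field order as documented above.
ColourQual = namedtuple("ColourQual", ["colour", "qual"])

# Made-up values, for illustration only.
read = ColourQual(colour="T0123", qual=[30, 28, 25, 20])

# Field 0 and field 1 are reachable by name or by index.
assert read.colour == read[0]
assert read.qual == read[1]

# Named tuples unpack like ordinary tuples.
colour, qual = read
```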
Sequencing data from a run
Names of the libraries in the file (ignoring the unclassified and unassigned groups)
Metadata about the run
Sequenced library in an XSQ file.
Iterator over the sequence reads for the fragment ‘fragment’. Each iteration returns a named tuple of length 2:
For a naive colour-to-sequence translation, the first base for the fragment should be used. It is in the dict FRAGMENT_START.
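The naive translation mentioned above can be sketched as follows. The function name `decode_colours` is hypothetical, and the start base is passed in explicitly rather than looked up in FRAGMENT_START, whose contents are not shown here. The sketch relies on the standard SOLiD two-base encoding, in which each colour (0 to 3) encodes the transition from one base to the next.

```python
# Two-bit codes for the bases; with this numbering the next base is
# obtained by XOR-ing the current base with the colour.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def decode_colours(start_base, colours):
    """Naively translate a string of colours (digits 0-3) into bases.

    start_base is the known base preceding the first colour; it is not
    included in the returned sequence.
    """
    current = BASE_TO_BITS[start_base]
    bases = []
    for colour in colours:
        current ^= int(colour)
        bases.append(BITS_TO_BASE[current])
    return "".join(bases)


print(decode_colours("T", "0123"))  # prints TGAT
```

The translation is called naive because each base depends on the previous one: a single miscalled colour changes every base after it. The sketch also does not handle missing colour calls.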
Iterator over the sequence reads for the fragment ‘fragment’. Each iteration returns a tuple of length 2:
Iterator over the fragment ‘fragment’ for the data ‘what’
Library Name
Details about a sequencing run in an XSQ file.
A run can have several libraries sequenced at once.
Version of the XSQ file
Version of the HDF5 file
Lane number
Library type
Name for the run
Sequencing sample description
Sequencing sample name
Tag details
Is base information present? When running a SOLiD, True likely means that ECC was run.
Number of bases called for the tag (“read length” for the tag)
Sequencing tags. F3 is typically the forward strand for single read sequencing.
Extract sequence-space data into a FASTAQ file
Extract sequence-space data into a FASTAQ file
Perform a FASTQUAL-type QC.
- xsqlibn: the XSQ library
- fragment: name of the fragment
- what: colour-space or base-space data
- sample: proportion to sample in order to perform the QC; 1 is 100%
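As an illustration of what such a QC involves, the sketch below samples a proportion of reads and computes the mean quality at each read position. It is not the implementation in ngs_plumbing: the function name `mean_quality_per_position`, the `seed` parameter, and the toy data are assumptions. Only the meaning of `sample` (1 is 100%) is taken from the description above.

```python
import random


def mean_quality_per_position(quality_reads, sample=1.0, seed=None):
    """Mean quality per read position over a random sample of reads.

    quality_reads: iterable of sequences of integer quality scores.
    sample: proportion of reads to keep (1 keeps every read).
    """
    rng = random.Random(seed)
    totals = []  # sum of qualities seen at each position
    counts = []  # number of reads covering each position
    for quals in quality_reads:
        # Keep each read independently with probability `sample`.
        if sample < 1 and rng.random() >= sample:
            continue
        for i, q in enumerate(quals):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += q
            counts[i] += 1
    return [t / float(c) for t, c in zip(totals, counts)]


# Made-up quality values; the last read is shorter than the others.
toy_reads = [[30, 20, 10], [20, 20, 20], [10, 20]]
print(mean_quality_per_position(toy_reads))  # prints [20.0, 20.0, 15.0]
```

Sampling read by read keeps memory use constant, so the same function works on an iterator over a large file.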
Iterator over reads in base space
Iterator over reads in base space
Iterator over the fragment ‘fragment’ for the data ‘what’
Iterator over the fragment ‘fragment’ for the data ‘what’