XSQ
===

.. note::

   This requires the :py:mod:`h5py` package to be installed (and that
   package itself requires the HDF5 library).

.. :py:module:: ngs_plumbing.xsq

The XSQ format for short read data was proposed by Life Technologies
and is based on the HDF5 format. Being a binary format, it is generally
more efficient than text formats such as FASTQ or CSFASTA + QUAL.

In practice, it is only encountered in the output of SOLiD sequencers
of the 5500 series. However, to this author's knowledge, almost no tool
makes use of the XSQ format, and XSQ files must be converted to other
formats before operations such as read alignment or de novo assembly
can be performed.

We provide here both a library and executable scripts to handle
sequencing data in the XSQ format.

Making reports for XSQ files
----------------------------

An early step after generating sequencing data is to assess their quality.
We provide a script that can generate HTML reports with graphics to quickly
give an impression of the read quality.

.. code-block:: bash

   xsq-report --help

.. warning::

   This is experimental and has not been worked on for a while now.
   It may or may not work well. As the stream / iterator based tools
   develop (see :ref:`section-parsing`), the whole QC should move to a
   format-agnostic module in the package.

Converting XSQ files
--------------------

The package contains a script that can be run as an executable and is
installed along with the package.

The executable can convert XSQ files into FASTA-like formats
(FASTA, or CSFASTA+QUAL), or FASTQ-like formats
(FASTQ, should the sequencing run be ECC, or CSFASTQ).

Try:

.. code-block:: bash

   xsq-convert --help

The script itself is written using this module, and its code should show
how other tools handling the XSQ format could be built.

Using the library
-----------------

The command-line executables are built from the library
:mod:`ngs_plumbing.xsq`.

.. note::

   Here again, the aim is to move to a format-agnostic set of
   streams / iterators (see :ref:`section-parsing`), so that the format
   in which the data are stored should not matter at all.

XSQFile objects
^^^^^^^^^^^^^^^

The :class:`XSQFile` class, as the name suggests, models XSQ files.

For convenience, and for running unit tests, a very small XSQ file is
included with the package.

>>> from ngs_plumbing import xsq
>>> test_files = xsq.list_xsq(xsq.testdata_dir)

Creating an :class:`XSQFile` object from an existing file is easy:

>>> xf = xsq.XSQFile(next(test_files))

Libraries
^^^^^^^^^

At the root of an XSQ file, data are split into libraries. If no
multiplexing (barcode tagging) of samples was performed, there will be
only one library (called 'DefaultLibrary' by default).

>>> print(xf.library_names)
(u'DefaultLibrary',)

A library can be accessed with the method :meth:`XSQFile.library`:

>>> lb = xf.library('DefaultLibrary')
>>> print(lb.name)

A number of operations can be performed:

- number of reads in the library

  >>> print(lb.readcount())

Under the hood, :class:`XSQLibrary` inherits from
:py:class:`h5py.Group`, and the underlying methods remain accessible.

When facing an XSQ file with so-called ECC data (so far direct reads only),
data can be extracted into the FASTQ format.

The sequencing software linked to the machine can be told about the
presence of barcodes when the run is started. The assignment of reads to
buckets corresponding to barcodes is then done by the software that
extracts data from the images taken of the flowcell.

An easy entry point is the function
:py:func:`ngs_plumbing.xsq.automagic_fasta`, which extracts the data and
splits them into separate FASTQ files if barcodes were used.

Module
------

.. automodule:: ngs_plumbing.xsq
   :members:
   :undoc-members:
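
As a complement to the reference above, the idea of splitting reads into
one FASTQ output per barcoded library can be pictured with the minimal,
self-contained sketch below. It does not use :mod:`ngs_plumbing.xsq` and is
not how the package is implemented. The names ``fastq_record`` and
``split_by_library``, the ``(library, read_id, sequence, qualities)`` tuples,
and the Phred+33 quality encoding are all assumptions made for illustration.

.. code-block:: python

   import io
   from collections import defaultdict


   def fastq_record(read_id, sequence, qualities):
       """Format one read as a four-line FASTQ record (Phred+33 assumed)."""
       if len(sequence) != len(qualities):
           raise ValueError("sequence and qualities differ in length")
       quality_string = "".join(chr(q + 33) for q in qualities)
       return "@%s\n%s\n+\n%s\n" % (read_id, sequence, quality_string)


   def split_by_library(reads, open_output):
       """Write each read to the output stream of its library.

       `reads` yields (library, read_id, sequence, qualities) tuples and
       `open_output(library)` returns a writable text stream. Both are
       hypothetical stand-ins for whatever produces and stores the reads.
       Returns the number of reads written per library.
       """
       outputs = {}
       counts = defaultdict(int)
       for library, read_id, sequence, qualities in reads:
           if library not in outputs:
               outputs[library] = open_output(library)
           outputs[library].write(fastq_record(read_id, sequence, qualities))
           counts[library] += 1
       return dict(counts)


   # Made-up reads from two barcoded libraries, written to in-memory streams.
   reads = [
       ("lib_A", "read1", "ACGT", [30, 31, 32, 33]),
       ("lib_B", "read2", "TTGA", [20, 20, 20, 20]),
       ("lib_A", "read3", "GGCC", [40, 40, 2, 2]),
   ]
   streams = defaultdict(io.StringIO)
   counts = split_by_library(reads, lambda library: streams[library])

Passing a callable for ``open_output`` keeps the sketch independent of where
the records end up: in-memory streams here, real files in practice.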