XSQ

Note

This requires the h5py package to be installed (which in turn requires the HDF5 library).

The XSQ format for short-read data was proposed by Life Technologies and is based on the HDF5 format. Being a binary format, it is generally more efficient than text formats such as FASTQ or CSFASTA + QUAL. In practice, it is only encountered in the output of SOLiD sequencers of the 5500 series. However, to this author's knowledge, almost no tools make use of the XSQ format, so files must be converted to another format before operations such as read alignment or de novo assembly can be performed.

We provide here both a library and executable scripts to handle sequencing data in the XSQ format.

Making reports for XSQ files

An early step after generating sequencing data is to assess their quality.

We provide a script that can generate HTML reports with graphics to quickly give an impression of the read quality.

xsq-report --help

Warning

This is experimental and has not been worked on for a while. It may or may not work well. As the stream / iterator based tools develop (see Parsing), the whole QC should move to a data-format-agnostic module in the package.

Converting XSQ files

The package contains a script that can act as an executable and is installed along with the package. It can convert XSQ files into FASTA-like formats (FASTA, or CSFASTA + QUAL) or FASTQ-like formats (FASTQ, should the sequencing run be ECC, or CSFASTQ).

Try:

xsq-convert --help

The script itself is written using this module, and its code should show how other tools handling the XSQ format could be built.

Using the library

The command-line executables are built on the library ngs_plumbing.xsq.

Note

Here again, the aim is to move to a format-agnostic set of streams / iterators (see Parsing), so that the format in which the data are stored should not matter at all.

XSQFile objects

The XSQFile class, as its name suggests, models XSQ files. For convenience, and for running unit tests, a very small XSQ file is included with the package.

>>> from ngs_plumbing import xsq
>>> test_files = xsq.list_xsq(xsq.testdata_dir)

Creating an XSQFile object from an existing file is easy:

>>> xf = xsq.XSQFile(next(test_files))

Libraries

At the root of an XSQ file, data are split into libraries. If no multiplexing (barcode tagging) of samples was performed, there will be only one library (called ‘DefaultLibrary’ by default).

>>> print(xf.library_names)
(u'DefaultLibrary',)

A library can be accessed with the method XSQFile.library():

>>> lb = xf.library('DefaultLibrary')
>>> print(lb.name)

A number of operations can be performed:

  • number of reads in the library
    >>> print(lb.readcount())
    

Under the hood, XSQLibrary inherits from h5py.Group, and the underlying methods remain accessible.
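As a minimal sketch of how the accessors described in this section combine, the helper below walks library_names and calls library() and readcount() for each library. The name count_reads_per_library is hypothetical and not part of ngs_plumbing, and the two Fake classes are stand-ins so that the example runs without an XSQ file or h5py.

```python
def count_reads_per_library(xsqfile):
    """Map each library name to its read count.

    `xsqfile` is any object exposing `library_names` and `library(name)`,
    as XSQFile is documented to do on this page.
    """
    counts = {}
    for name in xsqfile.library_names:
        counts[name] = xsqfile.library(name).readcount()
    return counts


# Hypothetical stand-ins, used only to keep the sketch self-contained.
class FakeLibrary:
    def __init__(self, name, n_reads):
        self.name = name
        self._n_reads = n_reads

    def readcount(self):
        return self._n_reads


class FakeXSQFile:
    def __init__(self, libraries):
        self._libs = {lib.name: lib for lib in libraries}

    @property
    def library_names(self):
        return tuple(self._libs)

    def library(self, name):
        return self._libs[name]


fake = FakeXSQFile([FakeLibrary('DefaultLibrary', 1200)])
print(count_reads_per_library(fake))
```

With a real file, passing an XSQFile instance instead of the stand-in should behave the same way, assuming readcount() returns an integer.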

When facing an XSQ file with so-called ECC data (so far direct reads only), data can be extracted into the FASTQ format. The sequencing software linked to the machine can be told about the presence of barcodes when the run is started; the assignment of reads to buckets corresponding to barcodes is then done by the software that extracts data from the images taken of the flowcell. An easy entry point is the function ngs_plumbing.xsq.automagic_fastq(), which extracts the data and splits them into separate FASTQ files if barcodes were used.

Module

class ngs_plumbing.xsq.ColourQual

ColourQual(colour, qual)

colour

Alias for field number 0

qual

Alias for field number 1
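The "alias for field number" wording is what the standard library produces for named tuples. As a small illustration, the sketch below builds a local stand-in with the same two fields (it is not imported from ngs_plumbing, and the values are made up) and shows the three equivalent ways of reading it.

```python
from collections import namedtuple

# Local stand-in mirroring the documented fields; not the library's class.
ColourQual = namedtuple('ColourQual', ['colour', 'qual'])

# Made-up values, for illustration only.
cq = ColourQual(colour=[0, 1, 2, 3], qual=[30, 28, 25, 12])

# Access by field name or by position gives the same object.
assert cq.colour is cq[0]
assert cq.qual is cq[1]

# A named tuple also unpacks like a regular tuple.
colour, qual = cq
print(colour, qual)
```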

exception ngs_plumbing.xsq.DataError
exception ngs_plumbing.xsq.FragmentError
exception ngs_plumbing.xsq.InvalidXSQError
class ngs_plumbing.xsq.XSQFile(name, mode=None, driver=None, libver=None, userblock_size=None, **kwds)

Sequencing data from a run

iter_lib()
library(name, unclassified=False, unassigned=False)
library_names

Names of the libraries in the file (ignoring the unclassified and unassigned groups)

run_metadata

Metadata about the run

class ngs_plumbing.xsq.XSQInfo
classmethod fromxsqfile_toordict(xsqfile)
class ngs_plumbing.xsq.XSQLibrary(bind)

Sequenced library in an XSQ file.

fragments()

Fragments sequenced in this library

iter_colourqual(fragment)

Iterator over the sequence reads for the fragment ‘fragment’. Each iteration returns a named tuple of length 2:

  • colour: the sequence of colours
  • qual: the quality for the sequence

For a naive colour-to-sequence translation, the first base for the fragment should be used. It is in the dict FRAGMENT_START.

iter_dnaqual(fragment, numbase)

Iterator over the sequence reads for the fragment ‘fragment’. Each iteration returns a tuple of length 2:

  • the sequence
  • the quality
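To illustrate what can be done with such (sequence, quality) pairs, here is a minimal sketch that formats one pair as a four-line FASTQ record with the common Phred+33 quality encoding. The function name is hypothetical, and the sketch assumes the sequence is a string and the quality an iterable of non-negative integers; the types actually yielded by iter_dnaqual() may differ and would need converting first.

```python
def format_fastq(seqid, sequence, qualities, offset=33):
    """Format one read as a four-line FASTQ record (Phred+33 by default)."""
    qual_str = ''.join(chr(q + offset) for q in qualities)
    if len(qual_str) != len(sequence):
        raise ValueError('sequence and quality lengths differ')
    return '@%s\n%s\n+\n%s\n' % (seqid, sequence, qual_str)


# Made-up read, for illustration only.
record = format_fastq('read_1', 'ACGT', [40, 30, 20, 0])
print(record, end='')
```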
iter_reads(fragment, what='BaseCallQV')

Iterator over the fragment ‘fragment’ for the data ‘what’

name

Library Name

readcount()

Return the number of reads for the library

class ngs_plumbing.xsq.XSQRunMetadata(bind)

Details about a sequencing run in an XSQ file.

A run can have several libraries sequenced at once.

fileversion

Version of the XSQ file

hdfversion

Version of the HDF5 file

lanenumber

Lane number

librarytype

Library type

runname

Name for the run

sequencingsampledescription

Sequencing sample description

sequencingsamplename

Sequencing sample name

tagdetails

Tag details

class ngs_plumbing.xsq.XSQTagDetails(bind)
isbasepresent

Is base information present? On a SOLiD run, True likely means that ECC was used.

numbasecalls

Number of bases called for the tag (“read length” for the tag)

class ngs_plumbing.xsq.XSQTagDetailsList(bind)
tags

Sequencing tags. F3 is typically the forward strand for single read sequencing.

ngs_plumbing.xsq.automagic_csfasta(filename, path_out='.', fragment=None, buf_size=2000000, out=sys.stdout, want_unassigned=False)

Extract colour-space data into CSFASTA (+ QUAL) files

ngs_plumbing.xsq.automagic_fastq(filename, path_out='.', fragment=None, buf_size=2000000, extension='fq', out=sys.stdout, cs=False, want_unassigned=False)

Extract sequence-space data into a FASTQ file

ngs_plumbing.xsq.fastqual(xsqlib, fragment, what, sample=0.05)

Perform a FASTQUAL-type QC.

  • xsqlib: the XSQ library
  • fragment: name of the fragment
  • what: colour-space or base-space data
  • sample: proportion to sample in order to perform the QC; 1 is 100%
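To make the sample parameter concrete, the sketch below shows one way a proportion of reads could be drawn and summarised as a mean quality per read position. Both function names are hypothetical, and this is an assumption about the general idea only; it is not taken from the fastqual() implementation.

```python
import random


def sample_reads(reads, sample=0.05, rng=None):
    """Yield each read with probability `sample` (1 keeps every read)."""
    rng = rng or random.Random()
    for read in reads:
        # random() is in [0, 1), so sample=1 always passes the test.
        if rng.random() < sample:
            yield read


def mean_quality_per_position(quality_arrays):
    """Average quality values position by position across reads."""
    totals, counts = [], []
    for quals in quality_arrays:
        for i, value in enumerate(quals):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += value
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Made-up quality arrays, for illustration only.
all_quals = [[10, 20], [30, 40]]
kept = list(sample_reads(all_quals, sample=1))
print(mean_quality_per_position(kept))
```

Sampling a small proportion trades precision of the summary for speed, which matters on files holding many millions of reads.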

ngs_plumbing.xsq.iter_csfasta_reads(group, fragment, numbase)

Iterator over reads in base space

  • group: HDF5 group
  • fragment: fragment name
  • numbase: number of bases (read length)
ngs_plumbing.xsq.iter_fastq_reads(group, fragment, numbase)

Iterator over reads in base space

  • group: HDF5 group
  • fragment: fragment name
  • numbase: number of bases (read length)
ngs_plumbing.xsq.iter_reads(group, fragment, what='BaseCallQV')

Iterator over the fragment ‘fragment’ for the data ‘what’

ngs_plumbing.xsq.iter_valuearrays(group, fragment, what='BaseCallQV')

Iterator over the fragment ‘fragment’ for the data ‘what’

ngs_plumbing.xsq.list_xsq(path)

List XSQ (.xsq) files in a given directory
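For readers who want the same behaviour without the package, a rough equivalent can be sketched with the standard library. The name iter_xsq_files is hypothetical, and the sketch only assumes that the result is consumed as an iterator of file paths; it is not the library's implementation.

```python
import os
import tempfile


def iter_xsq_files(path):
    """Yield the paths of the .xsq files found directly in `path`."""
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if name.lower().endswith('.xsq') and os.path.isfile(full):
            yield full


# Demonstration on a throwaway directory with one matching file.
with tempfile.TemporaryDirectory() as tmpdir:
    for filename in ('run1.xsq', 'notes.txt'):
        open(os.path.join(tmpdir, filename), 'w').close()
    found = [os.path.basename(p) for p in iter_xsq_files(tmpdir)]

print(found)
```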

ngs_plumbing.xsq.make_csfasta(lib, fragment, buf_cs, buf_qual, f, flowcell, numbase=None, progress_mark=50000, out=sys.stdout)
ngs_plumbing.xsq.make_csfastq(lib, fragment, buf, f, flowcell, numbase=None, progress_mark=50000, out=sys.stdout)
ngs_plumbing.xsq.make_fastq(lib, fragment, buf, f, flowcell, numbase=None, progress_mark=50000, out=sys.stdout)
ngs_plumbing.xsq.make_htmlreport(xsqfile, directory='.', verbose=True, sample_percent=0.05, data_type='json')
ngs_plumbing.xsq.seqid_template(f, flowcell, tile)
ngs_plumbing.xsq.seqid_template_csfasta(flowcell, fragment)
ngs_plumbing.xsq.tile_valuearray(tile, fragment, what)

Get the array of values for a tile.
