Sampling

The DNASnout client uses sampling when going through input data.

This might be of general interest, so we have made it easily accessible. We are using a reservoir sampling strategy, which has the following interesting properties:

  • data are iterated through, and there is no need to keep all data in memory
  • at any point during the iteration we have a (random) sample of the data so far
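
For readers curious about the mechanics, here is a minimal sketch of the classic reservoir algorithm (Algorithm R); it is illustrative only, not the implementation used in this package:

import random

def reservoir_sample(stream, k):
    # hold at most k items in memory; after seeing n items, `sample`
    # is a uniform random sample of size min(n, k)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # fill the reservoir with the first k items
            sample.append(item)
        else:
            # keep item i with probability k / (i + 1),
            # replacing a uniformly chosen entry
            j = random.randint(0, i)
            if j < k:
                sample[j] = item
    return sample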

Random sampling

Note

This module originally appeared in dnasnout_client.

In the context of NGS data, reservoir sampling is particularly convenient because it lets one iterate through the data while having, at any point in time, a random sample of the data iterated over so far.

# open a FASTQ file
from ngs_plumbing import parsing
filename = "mysequences.fastq"
seq_data = parsing.open(filename)

# create a ReservoirSample for (up to) 1,000 entries
from ngs_plumbing import sampling
nreads = 1000
spl = sampling.ReservoirSample(nreads)
# iterate through the entries (sequence data + quality) in the FASTQ file
for read in seq_data.iter_seqqual():
    spl.add(read)
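
After the loop, spl holds a uniform random sample of up to nreads reads, no matter how large the FASTQ file is.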

We have also added a variation of reservoir sampling that uses a Bloom filter: an item is considered for the reservoir only if it is not already present in the sample and has not been seen before (both conditions being tested with a Bloom filter).

This sampling class should be used whenever diversity-based sampling is sought and the data contain highly duplicated entries.

# open a FASTQ file
from ngs_plumbing import parsing
filename = "mysequences.fastq"
seq_data = parsing.open(filename)

# create a ReservoirSample for (up to) 1,000 entries
from ngs_plumbing import sampling
nreads = 1000
spl = sampling.ReservoirSampleBloom(nreads)
# iterate through the entries (sequence data + quality) in the FASTQ file
for read in seq_data.iter_seqqual():
    spl.add(read)
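
To make the gating explicit, here is a minimal sketch of the idea, with a plain set standing in for the Bloom filter; the class name is hypothetical and this is not the actual implementation of ReservoirSampleBloom:

import random

class BloomGatedReservoir:
    # Sketch only: a reservoir that skips items whose key was already
    # seen. A set stands in for the Bloom filter here; a real Bloom
    # filter uses constant memory at the cost of occasional false
    # positives (some never-seen items treated as duplicates).
    def __init__(self, size, itemkey=lambda item: item):
        self._size = size
        self._itemkey = itemkey
        self._seen = set()
        self._sample = []
        self._count = 0

    def add(self, item):
        key = self._itemkey(item)
        if key in self._seen:
            # duplicates never compete for a slot in the reservoir
            return
        self._seen.add(key)
        if self._count < self._size:
            self._sample.append(item)
        else:
            # standard reservoir step over the de-duplicated stream
            j = random.randint(0, self._count)
            if j < self._size:
                self._sample[j] = item
        self._count += 1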

Normalized random sampling

The constructor for ReservoirSampleBloom accepts a named parameter itemkey, which can be any function taking an item of the kind one wishes to sample and returning a string. This is designed to let one easily express custom approaches to filtering, or use arbitrary data structures.

For example, the default itemkey is designed to work with the parsing features in this package, but it is trivial to make it work with Biopython sequences.

def biopython_sequence(item):
    # a Biopython SeqRecord stores its sequence as a Seq object;
    # convert it to the string expected by the Bloom filter
    return str(item.seq)

spl = sampling.ReservoirSampleBloom(nreads,
                                    itemkey=biopython_sequence)
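
For completeness, here is a sketch of how this could plug into Biopython's FASTQ parser (SeqIO.parse is standard Biopython; the rest mirrors the examples above):

from Bio import SeqIO
from ngs_plumbing import sampling

spl = sampling.ReservoirSampleBloom(1000,
                                    itemkey=biopython_sequence)
# SeqIO yields SeqRecord objects, which biopython_sequence
# turns into string keys for the Bloom filter
for record in SeqIO.parse("mysequences.fastq", "fastq"):
    spl.add(record)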

Another example is when only the first 50 bases should be considered for the Bloom filter (one reason would be that we do not want highly abundant reads to fill up the sample just because high sequence variability in the last part of the read makes them look distinct).

def clipatfifty(item):
    # use only the first 50 bases as the key for the Bloom filter
    return item.sequence[:50]

spl = sampling.ReservoirSampleBloom(nreads,
                                    itemkey=clipatfifty)

Note

There is an obvious link with the general idea of using streams of entries in pipelines (see Parsing). Sampling is a way to reduce the amount of sequencing data to work with, and here we have a way to build the sample without having to hold all data in memory.
