twobitreader
Licensed under Perl Artistic License 2.0. No warranty is provided, express or implied.
Python-level reader for .2bit files (i.e., from the UCSC Genome Browser). Note: no writing support.
TwoBitFile inherits from dict. You may access sequences by name, e.g.
>>> genome = TwoBitFile('hg18.2bit')
>>> chr20 = genome['chr20']
Sequences are returned as TwoBitSequence objects. You may access intervals by slicing, or use str() to dump the entire entry, e.g.
>>> chr20[100100:100200]
'ttttcctctaagataatttttgccttaaatactattttgttcaatactaagaagtaagataacttccttttgttggtatttgcatgttaagtttttttcc'
>>> whole_chr20 = str(chr20)
Fair warning: dumping the entire chromosome requires a lot of memory
See TwoBitSequence for more info
A TwoBitSequence object refers to an entry in a TwoBitFile
You may access intervals by slicing, or use str() to dump the entire entry, e.g.
>>> genome = TwoBitFile('hg18.2bit')
>>> chr20 = genome['chr20']
>>> chr20[100100:100200] # slicing returns a string
'ttttcctctaagataatttttgccttaaatactattttgttcaatactaagaagtaagataacttccttttgttggtatttgcatgttaagtttttttcc'
>>> whole_chr20 = str(chr20) # get whole chr as string
Fair warning: dumping the entire chromosome requires a lot of memory
Note that we follow Python/UCSC conventions: coordinates are 0-based and end-open. (Note: the UCSC web-based genome browser uses 1-based, closed coordinates.) If you attempt to access a slice past the end of the sequence, it will be truncated at the end.
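The coordinate conventions can be illustrated without a .2bit file. The following is a minimal sketch that uses a plain Python string as a stand-in for a TwoBitSequence; the helper name browser_to_slice is hypothetical and is not part of twobitreader.

```python
def browser_to_slice(start, end):
    """Convert a 1-based, closed interval (as shown in the UCSC web browser)
    to a 0-based, end-open slice (as used by Python and .2bit coordinates)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)

# A plain string stands in for a sequence; this is an assumption for illustration.
seq = "acgtacgtac"  # 10 bases

# Browser positions 3..6 (1-based, closed) map to the slice [2:6]
region = seq[browser_to_slice(3, 6)]

# A slice that runs past the end is truncated at the end
tail = seq[8:100]
```

Only the start coordinate shifts by one; the end stays the same because the closed 1-based end and the open 0-based end name the same boundary.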
Your computer probably doesn't have enough memory to load a whole genome, but if you want to string-ize your TwoBitFile, here's a recipe:
    x = TwoBitFile('my.2bit')
    d = dict(x)
    for k, v in d.items(): d[k] = str(v)
Provided for user convenience: converts a nucleotide to its bit representation.
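As a rough illustration of such a conversion, the sketch below uses the 2-bit codes from the UCSC .2bit format (T=00, C=01, A=10, G=11). The names nucleotide_to_bits and pack_four, and the choice to return the code as a two-character string, are assumptions made for illustration; the module's own helper may differ.

```python
# 2-bit codes per the UCSC .2bit format: T=00, C=01, A=10, G=11
_BITS = {"T": "00", "C": "01", "A": "10", "G": "11"}

def nucleotide_to_bits(base):
    """Return the two-bit code for one nucleotide as a string (hypothetical helper)."""
    try:
        return _BITS[base.upper()]
    except KeyError:
        raise ValueError("not a nucleotide with a 2-bit code: %r" % (base,))

def pack_four(bases):
    """Pack exactly four nucleotides into one byte value, first base in the high bits."""
    if len(bases) != 4:
        raise ValueError("expected exactly four bases")
    return int("".join(nucleotide_to_bits(b) for b in bases), 2)
```

For example, pack_four("TCAG") gives the byte 00011011. Bases without a 2-bit code, such as N, raise ValueError in this sketch.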
cmdline_reader allows the twobitreader module to be executed as a script. It accepts only one argument, the .2bit filename. It reads input (BED format) from stdin, writes output (FASTA format) to stdout, and writes errors/warnings to stderr.
Regions should be given in BED format on stdin, one per line:
chrom start(0-based) end(0-based, not included)
To use a BED file of regions, do:
python -m twobitreader example.2bit < example.bed
Lines that are not valid regions are skipped, and a warning is issued through logging (which writes to stderr by default).
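The skip-and-warn behaviour described above can be sketched as follows. This is not the module's code: the function names parse_bed_line and iter_regions are hypothetical, and the exact validation rules (at least three whitespace-separated fields, integer coordinates, non-negative start, end not before start) are assumptions.

```python
import logging

def parse_bed_line(line):
    """Return (chrom, start, end) for a BED line, or None if it is not a region."""
    fields = line.split()
    if len(fields) < 3:
        return None
    chrom = fields[0]
    try:
        start, end = int(fields[1]), int(fields[2])
    except ValueError:
        return None
    if start < 0 or end < start:
        return None
    return chrom, start, end

def iter_regions(lines):
    """Yield valid regions; log a warning for each skipped line."""
    for line in lines:
        region = parse_bed_line(line)
        if region is None:
            # logging writes to stderr by default
            logging.warning("skipping non-region line: %r", line.rstrip("\n"))
            continue
        yield region
```

Because iter_regions accepts any iterable of lines, it can be fed sys.stdin, an open file, or a list of strings.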
Prints the twoBit file format specification I got from the Internet. This is only here for reference.
Split a 16-bit number into the integer representation of its coarse and fine parts in binary representation.
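One plausible reading of this split, shown here only as a sketch, takes the coarse part to be the high byte and the fine part to be the low byte. That interpretation is an assumption, and split_coarse_fine is a hypothetical name rather than the module's function.

```python
def split_coarse_fine(x):
    """Split a 16-bit unsigned integer into (coarse, fine).

    Assumption: coarse is the high 8 bits and fine is the low 8 bits.
    """
    if not 0 <= x <= 0xFFFF:
        raise ValueError("expected a 16-bit unsigned integer")
    # divmod by 256 is equivalent to (x >> 8, x & 0xFF) for non-negative x
    return divmod(x, 256)
```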
OS X uses an 8-byte long, so make sure L (long) is the right size (4 bytes, since .2bit fields are 32-bit) and switch to I (int) if needed.
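A check of this kind might look like the sketch below. Using the standard array module to measure the item size is an assumption (the struct module would work as well), and four_byte_typecode is a hypothetical helper name.

```python
from array import array

def four_byte_typecode():
    """Return an unsigned-integer array type code whose items are 4 bytes wide."""
    # Try L (long) first, then fall back to I (int)
    for code in ("L", "I"):
        if array(code).itemsize == 4:
            return code
    raise RuntimeError("no 4-byte unsigned integer type code found")

code = four_byte_typecode()

# 0x1A412743 is the .2bit signature value; it must occupy exactly 4 bytes
words = array(code, [0x1A412743])
```

On platforms where long is 8 bytes, this selects I; where long is 4 bytes, it keeps L.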
twobit_reader takes a twobit_file (of class TwoBitFile) and an "input_stream", which can be any iterable (incl. file-like objects). It writes output (FASTA format) using write (print if write=None) and logs errors/warnings to stderr.
Regions should be given in BED format, one per line of the input stream:
chrom start(0-based) end(0-based, not included)
To use a BED file of regions from the command line, do:
python -m twobitreader example.2bit < example.bed
Lines that are not valid regions are skipped, and a warning is issued through logging (which writes to stderr by default).
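The role of the write argument can be sketched with a stand-in for the TwoBitFile. In the example below, a plain dict of strings replaces the real file object, regions_to_fasta is a hypothetical name, and the FASTA header layout (chrom:start-end) is an assumption rather than the module's documented output.

```python
import logging

def regions_to_fasta(sequences, input_stream, write=None):
    """Write one FASTA record per valid BED line in input_stream.

    sequences: mapping of name -> sliceable sequence (stand-in for a TwoBitFile)
    write: callable taking one string; falls back to print when None
    """
    if write is None:
        write = print
    for line in input_stream:
        fields = line.split()
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
            seq = sequences[chrom]
        except (IndexError, ValueError, KeyError):
            # Too few fields, non-integer coordinates, or unknown sequence name
            logging.warning("skipping line: %r", line.rstrip("\n"))
            continue
        write(">%s:%d-%d" % (chrom, start, end))
        write(str(seq[start:end]))

# Collect output in a list instead of printing it
out = []
regions_to_fasta({"chr1": "acgtacgtac"},
                 ["chr1\t2\t6\n", "bogus\n", "chrX\t0\t4\n"],
                 write=out.append)
```

Passing a callable such as out.append or a file object's write method keeps the function independent of where the FASTA text ends up.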