.. Created by phyles-quickstart. Add some items to the toctree. tfasta ====== Introduction ------------ The tfasta python package simplifies working with fasta, providing functionality for both reading and writing fasta files. The "t" in "tfasta" represents "templated", which means that fasta parsing is performed according to pre-defined or user-defined templates:: >>> from tfasta import fasta_parser, T_NR >>> fast = fasta_parser("cytb.fas", template=T_NR) >>> f = fast.next() >>> print f['gi'] 5524211 >>> print f['accession'] AAD44166.1 >>> print f['sequence'][:60] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFS This example parses records that follow the conventions of the NCBI non-redundant database (nr). More examples are given below. Installation ------------ Install tfasta with `pip`_ (recommended) or `easy_install`_:: sudo pip install tfasta Optionally, download the source files from http://pypi.python.org/pypi/tfasta/ and run the following commands in the source directory:: python setup.py build sudo python setup.py install Home Page & Repository ---------------------- - Home Page: http://pypi.python.org/pypi/tfasta/ - Documentation: http://pythonhosted.org/radialx/ - Repository: https://github.com/jcstroud/tfasta/ Usage ----- Reading Fasta Files ~~~~~~~~~~~~~~~~~~~ Reading fasta files is performed with the *fasta_parser()* function. The following text is the first 2 records of the file "``short-extended.fas``":: >gi|32033604|ref|ZP_00133915.1| ATGQVIGTFTVRNDNGLHARPSAVLVQTLKPFAAKVTVENLDRGTAPANAKSTMKVVALG ASQAHRLRFVAEGEDAQQAIEALAKAFVEGLGESVSFVPAVEDTIEGAAQPQAVESAKNF ANPTASEPTVEGQVEGTFVIQNEHGLHARPSAVLVNEVKKYNATIVVQNLDRNTQLVSAK SLMKIVALGVVKGHRLHFVATGDDAQKAIDGIGEAIAAGLGE >gi|1573424|gb|AAC22107.1| VEGAVVGTFTIRNEHGLHARPSANLVNEVKKFTSKITMQNLTRESEVVSAKSLMKIVALG VTQGHRLRFVAEGEDAKQAIESLGKAIANGLGENVSAVPPSEPDTIEIMGDQIHTPAVTE DDNLPANAIEAVFVIKNEQGLHARPSAILVNEVKKYNASVAVQNLDRNSQLVSAKSLMKI VALGVVKGTRLRFVATGEEAQQAIDGIGAVIESGLGE Like any other fasta file, ``short-extended.fas`` may be parsed with a single command:: fast = fasta_parser(file_name) For example:: >>> from tfasta import fasta_parser >>> fast = fasta_parser("short-extended.fas") >>> f = fast.next() >>> print f['name'] gi|32033604|ref|ZP_00133915.1| >>> print f['sequence'][:60] ATGQVIGTFTVRNDNGLHARPSAVLVQTLKPFAAKVTVENLDRGTAPANAKSTMKVVALG >>> f = fast.next() >>> print f['name'] gi|1573424|gb|AAC22107.1| In this example, the *fasta_parser()* function returns an iterator of dictionaries ("``fast``") with two keys: ``name`` and ``sequence``. The ``name`` key corresponds to all of the plain text after the fasta format marker "``>``" that marks a new sequence. Iteration ~~~~~~~~~ The iterator returned by the *fasta_parser()* function may serve in ``for`` loops:: >>> from tfasta import fasta_parser >>> for f in fasta_parser("short-extended.fas"): ... print f['name'] gi|32033604|ref|ZP_00133915.1| gi|1573424|gb|AAC22107.1| [...] Using Templates ~~~~~~~~~~~~~~~ tfasta will parse fasta files according to one of several templates that are written to the conventions of several common fasta file sources, like swissprot, pdb, nr (the non-redundant database), and nrblast. The latter example implicitly uses a the default template called "``T_DEF``", yielding dictionaries with the keys ``name`` and ``sequence``. The ``sequence`` key is universal to all templates. A complet list of templates (along with their keys) provided by tfasta are: - ``T_DEF`` - plain old fasta line - ``name`` - everything after the ">" - ``T_SWISS`` - fasta files from swissprot - ``gi_num`` - between first set of "|"s - ``accession`` - between 3rd and 4th "|" - ``description`` - after last "|" - ``T_PDB`` - the fasta file of the entire pdb - ``idCode`` - first four characters after ">" - ``chainID`` - any non-whitespace characters after first "_" - ``type`` - non-whitespace immediately following first ":" - ``numRes`` - numbers immediatly following first ":" - ``description`` - stripped characters after numRes - ``T_NR`` - the protein non-redundant database - ``gi`` - between first set of "|"s - ``accession`` - between 3rd and 4th "|" - ``description`` - stripped characters before brackets - ``source`` - stripped characters inside brackets - ``T_NRBLAST`` - fasta file produced from blast output of the nr - ``gi`` - between first set of "|"s - ``accession`` - between 3rd and 4th "|" An example using the ``T_NR`` template follows. The nr fasta database has records that look like this:: >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX IENY This file, named "``cytb.fas``", may be parsed using the tfasta ``T_NR`` template:: >>> from tfasta import fasta_parser >>> fast = fasta_parser("cytb.fas", template=T_NR) >>> f = fast.next() >>> print f['gi'] 5524211 >>> print f['accession'] AAD44166.1 >>> print f['description'] cytochrome b >>> print f['source'] Elephas maximus maximus >>> print f['sequence'][:60] LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFS Reading Small Fasta Files ~~~~~~~~~~~~~~~~~~~~~~~~~ Some fasta files may be several gigabytes in size. For this reason, *fasta_parser()* reads fasta files incrementally by default, such that only one sequence is read from the file at a time. It is possible to force tfasta to read entire fasta files at once by setting the ``greedy`` keyword to ``True``: fast = fasta_parser("cytb.fas", template=T_NR, greedy=True) Greedy reading of files is better suited for situations where the user does not want to keep the file fasta file open for reading. It is not recommended for large files. Preserving Gaps ~~~~~~~~~~~~~~~ Some sequences may have gaps ("-") such as those originating from sequence alignments. To preserve gaps in the sequence set the ``dogaps`` keyword to ``True``:: fast = fasta_parser("alignment.fas", dogaps=True) Reading Fasta Strings ~~~~~~~~~~~~~~~~~~~~~ Not all fasta will originate from text files. Often, it is necessary to parse a python ``str`` directly using *string_fasta_parser()*:: f = string_fasta_parser(astr) For example:: >>> from tfasta import string_fasta_parser >>> astr = """ > Abeta in a string DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA """ >>> fast = string_fasta_parser(astr) >>> f = fast.next() >>> print f['name'] Abeta in a string >>> print f['sequence'] DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA The *string_fasta_parser()* function takes all of the same arguments as *fasta_parser()* except for the ``greedy`` keyword argument. Creating Fasta ~~~~~~~~~~~~~~ Fasta formatted text can be created one at a time or in bunches with the functions *make_fasta()* and *make_fasta_from_dict()*. The following example creates fasta text from a name and unformatted sequence:: >>> from tfasta import make_fasta >>> name = "OVAX_CHICK GENE X PROTEIN N-Term" >>> seq = """ ... QIKDLLVSSSTDLDTTLVLVNAIYFKGMWK ... afnaedtrempfhvtkqeskpvqmmcmnnsfnvatlpae ... KMKILELPFASGDLSMLV ... """ >>> print make_fasta(name, seq, width=50) >OVAX_CHICK GENE X PROTEIN N-Term QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKAFNAEDTREMPFHVTKQESK PVQMMCMNNSFNVATLPAEKMKILELPFASGDLSMLV The default width of a *make_fasta()* formatted fasta sequence is 60. In this latter example, the width is changed to 50 using the ``width`` keyword. This next example creates fasta from an ordered dictionary (available at https://pypi.python.org/pypi/ordereddict/ or built in to Python 2.7+):: >>> from tfasta import make_fasta_from_dict >>> from collections import OrderedDict # python 2.7+ >>> od = OrderedDict({ "First 10": " 1 QIKDLLVSSS", ... "Second 10": "11 tdldttlvlv" }) print make_fasta_from_dict(od) >First 10 QIKDLLVSSS >Second 10 TDLDTTLVLV Using an ordered dictionary is not required, but it does ensure control over the order of the sequences in the fasta. Any well-behaved "mapping object", like the built-in ``dict`` will work. Note that *make_fasta()* and *make_fasta_from_dict()* both ignore all characters except letters and "-". Creating Templates ~~~~~~~~~~~~~~~~~~ The creation of templates uses python regular expressions to find fields in the first line of a sequence record within a fasta file (the line that begins with ">"). Each field must be "saved" by the regular expression by wrapping its sub-expression in parentheses. For example the ``T_NRBLAST`` template regular expression is:: ^>gi\|([^|]*)\|[^|]*\|([^|]*)\|.*$ ~~~~~ ~~~~~ Here, the sub-expressions underscored by ``~`` characters are saved by virtue of the surrounding parentheses. The keys by which these saved fields are referred in the dictionaries are given names, in the order that they occur in the regular expression:: regex = r'^>gi\|([^|]*)\|[^|]*\|([^|]*)\|.*$' fields = ("gi", "accession") The next step is to create a template using the *FastaTemplate* class:: t_nrblast = FastaTemplate(regex, fields) Here the template ``t_nrblast`` functions exactly as the tfasta provided template ``T_NRBLAST``. Finally, use the template to parse fasta:: fasta_parser(file_name, template=t_nrblast) The following examples shows templating put all together:: >>> from tfasta import fasta_parser, FastaTemplate >>> regex = r'^>gi\|([^|]*)\|[^|]*\|([^|]*)\|.*$' >>> fields = ("gi", "accession") >>> t_nrblast = FastaTemplate(regex, fields) >>> fast = fasta_parser("short-extended.fas", template=t_nrblast) >>> f = fast.next() >>> print f['accession'] ZP_00133915.1 >>> print f['sequence'][:60] ATGQVIGTFTVRNDNGLHARPSAVLVQTLKPFAAKVTVENLDRGTAPANAKSTMKVVALG .. _`pip`: http://www.pip-installer.org/en/latest/ .. _`easy_install`: http://peak.telecommunity.com/DevCenter/EasyInstall .. toctree:: :maxdepth: 2 :numbered: Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`