The extended OBITools fasta and fastq format
The extended OBITools Fasta format is a strict fasta format: any program that reads fasta files can read a file in extended OBITools Fasta format.
The only difference between standard and extended fasta is the structure of the title line. For OBITools, the title line is divided into three parts:
- Seqid : the sequence identifier
- key=value; : a set of key/value pairs
- the sequence definition
>my_sequence taxid=3456; direct=True; sample=A354; this is my pretty sequence
ACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGT
GTGCTGACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTACGTTGCAGTGTTT
AACGACGTTGCAGTACGTTGCAGT
Following these rules, the title line above can be parsed as follows:
- The sequence identifier of this sequence is my_sequence
- Three keys are assigned to this sequence:
  - Key taxid with value 3456
  - Key direct with value True
  - Key sample with value A354
- The definition of this sequence is: this is my pretty sequence
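The three-part split described above can be sketched in a few lines of Python. This is only an illustration, not the parser OBITools itself uses: the function name `split_title_line` and the regular expression are assumptions made for this example, and the sketch assumes that values never contain a `;` character.

```python
import re

# Hypothetical pattern (not taken from OBITools): a key, an equals sign,
# then everything up to the next semicolon.
_ATTRIBUTE = re.compile(r"\s*([A-Za-z_][\w.\-]*)\s*=\s*([^;]*);")


def split_title_line(line):
    """Split a title line into (seqid, attributes, definition).

    Attribute values are returned as raw, uninterpreted strings.
    """
    parts = line.lstrip(">").strip().split(None, 1)
    if not parts:
        raise ValueError("empty title line")
    seqid = parts[0]
    rest = parts[1] if len(parts) > 1 else ""

    attributes = {}
    pos = 0
    while True:
        match = _ATTRIBUTE.match(rest, pos)
        if match is None:
            break  # no further key=value; pair: the rest is the definition
        attributes[match.group(1)] = match.group(2).strip()
        pos = match.end()

    definition = rest[pos:].strip()
    return seqid, attributes, definition


title = ">my_sequence taxid=3456; direct=True; sample=A354; this is my pretty sequence"
seqid, attributes, definition = split_title_line(title)
print(seqid)
print(attributes)
print(definition)
```

At this stage every value is still a string; turning `3456` into an integer or `True` into a boolean is a separate step.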
Values can be any valid Python expression. If a value cannot be evaluated as a Python expression, it is then treated as a plain string. Following this rule, the taxid value is interpreted as an integer and the direct value as a boolean, while the sample value is not a valid Python expression and is therefore kept as a string.
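The fallback rule can be illustrated with a short Python sketch. It uses `ast.literal_eval` from the standard library as a deliberately narrower stand-in for full expression evaluation: it accepts only literals (numbers, strings, booleans, lists, dictionaries and so on), so it does not reproduce how OBITools evaluates values internally. The function name `interpret_value` is hypothetical.

```python
import ast


def interpret_value(raw):
    """Interpret a raw attribute value, falling back to the plain string.

    Assumption: ast.literal_eval stands in for evaluating a Python
    expression; it only understands literals, not arbitrary expressions.
    """
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        # Not a Python literal (for example a bare word): keep the text as is.
        return raw


for raw in ("3456", "True", "A354"):
    value = interpret_value(raw)
    print(raw, type(value).__name__)
```

With the example title line, `3456` becomes an `int`, `True` becomes a `bool`, and `A354` stays a `str` because a bare name is not a literal.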
Names reserved for attributes
The following attribute names are created by some OBITools programs and used by others, so they have a special meaning. We recommend not using them with different semantics.
Contents:
- ali_dir
- ali_length
- avg_quality
- best_match
- best_identity
- class
- cluster
- complemented
- count
- cut
- direction
- distance
- error
- experiment
- family
- family_name
- forward_error
- forward_match
- forward_primer
- forward_score
- forward_tag
- forward_tm
- genus
- genus_name
- head_quality
- id_status
- merged_*
- merged
- mid_quality
- mode
- obiclean_cluster
- obiclean_headcount
- obiclean_internalcount
- obiclean_samplecount
- obiclean_singletoncount
- obiclean_status
- occurrence
- order
- order_name
- pairend_limit
- partial
- rank
- reverse_error
- reverse_match
- reverse_primer
- reverse_score
- reverse_tag
- reverse_tm
- sample
- scientific_name
- score
- score_norm
- select
- seq_ab_match
- seq_a_single
- seq_a_mismatch
- seq_a_deletion
- seq_a_insertion
- seq_b_single
- seq_b_mismatch
- seq_b_deletion
- seq_b_insertion
- seq_length
- seq_length_ori
- seq_rank
- sminL
- sminR
- species
- species_list
- species_name
- status
- strand
- taxid