`obiannotate`: adds/edits sequence record annotations

obiannotate adds, modifies, or removes annotation attributes attached to sequence records.

Once such attributes are added, the other OBITools commands can use them for filtering or for computing statistics.

Example 1:

> obiannotate -S short:'len(sequence)<100' seq1.fasta > seq2.fasta
The above command adds an attribute named short, whose boolean value indicates whether the sequence is shorter than 100 bp.
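
To make the expression concrete, here is a minimal Python sketch of what `len(sequence)<100` evaluates to. It assumes a record can be modeled as a plain string; the `records` dictionary and its contents are hypothetical and do not reflect how OBITools stores records internally.

```python
# Hypothetical records: identifier -> nucleotide string (illustration only).
records = {
    "seq_a": "acgt" * 10,   # 40 bp
    "seq_b": "acgt" * 30,   # 120 bp
}

# Same test as the -S expression: True when the sequence is shorter than 100 bp.
short = {name: len(sequence) < 100 for name, sequence in records.items()}
print(short)
```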

Example 2:

> obiannotate --seq-rank seq1.fasta | \
  obiannotate -C --set-identifier '"FungA_%05d" % seq_rank' \
  > seq2.fasta
The above command first adds a new attribute whose value is the sequence record entry number in the file. It then clears all the sequence record attributes and sets the identifier to a string beginning with FungA_ followed by the sequence entry number, zero-padded to 5 digits.
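
The identifier expression is ordinary Python printf-style formatting. The sketch below shows what it yields for a few entry numbers; the `seq_rank` values are made up for illustration, and the number at which --seq-rank starts counting is not assumed here.

```python
# Arbitrary entry numbers, standing in for the seq_rank attribute.
ranks = (1, 42, 12345)

# %05d pads with leading zeros to at least five digits; longer numbers are not truncated.
identifiers = ["FungA_%05d" % seq_rank for seq_rank in ranks]
for identifier in identifiers:
    print(identifier)
```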

Example 3:

> obiannotate -d my_ecopcr_database \
  --with-taxon-at-rank=genus seq1.fasta > seq2.fasta
The above command adds taxonomic information at the genus rank to the sequence records.

Example 4:

> obiannotate -S 'new_seq:str(sequence).replace("a","t")' \
  seq1.fasta | obiannotate --set-sequence new_seq > seq2.fasta
The overall aim of the above command is to edit the sequence itself by replacing every nucleotide a with a nucleotide t. First, a new attribute named new_seq is created, which contains the modified sequence. Then the former sequence is replaced by the modified one.
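
The replacement step can be tried in plain Python. This sketch assumes that `str(sequence)` yields the nucleotide string, here modeled by an ordinary string; the example sequence is made up.

```python
# A made-up lower-case sequence standing in for the record.
sequence = "gattaca"

# Same expression as in the -S option: every "a" becomes "t".
new_seq = str(sequence).replace("a", "t")
print(new_seq)

# str.replace is case sensitive, so an upper-case sequence is left untouched.
unchanged = "GATTACA".replace("a", "t")
```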

Sequence record editing options

--seq-rank: Adds a new attribute named seq_rank to the sequence record indicating its entry number in the sequence file.

-R <OLD_NAME>:<NEW_NAME>, --rename-tag=<OLD_NAME>:<NEW_NAME>: Changes attribute name <OLD_NAME> to <NEW_NAME>. When the attribute named <OLD_NAME> is missing, the sequence record is skipped and the next one is examined.

--delete-tag=<KEY>: Deletes the attribute named <KEY>. When this attribute is missing, the sequence record is skipped and the next one is examined.

-S <KEY>:<PYTHON_EXPRESSION>, --set-tag=<KEY>:<PYTHON_EXPRESSION>: Creates a new attribute with the key <KEY> and a value computed from <PYTHON_EXPRESSION>.

--tag-list=<FILENAME>: <FILENAME> points to a file containing attribute names and values to modify for specified sequence records.

--set-identifier=<PYTHON_EXPRESSION>: Sets the sequence record identifier to a value computed from <PYTHON_EXPRESSION>.

--run=<PYTHON_EXPRESSION>: Runs a Python expression on each selected sequence.

--set-sequence=<PYTHON_EXPRESSION>: Changes the sequence itself to a value computed from <PYTHON_EXPRESSION>.

-T, --set-definition=<PYTHON_EXPRESSION>: Sets the sequence definition to a value computed from <PYTHON_EXPRESSION>.

-O, --only-valid-python: Allows only valid Python expressions.

-C, --clear: Clears all attributes associated with the sequence records.

-k <KEY>, --keep=<KEY>: Keeps only the attribute with key <KEY>. Several -k options can be combined.

--length: Adds an attribute with seq_length as a key and the sequence length as a value.

--with-taxon-at-rank=<RANK_NAME>: Adds taxonomic annotation at taxonomic rank <RANK_NAME>.

-m <MCLFILE>, --mcl=<MCLFILE>: Creates a new attribute containing the number of the cluster the sequence record was assigned to, as indicated in file <MCLFILE>.

--uniq-id: Forces sequence record ids to be unique.
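
The effect of -k and --length on a record's attributes can be pictured with a small Python sketch. It assumes attributes behave like a dictionary; the function names `keep_only` and `add_length` and the sample values are hypothetical and are not part of OBITools.

```python
def keep_only(attributes, keys):
    """Mimic one or more -k options: drop every attribute not listed in keys."""
    return {k: v for k, v in attributes.items() if k in keys}


def add_length(attributes, sequence):
    """Mimic --length: store the sequence length under the seq_length key."""
    updated = dict(attributes)
    updated["seq_length"] = len(sequence)
    return updated


# Made-up attributes for one record.
attributes = {"count": 12, "taxid": 9606, "sample": "A1"}

kept = keep_only(attributes, {"count", "taxid"})   # as with -k count -k taxid
with_length = add_length({}, "acgtacgt")           # as with --length
print(kept, with_length)
```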

Sequence record selection options

-s <REGULAR_PATTERN>, --sequence=<REGULAR_PATTERN>

Regular expression pattern to be tested against the sequence itself. The pattern is case insensitive.

Examples:

> obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
Selects only the sequence records that contain an EcoRI restriction site.
> obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
Selects only the sequence records that contain a stretch of at least 10 A.
> obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
Selects only the sequence records that do not contain ambiguous nucleotides.
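
The three patterns above can be checked with Python's `re` module. This is a sketch under two assumptions: the pattern may match anywhere in the sequence (`re.search`), and case insensitivity is modeled with `re.IGNORECASE`. The test sequences are made up.

```python
import re

# Made-up sequences, in lower case as OBITools prints them by default.
sequences = ["ttgaattcaa", "a" * 12 + "cg", "acgtnacgt"]

patterns = {
    "EcoRI site": r"GAATTC",
    "poly-A": r"A{10,}",
    "unambiguous": r"^[ACGT]+$",
}

selected = {}
for label, pattern in patterns.items():
    # IGNORECASE lets the upper-case pattern match lower-case sequences.
    regex = re.compile(pattern, re.IGNORECASE)
    selected[label] = [s for s in sequences if regex.search(s)]
    print(label, selected[label])
```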

-D <REGULAR_PATTERN>, --definition=<REGULAR_PATTERN>

Regular expression pattern to be tested against the definition of the sequence record. The pattern is case sensitive.

Example:

> obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta
Selects only the sequence records whose definition contains chloroplast or Chloroplast.
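
Because the pattern is case sensitive, the character class `[Cc]` is what allows both spellings. The following Python sketch illustrates this with made-up definition lines; it assumes the pattern may match anywhere in the definition.

```python
import re

# Made-up definition lines.
definitions = [
    "Zea mays chloroplast, complete genome",
    "Chloroplast rbcL gene",
    "CHLOROPLAST trnL intron",
    "mitochondrion COI",
]

# No IGNORECASE flag: only "chloroplast" and "Chloroplast" match.
pattern = re.compile(r"[Cc]hloroplast")
matching = [d for d in definitions if pattern.search(d)]
print(matching)
```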

-I <REGULAR_PATTERN>, --identifier=<REGULAR_PATTERN>

Regular expression pattern to be tested against the identifier of the sequence record. The pattern is case sensitive.

Example:

> obigrep -I '^GH' seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier begins with GH.

--id-list=<FILENAME>

<FILENAME> points to a text file containing the list of sequence record identifiers to be selected. The file contains a single identifier per line.

Example:

> obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier is present in the my_id_list.txt file.
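
As a rough Python sketch of this selection, the identifier file can be read into a set and each record identifier tested for membership. The file content is simulated with `io.StringIO`, and the identifiers and sequences are made up.

```python
import io

# Simulated my_id_list.txt: one identifier per line.
id_list_file = io.StringIO("GH0001\nGH0003\n")
wanted = {line.strip() for line in id_list_file if line.strip()}

# Made-up records as (identifier, sequence) pairs.
records = [("GH0001", "acgt"), ("GH0002", "ttga"), ("GH0003", "ccat")]

# Keep the records whose identifier appears in the list.
selected = [(ident, seq) for ident, seq in records if ident in wanted]
print(selected)
```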

-a <KEY>:<REGULAR_PATTERN>, --attribute=<KEY>:<REGULAR_PATTERN>

Regular expression pattern matched against an attribute of the sequence record. The option value has the form key:regular_pattern. The pattern is case sensitive. Several -a options can be used on the same command line; in that case, the selected sequence records match all constraints.

Example:

> obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta
Selects the sequence records containing an attribute whose key is family_name and value is Asteraceae.
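
The "match all constraints" rule for several -a options can be sketched in Python as follows. The helper `matches_all`, the second constraint, and the sample attributes are hypothetical; the sketch also assumes that a record lacking the key does not match.

```python
import re


def matches_all(attributes, constraints):
    """Return True when every key:pattern constraint matches the record."""
    for constraint in constraints:
        # Split on the first colon only, so the pattern may contain colons.
        key, pattern = constraint.split(":", 1)
        value = attributes.get(key)
        if value is None or not re.search(pattern, str(value)):
            return False
    return True


constraints = ["family_name:Asteraceae", "genus_name:^Taraxacum$"]

both = matches_all({"family_name": "Asteraceae", "genus_name": "Taraxacum"}, constraints)
one = matches_all({"family_name": "Asteraceae", "genus_name": "Bellis"}, constraints)
missing = matches_all({"genus_name": "Taraxacum"}, constraints)
print(both, one, missing)
```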

-A <KEY>, --has-attribute=<KEY>

Selects sequence records having an attribute whose key is <KEY>.

Example:

> obigrep -A taxid seq1.fasta > seq2.fasta
Selects only the sequence records having a taxid attribute defined.

-p <PYTHON_EXPRESSION>, --predicat=<PYTHON_EXPRESSION>

Python boolean expression to be evaluated for each sequence record. The attribute keys defined for each sequence record can be used in the expression as variable names. An extra variable named ‘sequence’ refers to the sequence record itself. Several -p options can be used on the same command line; in that case, the selected sequence records match all constraints.

Example:

>  obigrep -p '(forward_error<2) and (reverse_error<2)' \
   seq1.fasta > seq2.fasta
Selects only the sequence records whose forward_error and reverse_error attributes have a value smaller than two.
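
One way to picture the evaluation is Python's built-in `eval` with the record attributes exposed as variables. This is only a sketch of the idea, not the OBITools implementation; the helper `keep` and the sample records are hypothetical.

```python
def keep(attributes, sequence, predicates):
    """Evaluate every predicate with attribute keys as variable names."""
    # The attributes become variables; "sequence" is added as an extra name.
    namespace = dict(attributes, sequence=sequence)
    # A missing attribute raises NameError in this sketch.
    return all(eval(predicate, {}, namespace) for predicate in predicates)


predicates = ["(forward_error<2) and (reverse_error<2)"]

good = keep({"forward_error": 0, "reverse_error": 1}, "acgt", predicates)
bad = keep({"forward_error": 2, "reverse_error": 0}, "acgt", predicates)
print(good, bad)
```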

-L <##>, --lmax=<##>

Keeps sequence records whose sequence length is shorter than or equal to lmax.

Example:

> obigrep -L 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length shorter than or equal to 100 bp.

-l <##>, --lmin=<##>

Selects sequence records whose sequence length is longer than or equal to lmin.

Example:

> obigrep -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length longer than or equal to 100 bp.

-v, --inverse-match

Inverts the sequence record selection.

Example:

> obigrep -v -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length shorter than 100 bp.
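
The interplay of the length bounds with -v can be sketched in Python. The function `select` and its parameter names are hypothetical; both bounds are treated as inclusive, as described above.

```python
def select(sequences, lmin=None, lmax=None, inverse=False):
    """Keep sequences within the inclusive bounds, or the others if inverse."""
    kept = []
    for seq in sequences:
        in_bounds = (lmin is None or len(seq) >= lmin) and (
            lmax is None or len(seq) <= lmax
        )
        # With inverse=True the selection is flipped, as with -v.
        if in_bounds != inverse:
            kept.append(seq)
    return kept


# Made-up sequences of 99, 100 and 101 bp.
sequences = ["a" * 99, "a" * 100, "a" * 101]

at_least_100 = [len(s) for s in select(sequences, lmin=100)]
at_most_100 = [len(s) for s in select(sequences, lmax=100)]
below_100 = [len(s) for s in select(sequences, lmin=100, inverse=True)]
print(at_least_100, at_most_100, below_100)
```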

Options to specify input format

Restrict the analysis to a sub-part of the input file

--skip <N>: The first N sequence records of the file are discarded from the analysis and not reported to the output file.

--only <N>: Only the next N sequence records of the file are analyzed. The following sequences in the file are neither analyzed nor reported to the output file. This option can be used together with the --skip option.
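
Used together, the two options select a window of records. The Python sketch below models this with `itertools.islice`, assuming that --only counts records after the skipped ones; the record names and the values of `skip` and `only` are made up.

```python
from itertools import islice

# Ten made-up records, rec1 to rec10.
records = ["rec%d" % i for i in range(1, 11)]

skip, only = 3, 4  # as with --skip 3 --only 4

# Discard the first `skip` records, then take the next `only` records.
window = list(islice(records, skip, skip + only))
print(window)
```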

Sequence annotated format

--genbank: Input file is in genbank format.

--embl: Input file is in embl format.

Specifying the sequence type

--nuc: Input file contains nucleic sequences.

--prot: Input file contains protein sequences.

Options to specify output format

Standard output format

--fasta-output: Output sequences in OBITools fasta format.

--fastq-output: Output sequences in Sanger fastq format.

Generating an ecoPCR database

--ecopcrdb-output=<PREFIX_FILENAME>: Creates an ecoPCR database from sequence records results.

Miscellaneous option

--uppercase: Print sequences in upper case (default is lower case).

Common options

-h, --help: Shows this help message and exits.

--DEBUG: Sets logging in debug mode.

`obiannotate` added sequence attributes

seq_length

seq_rank

cluster

scientific_name

taxid

rank

family

family_name

genus

genus_name

order

order_name

species

species_name
