obigrep: filters sequence file
The obigrep command is in some ways analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file. A sequence record is a complex object composed of an identifier, a set of attributes (key=value), a definition, and the sequence itself.
Instead of working line by line as the standard Unix tool does, obigrep makes its selection sequence record by sequence record. A large set of options allows refining the selection on any element of the sequence record.
Moreover, obigrep allows several conditions to be specified simultaneously (each evaluating to TRUE or FALSE), and only the sequence records that fulfill all the conditions (all conditions are TRUE) are selected.
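The "all conditions must be TRUE" rule can be illustrated with a minimal Python sketch. The `Record` class, the `select` function, and the sample data are hypothetical names introduced for illustration only; they are not the OBITools implementation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical stand-in for a sequence record (not the OBITools class)."""
    identifier: str
    sequence: str
    definition: str = ""
    attributes: dict = field(default_factory=dict)


def select(records, conditions):
    """Yield only the records for which every condition returns True."""
    for record in records:
        if all(condition(record) for condition in conditions):
            yield record


records = [
    Record("GH001", "ACGTGAATTCAA", "chloroplast trnL", {"taxid": 3398}),
    Record("GH002", "ACGTACGT", "nuclear ITS"),
    Record("XY003", "TTGAATTCTT", "Chloroplast rbcL", {"taxid": 4232}),
]

# Two conditions combined with a logical AND, in the spirit of
# combining -s 'GAATTC' with -I '^GH' on one command line.
conditions = [
    lambda r: re.search("GAATTC", r.sequence, re.IGNORECASE) is not None,
    lambda r: re.search("^GH", r.identifier) is not None,
]

selected = [r.identifier for r in select(records, conditions)]
print(selected)
```

Only the first record satisfies both conditions: the second lacks the site, and the third has the site but an identifier that does not start with GH.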
Sequence record selection options
-s <REGULAR_PATTERN>, --sequence=<REGULAR_PATTERN>
Regular expression pattern to be tested against the sequence itself. The pattern is case insensitive.
Examples:
> obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
Selects only the sequence records that contain an EcoRI restriction site.
> obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
Selects only the sequence records that contain a stretch of at least 10 A.
> obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
Selects only the sequence records that do not contain ambiguous nucleotides.
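To see how the three patterns above behave, they can be tried with Python's `re` module. This is only a sketch: the helper `sequence_matches` is a hypothetical name, and the use of `re.search` with `re.IGNORECASE` is an assumption chosen to mirror the "contains" and "case insensitive" wording, not a statement about how obigrep is implemented.

```python
import re


def sequence_matches(pattern, sequence):
    """Return True if the pattern is found anywhere in the sequence.

    Assumption: a case-insensitive search, mirroring the -s description.
    """
    return re.search(pattern, sequence, re.IGNORECASE) is not None


# EcoRI site, found even in a lower-case sequence
has_ecori = sequence_matches("GAATTC", "ttgaattcaa")

# Only 9 consecutive A, so the "at least 10" pattern does not match
has_poly_a = sequence_matches("A{10,}", "CC" + "A" * 9 + "GG")

# The N is an ambiguous nucleotide, so the anchored pattern fails
unambiguous = sequence_matches("^[ACGT]+$", "ACGTNACGT")

print(has_ecori, has_poly_a, unambiguous)
```

The anchors `^` and `$` in the last pattern are what turn a "contains" test into a "consists only of" test.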
-D <REGULAR_PATTERN>, --definition=<REGULAR_PATTERN>
Regular expression pattern to be tested against the definition of the sequence record. The pattern is case sensitive.
Example:
> obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta
Selects only the sequence records whose definition contains chloroplast or Chloroplast.
-I <REGULAR_PATTERN>, --identifier=<REGULAR_PATTERN>
Regular expression pattern to be tested against the identifier of the sequence record. The pattern is case sensitive.
Example:
> obigrep -I '^GH' seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier begins with GH.
--id-list=<FILENAME>
<FILENAME> points to a text file containing the list of sequence record identifiers to be selected. The file format consists of a single identifier per line.
Example:
> obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier is present in the my_id_list.txt file.
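Selection by identifier list amounts to a set membership test. The sketch below assumes that surrounding whitespace is stripped and blank lines are ignored; the source does not specify either behaviour, and `load_id_list` is a hypothetical helper name.

```python
def load_id_list(lines):
    """Build a set of identifiers from lines holding one identifier each.

    Assumption: whitespace is stripped and empty lines are skipped.
    """
    return {line.strip() for line in lines if line.strip()}


# Stand-in for the lines of a file such as my_id_list.txt
id_lines = ["GH001\n", "XY003\n", "\n"]
wanted = load_id_list(id_lines)

identifiers = ["GH001", "GH002", "XY003"]
kept = [identifier for identifier in identifiers if identifier in wanted]
print(kept)
```

With a real file, the same function could be called as `load_id_list(open("my_id_list.txt"))`, since a file object iterates over its lines.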
-a <KEY>:<REGULAR_PATTERN>, --attribute=<KEY>:<REGULAR_PATTERN>
Regular expression pattern matched against an attribute of the sequence record. The value of this option has the form key:regular_pattern. The pattern is case sensitive. Several -a options can be used on the same command line; in that case, the selected sequence records match all the constraints.
Example:
> obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta
Selects the sequence records containing an attribute whose key is family_name and whose value is Asteraceae.
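The key:regular_pattern form can be sketched in Python as follows. Several details are assumptions made for illustration: the option value is split at the first colon, attribute values are converted to strings before matching, and a record lacking the key is rejected. The function names are hypothetical.

```python
import re


def parse_attribute_option(option):
    """Split 'key:regular_pattern' at the first colon (assumed convention)."""
    key, pattern = option.split(":", 1)
    return key, re.compile(pattern)  # case sensitive by default


def matches_all(attributes, options):
    """Return True only if every key:pattern constraint is satisfied."""
    for key, regex in map(parse_attribute_option, options):
        if key not in attributes:
            return False
        if regex.search(str(attributes[key])) is None:
            return False
    return True


attrs = {"family_name": "Asteraceae", "rank": "species"}

both = matches_all(attrs, ["family_name:Asteraceae", "rank:^spec"])
wrong_case = matches_all(attrs, ["family_name:asteraceae"])
missing_key = matches_all(attrs, ["genus_name:Bellis"])
print(both, wrong_case, missing_key)
```

The second call fails because the match is case sensitive, and the third because the record has no genus_name attribute.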
-A <ATTRIBUTE_NAME>, --has-attribute=<KEY>
Selects sequence records having an attribute whose key is <KEY>.
Example:
> obigrep -A taxid seq1.fasta > seq2.fasta
Selects only the sequence records having a taxid attribute defined.
-p <PYTHON_EXPRESSION>, --predicat=<PYTHON_EXPRESSION>
Python boolean expression to be evaluated for each sequence record. The attribute keys defined for each sequence record can be used in the expression as variable names. An extra variable named sequence refers to the sequence record itself. Several -p options can be used on the same command line; in that case, the selected sequence records match all the constraints.
Example:
> obigrep -p '(forward_error<2) and (reverse_error<2)' seq1.fasta > seq2.fasta
Selects only the sequence records whose forward_error and reverse_error attributes have a value smaller than two.
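The idea of using attribute keys as variable names can be sketched with Python's built-in `eval`. This is an illustration under stated assumptions, not the OBITools code: records are modelled as plain dictionaries, `evaluate_predicate` is a hypothetical name, and treating a record that lacks a referenced attribute as "not selected" is a choice made here because the source does not say what happens in that case.

```python
def evaluate_predicate(expression, record):
    """Evaluate a Python boolean expression against one record.

    The record's attributes become variables, and the extra variable
    'sequence' refers to the record itself.
    """
    namespace = dict(record["attributes"])
    namespace["sequence"] = record
    try:
        return bool(eval(expression, {"__builtins__": {}}, namespace))
    except NameError:
        # Assumption: an undefined attribute means the record is rejected.
        return False


expression = "(forward_error<2) and (reverse_error<2)"

good = {"id": "s1", "attributes": {"forward_error": 0, "reverse_error": 1}}
bad = {"id": "s2", "attributes": {"forward_error": 3, "reverse_error": 0}}
bare = {"id": "s3", "attributes": {}}

results = [evaluate_predicate(expression, r) for r in (good, bad, bare)]
print(results)
```

Because `eval` runs arbitrary code, an expression of this kind should only come from a trusted source such as the user's own command line.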
-L <##>, --lmax=<##>
Keeps sequence records whose sequence length is shorter than or equal to lmax.
Example:
> obigrep -L 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length shorter than or equal to 100 bp.
-l <##>, --lmin=<##>
Selects sequence records whose sequence length is longer than or equal to lmin.
Example:
> obigrep -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length longer than or equal to 100 bp.
-v, --inverse-match
Inverts the sequence record selection.
Example:
> obigrep -v -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records that have a sequence length shorter than 100 bp.
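The boundary behaviour of the length options and the effect of inversion can be checked with a small sketch. The function `keep` and its parameters are hypothetical; applying the inversion to the combined result of all conditions is an assumption consistent with the -v example above.

```python
def keep(sequence, lmin=None, lmax=None, invert=False):
    """Decide whether a sequence passes the length constraints.

    lmin and lmax are inclusive bounds; invert flips the final decision.
    """
    selected = True
    if lmin is not None and len(sequence) < lmin:
        selected = False
    if lmax is not None and len(sequence) > lmax:
        selected = False
    # Assumption: inversion applies to the overall selection result.
    return selected != invert


exactly_100 = "A" * 100
only_99 = "A" * 99

print(keep(exactly_100, lmin=100))            # length 100 satisfies lmin 100
print(keep(only_99, lmin=100))                # length 99 does not
print(keep(only_99, lmin=100, invert=True))   # inverted: shorter than 100
```

A sequence of exactly 100 bp passes both `lmin=100` and `lmax=100`, since both bounds are inclusive, and is therefore rejected once the selection is inverted.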
Options to specify input format
Restrict the analysis to a sub-part of the input file
--skip <N>
The first N sequence records of the file are discarded from the analysis and not reported to the output file.
--only <N>
Only the next N sequence records of the file are analyzed. The following sequences in the file are neither analyzed nor reported to the output file. This option can be used together with the --skip option.
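Combining the two options selects a window of records, which can be sketched with `itertools.islice`. The function `restrict` is a hypothetical name, and applying the skip before the count is an assumption based on the "next N" wording.

```python
from itertools import islice


def restrict(records, skip=0, only=None):
    """Drop the first `skip` records, then keep at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


ids = [f"seq{i}" for i in range(1, 11)]  # seq1 .. seq10

window = list(restrict(ids, skip=2, only=3))
print(window)
```

Here the first two records are skipped and the next three are kept, so the window holds seq3, seq4 and seq5. Because `islice` is lazy, records after the window are never read.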