|Latest Version| |License| |DOI| |image3|

|image4|

Subcommands
^^^^^^^^^^^

APRICOT comprises of distinct model designed to carry out specific task.

Each subcommand requires the path to the analysis folder
('APRICOT\_analysis' by deafult). Different subcommands can be quickly
viewed by running ``-h`` for help option (e.g. ``apricot -h`` or
``python3 APRICOT/bin/apricot -h``).

::

    usage: apricot [-h] [--version]
                   {create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
                   ...

    positional arguments:
      {create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
                            APRICOT commands - Refer documentation for detail
        default             Analysis using all the required subcommands at their
                            default parameters                    
        create              Create analysis folders
        taxid               Download taxonomy ids from UniProt for the user
                            provided query species
        query               Map user provided comma separated queries to UniProt
                            ids
        keywords            Save user provided keywords for domain selection
                            (required) and analysis classification (-cl)
        select              Select functional domains of interest (specified by
                            keywords) from CDD (-C) and InterPro (-I) by default
        predict             Predict functional domains in the queries based on CDD
                            (-C) and InterPro (-I) databases by default
        filter              Filter queries predicted with domains of interest (and
                            optional parameter thresholds) and extend their
                            annotations
        classify            Optional classification of selected prediction in
                            smaller groups by class keywords
        annoscore           Score and rank predicted data by 'annotation scoring'
        summary             Summary analysis output
        addanno             Optional annotation of the selected protein by -PDB,
                            -PSORTB, -RAPTORX or -REFSS (see addanno -h)
        vis                 Visualize analysis results (see vis -h) for detail
        format              Optional output file format as html or excel

    optional arguments:
      -h, --help            show this help message and exit
      --version, -v         show version

subcommand ``default``
----------------------

Quick help: ``$ apricot default -h``

This subcommand calls the analysis pipeline of the software using the
default parameters. This subcommand by-default includes the subcommands
``create``, ``keywords``, ``select``, ``predict``, ``filter``, ``classify``,
``annoscore``, ``summary`` and ``format``, which have been discussed
below in details. The two inputs: ``-i`` and
``-kw``, should be given by the users to supply the query
proteins (for example, UniProt ids) and keywords indicating the functions of interest (for
example, a list of RNA-binding domains 'RRM,KH,RNP') respectively.

The basic syntax to call this subcommand is:

::

    $ apricot default -i {query proteins} -kw {functions of interest}

Several optional arguments associated with other subcommands have been included in ``default``.
Please check the usage for details:

::

    usage: apricot default [-h]
	
Here are a few useful flags, which can be used with this subcommand:

::

	--uids, -i		Comma separated UniProt IDs
	--kw_domain, -kw	Comma separated keywords for domain selection
	--classify, -cl		Optional comma separated keyword for result classification
	--cdd, -C		Uses only CDD
	--ipr, -I		Uses only InterPro
	--needle_dir, -nd	path for the locally configured EMBOSS suite
	--skip_select		Skips running the subcommand 'select'
	--skip_annoscore		Skips running the subcommand 'annoscore'
	
	--taxid, -tx		Select taxonomy id for query species
	--geneids, -gi		Comma separated query genes
	--proteome, -P		Analyze entire proteome
	--fasta, -fa		Analyze fasta sequences
	--force, -F		force flag, removes existing files generated in the previous analysis
	
	--db_root, -dr		Uses to get absolute path of domain annoation files
	--similarity, -sim	Percent similarity of prediction with reference
	--coverage, -cov	Percent coverage of reference domain in prediction
	--identity, -iden	Percent identity of prediction with reference
	--evalue, -eval		Evalue of the domain prediction
	--gap, -gap		Percent gap in predicted domain
	--bit, -bit		Bit score in predicted domain
	
	--xlsx, -XL		create output files in excel file-format

subcommand ``create``
---------------------

Quick help: ``$ apricot create -h``

This subcommand creates all the required directories to store input and
output data acquired from APRICOT analysis. The main analysis folder can
be provided by the users (default name: APRICOT\_analysis).

::

    usage: apricot create [-h] analysis_path

    positional arguments:
      analysis_path  Creates APRICOT_analysis folder for anlysis unless other
                     name/path is provided

The structure and annotation of directories and the enclosing files in
the 'input' folder in the analysis directory:

::

    APRICOT_analysis
        └───├input
                └───├query_proteins
                └───├uniprot_reference_table
                └───├mapped_query_annotation  

The structure of directories and the enclosing files in the 'output'
folder in the analysis directory:

::

    APRICOT_analysis
        └───├output
                └───├0_predicted_domains            # Location for the output data obtained from the subcommand 'predict'
                └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'          
                └───├2_selected_domain_information  # Location for the classified data obtained from the subcommand 'classify' 
                └───├3_annotation_scoring           # Location for the output data obtained from the subcommand 'annoscore'
                └───├4_additional_annotations       # Location for additional annotations for the selected 
                |                                   # queries using subcommand 'addanno'
                └───├5_analysis_summary             # Location for the output data obtained from the subcommand 'summary'
                └───├format_output_data             # Location for the output data obtained from the subcommand 'format'
                └───├visualization_files            # Location for the output data obtained from the subcommand 'vis'

subcommand ``taxid``
--------------------

Quick help: ``$ apricot taxid -h``

The users can provide gene ids or protein names as queries to APRICOT,
which is mapped against UniProt Knowledgebase in order to extract
relevant information. Since, same gene/protein ids exist across various
genomes/proteomes, one can limit the search of the query to a certain
organism (rather than all the organisms in the database) by providing
one or multiple taxonomy ids.

When the taxonomy id is not known by the users, this subcommand --taxid
can be used to extract the id by providing species name.

::

    usage: apricot taxid [-h] [--species SPECIES] db_path

    positional arguments:
      db_path

    optional arguments:
      -h, --help            show this help message and exit
      --species SPECIES, -s SPECIES
                            Species name (comma separated if more than one) for
                            taxonomy id retreival

The taxonomy ids are saved in the text file taxonomy\_ids.txt in the
directory reference\_db\_files.

::

    source_files
        └───├reference_db_files
                |    taxonomy_ids.txt

subcommand ``query``
--------------------

Quick help: ``$ apricot query -h``

As mentioned already, APRICOT gives multiple options to the users to
supply queries. For example, the queries can be provided as UniProt ids
(--uids), gene ids or protein names (--geneids), fasta sequences
(--fasta) or only the taxonomy id (--taxid) for a complete proteome
analysis (using flag -P).

Paths for the saving the query data and their corresponding fasta files,
xml files, annotation tables etc. can be optinally provided by the
users.

::

	usage: apricot query [-h] [--analysis_path ANALYSIS_PATH] [--uids UIDS]
						 [--taxid TAXID] [--geneids GENEIDS] [--proteome]
						 [--fasta] [--query_path QUERY_PATH]
						 [--proteome_path PROTEOME_PATH] [--xml_path XML_PATH]
						 [--fasta_path FASTA_PATH] [--feature_table FEATURE_TABLE]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Main analysis path
	  --uids UIDS, -i UIDS  Comma separated UniProt IDs
	  --taxid TAXID, -tx TAXID
							Select taxonomy id for query species
	  --geneids GENEIDS, -gi GENEIDS
							Comma separated query genes
	  --proteome, -P        Analyze entire proteome
	  --fasta, -fa          Analyze fasta sequences
	  --query_path QUERY_PATH, -qp QUERY_PATH
							Get proteome table from UniProt
	  --proteome_path PROTEOME_PATH, -pp PROTEOME_PATH
							Get proteome table from UniProt
	  --xml_path XML_PATH, -xp XML_PATH
							Get proteome table from UniProt
	  --fasta_path FASTA_PATH, -fp FASTA_PATH
							Get proteome table from UniProt
	  --feature_table FEATURE_TABLE, -ft FEATURE_TABLE
							Get proteome table from UniProt

APRICOT saves the user provided queries and related information
extracted from UniProt knowledgebase (fasta files, xml files, reference
files etc.) in the directories as described below.

::

    APRICOT_analysis
        └───├input
                └───├query_proteins
                |   query_to_uids.txt           # User provided queries (gene ids/protein names/whole proteome set) 
                |                               # mapped to the UniProt Ids (flag --uids, --geneids)
                └───├uniprot_reference_table
                |   query_uids_reference.tab    # Basic annotations of the query protein IDs (flag --uids, --geneids)  set
                |                               # or the whole proteome (flag -P) from a certain taxonomy (flag --taxid)
                └───├mapped_query_annotation  
                        └───├fasta_path_mapped_query  # Location for protein FASTA sequences of each query
                        |   |                         # qery fasta sequences are also saved here (flag --fasta)
                        |   | query_id-1.fasta 
                        |   | query_id-2.fasta
                        |   | ...
                        |   | query_id-n.fasta
                        |
                        └───├xml_path_mapped_query    # Location for protein FASTA sequences of each query
                        |   | query_id-1.xml
                        |   | query_id-2.xml
                        |   | ...
                        |   | query_id-n.xml
                        |
                        └───├mapped_protein_xml_info_tables  
                            | query_feature_table.csv  # File containing all the features of the queries 
                                                       # obtained by parsing xml files

subcommand ``keywords``
-----------------------

Quick help: ``$ apricot keywords -h``

Since APRICOT allows identification of certain protein classes like
RNA-binding proteins by means of domains, one of the most essential
input data, beside the query protein itself, is a comma-separated list
of terms or keywords that potentially indicates to a protein functional
classes (*domain selection terms*). Such terminologies could be any pfam
id, Gene Ontology term, mesh term, simple biological terms like 'RRM'
and 'ribosome', or a combination of all these types.

Multi-word terms can be provided by using ‘-’ as a connector, for
example, 'RNA-binding' and 'La-domain'.

In order to maintain stringent selection of truly functional domains,
APRICOT by-default does not allow the selection of a domain entry if the
*domain selection term* occurs in its annotation with any trailing words
like prefixes or suffixes. This indicates the possibilities of omitting
few relevant entries from the domain selection keywords, but it also
ensures exclusion of several non-relevant domains that might get
included by chance. However, users can allow prefix by using the hash
symbol (#) in the beginning of a term and suffix when # is used at the
end of the term. For example, by using '#RNA-binding' one can allow the
inclusion of 'tRNA-binding', 'mtRNA-binding'etc, and by allowing
'RNA-bind#' one can allow varying verb forms for bind like binder,
binding etc. Of course, one can allow both prefixes and suffixes
(#RNA-bind#).

Optionally a second set of keywords for the classification of predicted
domains can be provided by using flag -cl (*result classification
terms*). This list can comprise of terms associated to biological
functions, enzymatic activities or specific features. For example, the
predicted RNA related domain data could be divided into the
classification tags of RRM, ribosome, synthetase, helicases etc. Such
classification can help users tremendously in navigating the large
datasets or for the selection of representative protein for certain
function conferred by the domains. When users do not provide *result
classification terms*, APRICOT uses the *domain selection terms* for
this purpose as well.

::

    usage: apricot keywords [-h] [--classify CLASSIFY] [--db_root DB_ROOT]
                            kw_domain

    positional arguments:
      kw_domain             Comma separated keywords for domain selection

    optional arguments:
      -h, --help            show this help message and exit
      --classify CLASSIFY, -cl CLASSIFY
                            Optional comma separated keyword for result
                            classification
      --db_root DB_ROOT, -dr DB_ROOT
                        Path for keyword files

The keywords are saved in the directory ``source_files`` in the
subfolder ``domain_data`` shown below.

::

    source_files
        └───├domain_data
                keywords_for_domain_selection.txt         # All the terms for domain selection
                keywords_for_result_classification.txt    # All the terms for result classification

subcommand ``select``
---------------------

Quick help: ``apricot select -h``

This subcommand allows the selection of reference domains based on the
*domain selection terms* (in subcommand keywords). For this purpose,
by-default APRICOT scans each entries of the domains in both CDD and
InterPro domain consortiums for the occurance of any *domain selection
term*.

In case of multi word terms (which are provided by using '-' as a
connector), the co-occurance of the terms are considered when the words
in the same sentence or same context. To ensure a more complete
selection of the domains, the gene-ontology associated to the domains
are also checked and selected accordingly.

It is possible to limit the selection process in only one of the
consortiums by providing flags -C for CDD or -I for InterPro. For cross
mapping the domains in both the consortiums, APRICOT uses domain ids
from the databases (Pfam, SMART and TIGRFAM) that are shared by both the
consortiums.

::

    usage: apricot select [-h] [--cdd] [--ipr] [--skip_select] [--dom_kw DOM_KW]
						  [--db_root DB_ROOT]

	optional arguments:
		-h, --help            show this help message and exit
		--cdd, -C             Selects functional domains of interest from CDD
		--ipr, -I             Selects functional domains of interest from CDD
		--skip_select, -skip_select
							  Skips running the subcommand 'select'
		--dom_kw DOM_KW, -dk DOM_KW
							  Absolute path of keyword files
		--db_root DB_ROOT, -dr DB_ROOT
							  Uses to get absolute path of domain annoation files,
							  keyword selected domains

The domains that are selected from CDD and InterPro are stored in the
directory domains\_data in the bin folder. The selected domains are
compiled and saved into the file
all\_keyword\_selected\_domain\_data.tab in the domain\_data.

::

    bin
    │   ...
    └───├domain_data
        └───├cdd
        └───├interpro
        | all_keyword_selected_domain_data.tab

subcommand ``predict``
----------------------

Quick help: ``$ apricot predict -h``

This subcommand is used to begin the process of domain predictions in
the query proteins by all the possible functional domains using RPSBLAST
against CDD and InterProScan against InetrPro. APRICOT carries out the
domain prediction from both CDD and InterPro consortiums by default but
users can choose to predict domains from only one of the databases by
using the flag -C for CDD and -I for InterPro. To overwrite old
predictions, the flag -F (for force) can be used.

The run time of RPSBLAST is considerably less, therefore -C flag can be
used to obtain a quick information of the functional domains. However,
we recommend the default setting because the different databases
involved in both the consortiums provide a larger scope for completeness
of domain predictions.

The execution of this subcommand is the basic requirement for the
APRICOT analysis. The main input of this step is fasta sequences of
query proteins. This subcommand can be executed simultabeously or even
before running the subcommand 'select'.

::

    usage: apricot predict [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
                           [--force] [--cdd_db CDD_DB] [--ipr_db IPR_DB]
                           [--predicted PREDICTED] [--fasta_path FASTA_PATH]

    optional arguments:
		-h, --help            show this help message and exit
		--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							  Provide output path for the analysis result of the
							  chosen method
		--cdd, -C             domain prediction based on CDD only
		--ipr, -I             domain prediction based on InterProScan only
		--force, -F           force flag for the current analysis, removes already
							  existing predictions
		--cdd_db CDD_DB, -cdb CDD_DB
							  Provide absolute path of CDD databases based on the
							  chosen method
		--ipr_db IPR_DB, -idb IPR_DB
							  Provide absolute path of InterPro databases based on
							  the chosen method
		--predicted PREDICTED, -pred PREDICTED
							  Provide output path for domain prediction files
		--fasta_path FASTA_PATH, -fp FASTA_PATH
							  Provide absolute path of fasta files for query
							  proteins
								  proteins

The resulting files of this analysis is stored in the first analysis
directory '0\_predicted\_domains' in the output folder of the main
analysis directory. As shown below, the information of the domain
predictions are stored as text files in the sub-folders corresponding to
the domain consortiums. Since this subcommand is independent of the
reference domains, these files containing information on domain
predictions can be recycled or re-visited for the selection of different
functional classes.

::

    APRICOT_analysis
        └───├output
                └───├0_predicted_domains # Location for the output data obtained from the subcommand 'predict'
                        └───├cdd_analysis  # Details of domain predicted from CDD for each query
                        |   | query_id-1.txt
                        |   | query_id-2.txt
                        |   | ...
                        |   | query_id-n.txt
                        |
                        └───├ipr_analysis  # Details of domain predicted from InterPro for each query
                            | query_id-1.tsv
                            | query_id-2.tsv
                            | ...
                            | query_id-n.tsv

subcommand ``filter``
---------------------

Quick help: ``$ apricot filter -h``

The filtering of the predicted domains by this subcommand take place by
using the *domain selection terms*, hence this subcommand should be
executed after 'select' and 'predict' subcommands.

Query proteins that consist of at least one of the selected domains are
retained whereas the rest of the proteins are discarded from the
downstream analysis. To limit the analysis to one of the consortiums
only, flag -C for CDD based information and -I for InterPro based
information can be used.

The users can choose their cut-offs for the parameters by using the
flags --similarity, --coverage, --identity, --evalue, --bit (bit score)
and --gap. However, the default parameters for the selection of
predicted domains are defined as 'coverage > 39' and 'similarity > 24',
which have been derived from a large RNA-binding positive and negative
training sets collected from SwissProt database.

::

	usage: apricot filter [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
						  [--domain_description_file DOMAIN_DESCRIPTION_FILE]
						  [--similarity SIMILARITY] [--coverage COVERAGE]
						  [--identity IDENTITY] [--evalue EVALUE] [--gap GAP]
						  [--bit BIT] [--go_path GO_PATH] [--pred_path PRED_PATH]
						  [--up_table UP_TABLE] [--xml_info XML_INFO]
						  [--compile_out COMPILE_OUT] [--selected SELECTED]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --cdd, -C             Filter of domain prediction based on CDD only
	  --ipr, -I             Filter of domain prediction based on InterProScan only
	  --domain_description_file DOMAIN_DESCRIPTION_FILE, -dd DOMAIN_DESCRIPTION_FILE
							Description table of the selected domains
	  --similarity SIMILARITY, -sim SIMILARITY
							Percent similarity of prediction with reference
	  --coverage COVERAGE, -cov COVERAGE
							Percent coverage of reference domain in prediction
	  --identity IDENTITY, -iden IDENTITY
							Percent identity of prediction with reference
	  --evalue EVALUE, -eval EVALUE
							Evalue of the domain prediction
	  --gap GAP, -gap GAP   Percent gap in predicted domain
	  --bit BIT, -bit BIT   Bit score in predicted domain
	  --go_path GO_PATH, -gp GO_PATH
							Go mapping data from fixed database reference files
	  --pred_path PRED_PATH, -predp PRED_PATH
							Raw files of domain prediction
	  --up_table UP_TABLE, -ref UP_TABLE
							Uniprot proteome table from UniProt
	  --xml_info XML_INFO, -feat XML_INFO
							Uniprot proteome table from UniProt
	  --compile_out COMPILE_OUT, -co COMPILE_OUT
							Data with annotation after filtering
	  --selected SELECTED, -sel SELECTED
							output path for the selected data with annotations
							
APRICOT saves all the domain data in the directory
'1\_compiled\_domain\_information' of the output folder. All the
predicted domains (independent of reference domains and parameter
cut-offs) are saved in the sub-folder 'unfiltered\_data' and the
selected data is saved in the sub-folder 'selected\_data' in separate
files for different domain resources as shown below.

Files in the sub-folder 'selected\_data' contain predicted domain entry
based on the reference domain sets and are marked with the tags
*ParameterSelected* when the domain predictions satisfy the defined
parameter cut-offs (or default cut-offs) or *Parameter Discarded* when
it does not pass the parameter filters. In those cases, when certain
parameter is not available for the predicted domain, a tag
*ParameterNotApplicable* is used.

::

    APRICOT_analysis
        └───├output
            └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'          
                        └───├unfiltered_data  # Information of all the domains in the query proteins predicted.
                        |   | cdd_unfiltered_all_prediction.csv  # CDD 
                        |   | ipr_unfiltered_all_prediction.csv  # InterPro
                        |
                        └───├selected_data      # Information of the selected reference domains in the query proteins
                            | cdd_filtered.csv                   # CDD 
                            | ipr_filtered.csv                   # InterPro 

Queries, that are selected on the basis of reference domains and
parameter cut-offs, are compiled and stored in the directory
'2\_selected\_domain\_information' in the sub-folder 'combined\_data'.
These files contain the information of selected domains along with the
additional annotations of the query proteins extracted from various
resources like UniProt and Gene Ontology .

::

    APRICOT_analysis
        └───├output    
                └───├2_selected_domain_information            
                        └───├combined_data         # All the selected domain data extended 
                            |                       # with the UniProt annotation
                            | annotation_extended_for_selected.csv

Sub-commands for downstream analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

subcommand ``classify``
-----------------------

Quick help: ``$ apricot classify -h``

This subcommand classifies the resulting domain information of the
selected queries by using the *result classification terms* (provided in
the subcommand 'keywords').

::

	usage: apricot classify [-h] [--analysis_path ANALYSIS_PATH]
							[--selected SELECTED] [--class_kw CLASS_KW]
							[--classify CLASSIFY] [--classified CLASSIFIED]
							[--db_root DB_ROOT]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --selected SELECTED, -sel SELECTED
							Selected data file (from select) with annotations
	  --class_kw CLASS_KW, -ck CLASS_KW
							Path for keyword files
	  --classify CLASSIFY, -cl CLASSIFY
							Optional comma separated keyword for result
							classification
	  --classified CLASSIFIED, -c CLASSIFIED
							Classification of selected data based on provided
							keywords
	  --db_root DB_ROOT, -dr DB_ROOT
							Path for keyword files

The classified data are stored in the folder as shown below:

::

    APRICOT_analysis
        └───├output    
                └───├2_selected_domain_information            
                        └───├classified_data                            # Location for the output data obtained 
                            |                                           # from the subcommand 'classify'
                            | classification_key-1_selected_data.csv    # Files containing subsets of predicted data...
                            | classification_key-2_selected_data.csv    # ... based on user provided classification keys.

subcommand ``annoscore``
------------------------

This subcommand is executed for the annotation based scoring of the
selcted domains in the query proteins.

In order to differentiate domain predictions of low confidence from that
of high confidence, the predicted domain sites are compared with their
corresponding references and scored by means of methods that measure
their similarities by means of various sequence-based features. The
comparisons of the features between the predicted domain sites and
reference are scored based on the principle of Bayesian probability,
where a score closer to 1 represents a favourable scenario.

There are four groups of features that are involved in the annotation
based scoring. 1. Chemical properties 2. Needleman-Wunsch global
alignment scores 3. Euclidean distances of protein compositions 4.
Prediction parameters of the predicted sites

Quick help: ``$ apricot annoscore -h``

::

	usage: apricot annoscore [-h] [--analysis_path ANALYSIS_PATH]
							 [--selected SELECTED] [--cdd_pred CDD_PRED]
							 [--scored SCORED] [--needle_dir NEEDLE_DIR]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --selected SELECTED, -sel SELECTED
							Provided selected protein table
	  --cdd_pred CDD_PRED, -cp CDD_PRED
							Raw files obtained from CDD based domain prediction
	  --scored SCORED, -sco SCORED
							Output path for annotation scoring files
	  --needle_dir NEEDLE_DIR, -nd NEEDLE_DIR
							path for the locally configured EMBOSS suite

The data with annotation scores are stored in the folder as shown below:

::

    APRICOT_analysis
        └───├output
                └───├3_annotation_scoring          # Location for the output data obtained 
                    |                              # from the subcommand 'annoscore'
                    | annotation_extended_for_selected.csv

subcommand ``addanno``
----------------------

Quick help: ``$ apricot addanno -h``

This subcommand allows users to further annotate the query sequences
that are selected based on the defined functional domains.

Following modules can be used with their respective flags for additional
annotations of the selected proteins:

1. Identification sub-cellular localization of the proteins (flag
   -psortb)
2. Secondary structure calculation by RaptorX (flag -raptorx)
3. Tertiary structure homologs from Protein Data Bank (flag -pdb)
4. Gene Ontology (flag -go)

::

	usage: apricot addanno [-h] [--force] [--pdb] [--psortb] [--raptorx] [--refss]
						   [--analysis_path ANALYSIS_PATH]
						   [--fasta_path FASTA_PATH] [--selected SELECTED]
						   [--add_out ADD_OUT] [--pdb_path PDB_PATH]
						   [--psortb_path PSORTB_PATH]
						   [--raptorx_path RAPTORX_PATH]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --force, -F           force flag for the current analysis, removes already
							existing predictions
	  --pdb, -PDB           Optional annotation of the selected protein by PDB
							structure homolog
	  --psortb, -PSORTB     Optional annotation of the selected protein by
							localization using PsortB
	  --raptorx, -RAPTORX   Optional annotation of the selected protein by
							secondary structure using RaptorX
	  --refss, -REFSS       Optional annotation of the selected protein by
							secondary structure using literature reference
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --fasta_path FASTA_PATH, -fp FASTA_PATH
							Provide absolute path of fasta files for query
							proteins
	  --selected SELECTED, -sel SELECTED
							Provided selected protein table
	  --add_out ADD_OUT, -ao ADD_OUT
							Output path for additional annotation data
	  --pdb_path PDB_PATH, -pdb_path PDB_PATH
							Provide absolute path of APRICOT formatted pdb
							database ~pdb/pdb_sequence/pdb_sequence.txt
	  --psortb_path PSORTB_PATH, -psortb_path PSORTB_PATH
							Provide absolute path of APRICOT installed psortb
	  --raptorx_path RAPTORX_PATH, -raptorx_path RAPTORX_PATH
							Provide absolute path of APRICOT installed raptorx
							till the perl script run_raptorx-ss8.pl

The resulting files are stored in the directory
4\_additional\_annotations in the corresponding sub-folders, as shown
below:

::

    APRICOT_analysis
        └───├output
                └───├4_additional_annotations               # Location for additional annotations for the 
                        |                                   # selected queries using subcommand 'addanno'
                        └───├pdb_sequence_prediction        # PDB structure homologs of the selected 
                        |                                   # queries (flag --pdb, -PDB)
                        └───├protein_localization           # PSORTb based localization of the selected 
                        |                                   # queries (flag --psortb, -PSORTB)
                        └───├protein_secondary_structure    # RaptorX based structure of the selected 
                                                            # queries (flag --raptorx, -RAPTORX)

subcommand ``summary``
----------------------

Quick help: ``$ apricot summary -h``

To get an overview of the analysis carried out on a set of query
proteins, this sub command can be used. It generate information like,
how many queries could be mapped to the UniProt IDs, how many contain
the reference domains etc., to provide analysis overview.

::

	usage: apricot summary [-h] [--analysis_path ANALYSIS_PATH]
						   [--query_map QUERY_MAP] [--domains DOMAINS]
						   [--unfilter_path UNFILTER_PATH]
						   [--summarized SUMMARIZED]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --query_map QUERY_MAP, -q QUERY_MAP
							query_to_uids.txt file created by APRICOT to save
							query mapping information
	  --domains DOMAINS, -d DOMAINS
							File containing all the keyword selected_domains of
							interest
	  --unfilter_path UNFILTER_PATH, -uf UNFILTER_PATH
							Directory with the unfiltered domain data from
							output-1 (unfiltered_data)
	  --summarized SUMMARIZED, -sum SUMMARIZED
							Provide output path

The resulting files are stored in the directory 5\_analysis\_summary in
the corresponding sub-folders, as shown below:

::

    APRICOT_analysis
        └───├output
                └───├5_analysis_summary # Location for the output data obtained from the subcommand 'summary'
                    | APRICOT_analysis_summary.csv

subcommand ``format``
---------------------

Quick help: ``$ apricot format -h``

Formats and stores various tables in the HTML tabels (--html), excel
files (--xlsx) or both.

::

    usage: apricot format [-h] [--analysis_path ANALYSIS_PATH] [--inpath INPATH]
                          [--html] [--xlsx] [--outpath OUTPATH]

    optional arguments:
      -h, --help            show this help message and exit
      --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                            Provide analysis path
      --inpath INPATH, -i INPATH
                            Choose folder from analysis to be converted
      --html, -HT
      --xlsx, -XL
      --formatted FORMATTED, -form FORMATTED
                            Output path for files with different file formats

The resulting files are stored in the directory format\_output\_data in
the corresponding sub-folders, as shown below:

::

    APRICOT_analysis
        └───├output
                └───├format_output_data # Location for the output data obtained from the subcommand 'format'
                        └───├excel_files               # excel files (flag -XL)
                        └───├html_files                # HTML files (flag -HT)

subcommand ``vis``
------------------

Quick help: ``$ apricot vis -h``

Visualize different resulting data like predicted domains sites,
tertiary structure of selected proteins etc.

::

	usage: apricot vis [-h] [--analysis_path ANALYSIS_PATH]
					   [--ann_score ANN_SCORE] [--add_anno ADD_ANNO]
					   [--selected SELECTED] [--domain] [--annoscore] [--secstr]
					   [--localiz] [--msa] [--complete] [--visualized VISUALIZED]
	
	optional arguments:
	  -h, --help            show this help message and exit
	  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
							Provide analysis path
	  --ann_score ANN_SCORE, -an ANN_SCORE
							Provide annotation score file
	  --add_anno ADD_ANNO, -ad ADD_ANNO
							Provide path to additional annotation
	  --selected SELECTED, -sel SELECTED
							Provided selected protein table
	  --domain, -D          Visualizes predicted domains on the query by
							highlighting
	  --annoscore, -A       Visualizes overview of prediction statistics
	  --secstr, -S          Visualizes secondary structures predicted by RaptorX
	  --localiz, -L         Visualizes subcellular localization predcited by
							PsortB
	  --msa, -M             Visualizes Multiple Sequence Alignments of homologous
							sequences from PDB
	  --complete, -C        Visualizes all the possible features
	  --visualized VISUALIZED, -vi VISUALIZED
							Output path for visualization files

The resulting files are stored in the directory visualization\_files in
the corresponding sub-folders, as shown below:

::

    APRICOT_analysis
        └───├output
                └───├visualization_files # Location for the output data obtained from the subcommand 'vis'
                        └───├domain_highlighting      # Visualizing the domain sites on the protein sequences
                        └───├homologous_pdb_msa       # Multiple sequence alignment of the structure homologs
                        └───├overview_and_statistics  # Visualizing the overview of the selected query proteins
                        └───├secondary_structure      # Visualizing 3-state secondary struvture of the query sequence
                        └───├subcellular_localization # Heatmap showing the probability of different localization sites 

.. |Latest Version| image:: https://img.shields.io/pypi/v/bio-apricot.svg
   :target: https://pypi.python.org/pypi/bio-apricot/
.. |License| image:: https://img.shields.io/pypi/l/bio-apricot.svg
   :target: https://pypi.python.org/pypi/bio-apricot/
.. |DOI| image:: https://zenodo.org/badge/21283/malvikasharan/APRICOT.svg
   :target: https://zenodo.org/badge/latestdoi/21283/malvikasharan/APRICOT
.. |image3| image:: https://images.microbadger.com/badges/image/malvikasharan/apricot.svg
   :target: https://microbadger.com/images/malvikasharan/apricot
.. |image4| image:: https://raw.githubusercontent.com/malvikasharan/APRICOT/master/APRICOT_logo.png
   :target: http://malvikasharan.github.io/APRICOT/