Subcommands¶

APRICOT comprises of distinct model designed to carry out specific task.

Each subcommand requires the path to the analysis folder (‘APRICOT_analysis’ by deafult). Different subcommands can be quickly viewed by running -h for help option (e.g. apricot -h or python3 APRICOT/bin/apricot -h).

usage: apricot [-h] [--version]
               {create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
               ...

positional arguments:
  {create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
                        APRICOT commands - Refer documentation for detail
    default             Analysis using all the required subcommands at their
                        default parameters
    create              Create analysis folders
    taxid               Download taxonomy ids from UniProt for the user
                        provided query species
    query               Map user provided comma separated queries to UniProt
                        ids
    keywords            Save user provided keywords for domain selection
                        (required) and analysis classification (-cl)
    select              Select functional domains of interest (specified by
                        keywords) from CDD (-C) and InterPro (-I) by default
    predict             Predict functional domains in the queries based on CDD
                        (-C) and InterPro (-I) databases by default
    filter              Filter queries predicted with domains of interest (and
                        optional parameter thresholds) and extend their
                        annotations
    classify            Optional classification of selected prediction in
                        smaller groups by class keywords
    annoscore           Score and rank predicted data by 'annotation scoring'
    summary             Summary analysis output
    addanno             Optional annotation of the selected protein by -PDB,
                        -PSORTB, -RAPTORX or -REFSS (see addanno -h)
    vis                 Visualize analysis results (see vis -h) for detail
    format              Optional output file format as html or excel

optional arguments:
  -h, --help            show this help message and exit
  --version, -v         show version

subcommand `default`¶

Quick help: $ apricot default -h

This subcommand calls the analysis pipeline of the software using the default parameters. This subcommand by-default includes the subcommands create, keywords, select, predict, filter, classify, annoscore, summary and format, which have been discussed below in details. The two inputs: -i and -kw, should be given by the users to supply the query proteins (for example, UniProt ids) and keywords indicating the functions of interest (for example, a list of RNA-binding domains ‘RRM,KH,RNP’) respectively.

The basic syntax to call this subcommand is:

$ apricot default -i {query proteins} -kw {functions of interest}

Several optional arguments associated with other subcommands have been included in default. Please check the usage for details:

usage: apricot default [-h]

Here are a few useful flags, which can be used with this subcommand:

--uids, -i              Comma separated UniProt IDs
--kw_domain, -kw        Comma separated keywords for domain selection
--classify, -cl         Optional comma separated keyword for result classification
--cdd, -C               Uses only CDD
--ipr, -I               Uses only InterPro
--needle_dir, -nd       path for the locally configured EMBOSS suite
--skip_select           Skips running the subcommand 'select'
--skip_annoscore                Skips running the subcommand 'annoscore'

--taxid, -tx            Select taxonomy id for query species
--geneids, -gi          Comma separated query genes
--proteome, -P          Analyze entire proteome
--fasta, -fa            Analyze fasta sequences
--force, -F             force flag, removes existing files generated in the previous analysis

--db_root, -dr          Uses to get absolute path of domain annoation files
--similarity, -sim      Percent similarity of prediction with reference
--coverage, -cov        Percent coverage of reference domain in prediction
--identity, -iden       Percent identity of prediction with reference
--evalue, -eval         Evalue of the domain prediction
--gap, -gap             Percent gap in predicted domain
--bit, -bit             Bit score in predicted domain

--xlsx, -XL             create output files in excel file-format

subcommand `create`¶

Quick help: $ apricot create -h

This subcommand creates all the required directories to store input and output data acquired from APRICOT analysis. The main analysis folder can be provided by the users (default name: APRICOT_analysis).

usage: apricot create [-h] analysis_path

positional arguments:
  analysis_path  Creates APRICOT_analysis folder for anlysis unless other
                 name/path is provided

The structure and annotation of directories and the enclosing files in the ‘input’ folder in the analysis directory:

APRICOT_analysis
    └───├input
            └───├query_proteins
            └───├uniprot_reference_table
            └───├mapped_query_annotation

The structure of directories and the enclosing files in the ‘output’ folder in the analysis directory:

APRICOT_analysis
    └───├output
            └───├0_predicted_domains            # Location for the output data obtained from the subcommand 'predict'
            └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'
            └───├2_selected_domain_information  # Location for the classified data obtained from the subcommand 'classify'
            └───├3_annotation_scoring           # Location for the output data obtained from the subcommand 'annoscore'
            └───├4_additional_annotations       # Location for additional annotations for the selected
            |                                   # queries using subcommand 'addanno'
            └───├5_analysis_summary             # Location for the output data obtained from the subcommand 'summary'
            └───├format_output_data             # Location for the output data obtained from the subcommand 'format'
            └───├visualization_files            # Location for the output data obtained from the subcommand 'vis'

subcommand `taxid`¶

Quick help: $ apricot taxid -h

The users can provide gene ids or protein names as queries to APRICOT, which is mapped against UniProt Knowledgebase in order to extract relevant information. Since, same gene/protein ids exist across various genomes/proteomes, one can limit the search of the query to a certain organism (rather than all the organisms in the database) by providing one or multiple taxonomy ids.

When the taxonomy id is not known by the users, this subcommand –taxid can be used to extract the id by providing species name.

usage: apricot taxid [-h] [--species SPECIES] db_path

positional arguments:
  db_path

optional arguments:
  -h, --help            show this help message and exit
  --species SPECIES, -s SPECIES
                        Species name (comma separated if more than one) for
                        taxonomy id retreival

The taxonomy ids are saved in the text file taxonomy_ids.txt in the directory reference_db_files.

source_files
    └───├reference_db_files
            |    taxonomy_ids.txt

subcommand `query`¶

Quick help: $ apricot query -h

As mentioned already, APRICOT gives multiple options to the users to supply queries. For example, the queries can be provided as UniProt ids (–uids), gene ids or protein names (–geneids), fasta sequences (–fasta) or only the taxonomy id (–taxid) for a complete proteome analysis (using flag -P).

Paths for the saving the query data and their corresponding fasta files, xml files, annotation tables etc. can be optinally provided by the users.

usage: apricot query [-h] [--analysis_path ANALYSIS_PATH] [--uids UIDS]
                                         [--taxid TAXID] [--geneids GENEIDS] [--proteome]
                                         [--fasta] [--query_path QUERY_PATH]
                                         [--proteome_path PROTEOME_PATH] [--xml_path XML_PATH]
                                         [--fasta_path FASTA_PATH] [--feature_table FEATURE_TABLE]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Main analysis path
  --uids UIDS, -i UIDS  Comma separated UniProt IDs
  --taxid TAXID, -tx TAXID
                                                Select taxonomy id for query species
  --geneids GENEIDS, -gi GENEIDS
                                                Comma separated query genes
  --proteome, -P        Analyze entire proteome
  --fasta, -fa          Analyze fasta sequences
  --query_path QUERY_PATH, -qp QUERY_PATH
                                                Get proteome table from UniProt
  --proteome_path PROTEOME_PATH, -pp PROTEOME_PATH
                                                Get proteome table from UniProt
  --xml_path XML_PATH, -xp XML_PATH
                                                Get proteome table from UniProt
  --fasta_path FASTA_PATH, -fp FASTA_PATH
                                                Get proteome table from UniProt
  --feature_table FEATURE_TABLE, -ft FEATURE_TABLE
                                                Get proteome table from UniProt

APRICOT saves the user provided queries and related information extracted from UniProt knowledgebase (fasta files, xml files, reference files etc.) in the directories as described below.

APRICOT_analysis
    └───├input
            └───├query_proteins
            |   query_to_uids.txt           # User provided queries (gene ids/protein names/whole proteome set)
            |                               # mapped to the UniProt Ids (flag --uids, --geneids)
            └───├uniprot_reference_table
            |   query_uids_reference.tab    # Basic annotations of the query protein IDs (flag --uids, --geneids)  set
            |                               # or the whole proteome (flag -P) from a certain taxonomy (flag --taxid)
            └───├mapped_query_annotation
                    └───├fasta_path_mapped_query  # Location for protein FASTA sequences of each query
                    |   |                         # qery fasta sequences are also saved here (flag --fasta)
                    |   | query_id-1.fasta
                    |   | query_id-2.fasta
                    |   | ...
                    |   | query_id-n.fasta
                    |
                    └───├xml_path_mapped_query    # Location for protein FASTA sequences of each query
                    |   | query_id-1.xml
                    |   | query_id-2.xml
                    |   | ...
                    |   | query_id-n.xml
                    |
                    └───├mapped_protein_xml_info_tables
                        | query_feature_table.csv  # File containing all the features of the queries
                                                   # obtained by parsing xml files

subcommand `keywords`¶

Quick help: $ apricot keywords -h

Since APRICOT allows identification of certain protein classes like RNA-binding proteins by means of domains, one of the most essential input data, beside the query protein itself, is a comma-separated list of terms or keywords that potentially indicates to a protein functional classes (domain selection terms). Such terminologies could be any pfam id, Gene Ontology term, mesh term, simple biological terms like ‘RRM’ and ‘ribosome’, or a combination of all these types.

Multi-word terms can be provided by using ‘-’ as a connector, for example, ‘RNA-binding’ and ‘La-domain’.

In order to maintain stringent selection of truly functional domains, APRICOT by-default does not allow the selection of a domain entry if the domain selection term occurs in its annotation with any trailing words like prefixes or suffixes. This indicates the possibilities of omitting few relevant entries from the domain selection keywords, but it also ensures exclusion of several non-relevant domains that might get included by chance. However, users can allow prefix by using the hash symbol (#) in the beginning of a term and suffix when # is used at the end of the term. For example, by using ‘#RNA-binding’ one can allow the inclusion of ‘tRNA-binding’, ‘mtRNA-binding’etc, and by allowing ‘RNA-bind#’ one can allow varying verb forms for bind like binder, binding etc. Of course, one can allow both prefixes and suffixes (#RNA-bind#).

Optionally a second set of keywords for the classification of predicted domains can be provided by using flag -cl (result classification terms). This list can comprise of terms associated to biological functions, enzymatic activities or specific features. For example, the predicted RNA related domain data could be divided into the classification tags of RRM, ribosome, synthetase, helicases etc. Such classification can help users tremendously in navigating the large datasets or for the selection of representative protein for certain function conferred by the domains. When users do not provide result classification terms, APRICOT uses the domain selection terms for this purpose as well.

usage: apricot keywords [-h] [--classify CLASSIFY] [--db_root DB_ROOT]
                        kw_domain

positional arguments:
  kw_domain             Comma separated keywords for domain selection

optional arguments:
  -h, --help            show this help message and exit
  --classify CLASSIFY, -cl CLASSIFY
                        Optional comma separated keyword for result
                        classification
  --db_root DB_ROOT, -dr DB_ROOT
                    Path for keyword files

The keywords are saved in the directory source_files in the subfolder domain_data shown below.

source_files
    └───├domain_data
            keywords_for_domain_selection.txt         # All the terms for domain selection
            keywords_for_result_classification.txt    # All the terms for result classification

subcommand `select`¶

Quick help: apricot select -h

This subcommand allows the selection of reference domains based on the domain selection terms (in subcommand keywords). For this purpose, by-default APRICOT scans each entries of the domains in both CDD and InterPro domain consortiums for the occurance of any domain selection term.

In case of multi word terms (which are provided by using ‘-‘ as a connector), the co-occurance of the terms are considered when the words in the same sentence or same context. To ensure a more complete selection of the domains, the gene-ontology associated to the domains are also checked and selected accordingly.

It is possible to limit the selection process in only one of the consortiums by providing flags -C for CDD or -I for InterPro. For cross mapping the domains in both the consortiums, APRICOT uses domain ids from the databases (Pfam, SMART and TIGRFAM) that are shared by both the consortiums.

usage: apricot select [-h] [--cdd] [--ipr] [--skip_select] [--dom_kw DOM_KW]
                                              [--db_root DB_ROOT]

    optional arguments:
            -h, --help            show this help message and exit
            --cdd, -C             Selects functional domains of interest from CDD
            --ipr, -I             Selects functional domains of interest from CDD
            --skip_select, -skip_select
                                                      Skips running the subcommand 'select'
            --dom_kw DOM_KW, -dk DOM_KW
                                                      Absolute path of keyword files
            --db_root DB_ROOT, -dr DB_ROOT
                                                      Uses to get absolute path of domain annoation files,
                                                      keyword selected domains

The domains that are selected from CDD and InterPro are stored in the directory domains_data in the bin folder. The selected domains are compiled and saved into the file all_keyword_selected_domain_data.tab in the domain_data.

bin
│   ...
└───├domain_data
    └───├cdd
    └───├interpro
    | all_keyword_selected_domain_data.tab

subcommand `predict`¶

Quick help: $ apricot predict -h

This subcommand is used to begin the process of domain predictions in the query proteins by all the possible functional domains using RPSBLAST against CDD and InterProScan against InetrPro. APRICOT carries out the domain prediction from both CDD and InterPro consortiums by default but users can choose to predict domains from only one of the databases by using the flag -C for CDD and -I for InterPro. To overwrite old predictions, the flag -F (for force) can be used.

The run time of RPSBLAST is considerably less, therefore -C flag can be used to obtain a quick information of the functional domains. However, we recommend the default setting because the different databases involved in both the consortiums provide a larger scope for completeness of domain predictions.

The execution of this subcommand is the basic requirement for the APRICOT analysis. The main input of this step is fasta sequences of query proteins. This subcommand can be executed simultabeously or even before running the subcommand ‘select’.

usage: apricot predict [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
                       [--force] [--cdd_db CDD_DB] [--ipr_db IPR_DB]
                       [--predicted PREDICTED] [--fasta_path FASTA_PATH]

optional arguments:
            -h, --help            show this help message and exit
            --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                      Provide output path for the analysis result of the
                                                      chosen method
            --cdd, -C             domain prediction based on CDD only
            --ipr, -I             domain prediction based on InterProScan only
            --force, -F           force flag for the current analysis, removes already
                                                      existing predictions
            --cdd_db CDD_DB, -cdb CDD_DB
                                                      Provide absolute path of CDD databases based on the
                                                      chosen method
            --ipr_db IPR_DB, -idb IPR_DB
                                                      Provide absolute path of InterPro databases based on
                                                      the chosen method
            --predicted PREDICTED, -pred PREDICTED
                                                      Provide output path for domain prediction files
            --fasta_path FASTA_PATH, -fp FASTA_PATH
                                                      Provide absolute path of fasta files for query
                                                      proteins
                                                              proteins

The resulting files of this analysis is stored in the first analysis directory ‘0_predicted_domains’ in the output folder of the main analysis directory. As shown below, the information of the domain predictions are stored as text files in the sub-folders corresponding to the domain consortiums. Since this subcommand is independent of the reference domains, these files containing information on domain predictions can be recycled or re-visited for the selection of different functional classes.

APRICOT_analysis
    └───├output
            └───├0_predicted_domains # Location for the output data obtained from the subcommand 'predict'
                    └───├cdd_analysis  # Details of domain predicted from CDD for each query
                    |   | query_id-1.txt
                    |   | query_id-2.txt
                    |   | ...
                    |   | query_id-n.txt
                    |
                    └───├ipr_analysis  # Details of domain predicted from InterPro for each query
                        | query_id-1.tsv
                        | query_id-2.tsv
                        | ...
                        | query_id-n.tsv

subcommand `filter`¶

Quick help: $ apricot filter -h

The filtering of the predicted domains by this subcommand take place by using the domain selection terms, hence this subcommand should be executed after ‘select’ and ‘predict’ subcommands.

Query proteins that consist of at least one of the selected domains are retained whereas the rest of the proteins are discarded from the downstream analysis. To limit the analysis to one of the consortiums only, flag -C for CDD based information and -I for InterPro based information can be used.

The users can choose their cut-offs for the parameters by using the flags –similarity, –coverage, –identity, –evalue, –bit (bit score) and –gap. However, the default parameters for the selection of predicted domains are defined as ‘coverage > 39’ and ‘similarity > 24’, which have been derived from a large RNA-binding positive and negative training sets collected from SwissProt database.

usage: apricot filter [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
                                          [--domain_description_file DOMAIN_DESCRIPTION_FILE]
                                          [--similarity SIMILARITY] [--coverage COVERAGE]
                                          [--identity IDENTITY] [--evalue EVALUE] [--gap GAP]
                                          [--bit BIT] [--go_path GO_PATH] [--pred_path PRED_PATH]
                                          [--up_table UP_TABLE] [--xml_info XML_INFO]
                                          [--compile_out COMPILE_OUT] [--selected SELECTED]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --cdd, -C             Filter of domain prediction based on CDD only
  --ipr, -I             Filter of domain prediction based on InterProScan only
  --domain_description_file DOMAIN_DESCRIPTION_FILE, -dd DOMAIN_DESCRIPTION_FILE
                                                Description table of the selected domains
  --similarity SIMILARITY, -sim SIMILARITY
                                                Percent similarity of prediction with reference
  --coverage COVERAGE, -cov COVERAGE
                                                Percent coverage of reference domain in prediction
  --identity IDENTITY, -iden IDENTITY
                                                Percent identity of prediction with reference
  --evalue EVALUE, -eval EVALUE
                                                Evalue of the domain prediction
  --gap GAP, -gap GAP   Percent gap in predicted domain
  --bit BIT, -bit BIT   Bit score in predicted domain
  --go_path GO_PATH, -gp GO_PATH
                                                Go mapping data from fixed database reference files
  --pred_path PRED_PATH, -predp PRED_PATH
                                                Raw files of domain prediction
  --up_table UP_TABLE, -ref UP_TABLE
                                                Uniprot proteome table from UniProt
  --xml_info XML_INFO, -feat XML_INFO
                                                Uniprot proteome table from UniProt
  --compile_out COMPILE_OUT, -co COMPILE_OUT
                                                Data with annotation after filtering
  --selected SELECTED, -sel SELECTED
                                                output path for the selected data with annotations

APRICOT saves all the domain data in the directory ‘1_compiled_domain_information’ of the output folder. All the predicted domains (independent of reference domains and parameter cut-offs) are saved in the sub-folder ‘unfiltered_data’ and the selected data is saved in the sub-folder ‘selected_data’ in separate files for different domain resources as shown below.

Files in the sub-folder ‘selected_data’ contain predicted domain entry based on the reference domain sets and are marked with the tags ParameterSelected when the domain predictions satisfy the defined parameter cut-offs (or default cut-offs) or Parameter Discarded when it does not pass the parameter filters. In those cases, when certain parameter is not available for the predicted domain, a tag ParameterNotApplicable is used.

APRICOT_analysis
    └───├output
        └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'
                    └───├unfiltered_data  # Information of all the domains in the query proteins predicted.
                    |   | cdd_unfiltered_all_prediction.csv  # CDD
                    |   | ipr_unfiltered_all_prediction.csv  # InterPro
                    |
                    └───├selected_data      # Information of the selected reference domains in the query proteins
                        | cdd_filtered.csv                   # CDD
                        | ipr_filtered.csv                   # InterPro

Queries, that are selected on the basis of reference domains and parameter cut-offs, are compiled and stored in the directory ‘2_selected_domain_information’ in the sub-folder ‘combined_data’. These files contain the information of selected domains along with the additional annotations of the query proteins extracted from various resources like UniProt and Gene Ontology .

APRICOT_analysis
    └───├output
            └───├2_selected_domain_information
                    └───├combined_data         # All the selected domain data extended
                        |                       # with the UniProt annotation
                        | annotation_extended_for_selected.csv

Sub-commands for downstream analysis¶

subcommand `classify`¶

Quick help: $ apricot classify -h

This subcommand classifies the resulting domain information of the selected queries by using the result classification terms (provided in the subcommand ‘keywords’).

usage: apricot classify [-h] [--analysis_path ANALYSIS_PATH]
                                                [--selected SELECTED] [--class_kw CLASS_KW]
                                                [--classify CLASSIFY] [--classified CLASSIFIED]
                                                [--db_root DB_ROOT]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --selected SELECTED, -sel SELECTED
                                                Selected data file (from select) with annotations
  --class_kw CLASS_KW, -ck CLASS_KW
                                                Path for keyword files
  --classify CLASSIFY, -cl CLASSIFY
                                                Optional comma separated keyword for result
                                                classification
  --classified CLASSIFIED, -c CLASSIFIED
                                                Classification of selected data based on provided
                                                keywords
  --db_root DB_ROOT, -dr DB_ROOT
                                                Path for keyword files

The classified data are stored in the folder as shown below:

APRICOT_analysis
    └───├output
            └───├2_selected_domain_information
                    └───├classified_data                            # Location for the output data obtained
                        |                                           # from the subcommand 'classify'
                        | classification_key-1_selected_data.csv    # Files containing subsets of predicted data...
                        | classification_key-2_selected_data.csv    # ... based on user provided classification keys.

subcommand `annoscore`¶

This subcommand is executed for the annotation based scoring of the selcted domains in the query proteins.

In order to differentiate domain predictions of low confidence from that of high confidence, the predicted domain sites are compared with their corresponding references and scored by means of methods that measure their similarities by means of various sequence-based features. The comparisons of the features between the predicted domain sites and reference are scored based on the principle of Bayesian probability, where a score closer to 1 represents a favourable scenario.

There are four groups of features that are involved in the annotation based scoring. 1. Chemical properties 2. Needleman-Wunsch global alignment scores 3. Euclidean distances of protein compositions 4. Prediction parameters of the predicted sites

Quick help: $ apricot annoscore -h

usage: apricot annoscore [-h] [--analysis_path ANALYSIS_PATH]
                                                 [--selected SELECTED] [--cdd_pred CDD_PRED]
                                                 [--scored SCORED] [--needle_dir NEEDLE_DIR]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --selected SELECTED, -sel SELECTED
                                                Provided selected protein table
  --cdd_pred CDD_PRED, -cp CDD_PRED
                                                Raw files obtained from CDD based domain prediction
  --scored SCORED, -sco SCORED
                                                Output path for annotation scoring files
  --needle_dir NEEDLE_DIR, -nd NEEDLE_DIR
                                                path for the locally configured EMBOSS suite

The data with annotation scores are stored in the folder as shown below:

APRICOT_analysis
    └───├output
            └───├3_annotation_scoring          # Location for the output data obtained
                |                              # from the subcommand 'annoscore'
                | annotation_extended_for_selected.csv

subcommand `addanno`¶

Quick help: $ apricot addanno -h

This subcommand allows users to further annotate the query sequences that are selected based on the defined functional domains.

Following modules can be used with their respective flags for additional annotations of the selected proteins:

Identification sub-cellular localization of the proteins (flag -psortb)
Secondary structure calculation by RaptorX (flag -raptorx)
Tertiary structure homologs from Protein Data Bank (flag -pdb)
Gene Ontology (flag -go)

usage: apricot addanno [-h] [--force] [--pdb] [--psortb] [--raptorx] [--refss]
                                           [--analysis_path ANALYSIS_PATH]
                                           [--fasta_path FASTA_PATH] [--selected SELECTED]
                                           [--add_out ADD_OUT] [--pdb_path PDB_PATH]
                                           [--psortb_path PSORTB_PATH]
                                           [--raptorx_path RAPTORX_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --force, -F           force flag for the current analysis, removes already
                                                existing predictions
  --pdb, -PDB           Optional annotation of the selected protein by PDB
                                                structure homolog
  --psortb, -PSORTB     Optional annotation of the selected protein by
                                                localization using PsortB
  --raptorx, -RAPTORX   Optional annotation of the selected protein by
                                                secondary structure using RaptorX
  --refss, -REFSS       Optional annotation of the selected protein by
                                                secondary structure using literature reference
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --fasta_path FASTA_PATH, -fp FASTA_PATH
                                                Provide absolute path of fasta files for query
                                                proteins
  --selected SELECTED, -sel SELECTED
                                                Provided selected protein table
  --add_out ADD_OUT, -ao ADD_OUT
                                                Output path for additional annotation data
  --pdb_path PDB_PATH, -pdb_path PDB_PATH
                                                Provide absolute path of APRICOT formatted pdb
                                                database ~pdb/pdb_sequence/pdb_sequence.txt
  --psortb_path PSORTB_PATH, -psortb_path PSORTB_PATH
                                                Provide absolute path of APRICOT installed psortb
  --raptorx_path RAPTORX_PATH, -raptorx_path RAPTORX_PATH
                                                Provide absolute path of APRICOT installed raptorx
                                                till the perl script run_raptorx-ss8.pl

The resulting files are stored in the directory 4_additional_annotations in the corresponding sub-folders, as shown below:

APRICOT_analysis
    └───├output
            └───├4_additional_annotations               # Location for additional annotations for the
                    |                                   # selected queries using subcommand 'addanno'
                    └───├pdb_sequence_prediction        # PDB structure homologs of the selected
                    |                                   # queries (flag --pdb, -PDB)
                    └───├protein_localization           # PSORTb based localization of the selected
                    |                                   # queries (flag --psortb, -PSORTB)
                    └───├protein_secondary_structure    # RaptorX based structure of the selected
                                                        # queries (flag --raptorx, -RAPTORX)

subcommand `summary`¶

Quick help: $ apricot summary -h

To get an overview of the analysis carried out on a set of query proteins, this sub command can be used. It generate information like, how many queries could be mapped to the UniProt IDs, how many contain the reference domains etc., to provide analysis overview.

usage: apricot summary [-h] [--analysis_path ANALYSIS_PATH]
                                           [--query_map QUERY_MAP] [--domains DOMAINS]
                                           [--unfilter_path UNFILTER_PATH]
                                           [--summarized SUMMARIZED]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --query_map QUERY_MAP, -q QUERY_MAP
                                                query_to_uids.txt file created by APRICOT to save
                                                query mapping information
  --domains DOMAINS, -d DOMAINS
                                                File containing all the keyword selected_domains of
                                                interest
  --unfilter_path UNFILTER_PATH, -uf UNFILTER_PATH
                                                Directory with the unfiltered domain data from
                                                output-1 (unfiltered_data)
  --summarized SUMMARIZED, -sum SUMMARIZED
                                                Provide output path

The resulting files are stored in the directory 5_analysis_summary in the corresponding sub-folders, as shown below:

APRICOT_analysis
    └───├output
            └───├5_analysis_summary # Location for the output data obtained from the subcommand 'summary'
                | APRICOT_analysis_summary.csv

subcommand `format`¶

Quick help: $ apricot format -h

Formats and stores various tables in the HTML tabels (–html), excel files (–xlsx) or both.

usage: apricot format [-h] [--analysis_path ANALYSIS_PATH] [--inpath INPATH]
                      [--html] [--xlsx] [--outpath OUTPATH]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                        Provide analysis path
  --inpath INPATH, -i INPATH
                        Choose folder from analysis to be converted
  --html, -HT
  --xlsx, -XL
  --formatted FORMATTED, -form FORMATTED
                        Output path for files with different file formats

The resulting files are stored in the directory format_output_data in the corresponding sub-folders, as shown below:

APRICOT_analysis
    └───├output
            └───├format_output_data # Location for the output data obtained from the subcommand 'format'
                    └───├excel_files               # excel files (flag -XL)
                    └───├html_files                # HTML files (flag -HT)

subcommand `vis`¶

Quick help: $ apricot vis -h

Visualize different resulting data like predicted domains sites, tertiary structure of selected proteins etc.

usage: apricot vis [-h] [--analysis_path ANALYSIS_PATH]
                                   [--ann_score ANN_SCORE] [--add_anno ADD_ANNO]
                                   [--selected SELECTED] [--domain] [--annoscore] [--secstr]
                                   [--localiz] [--msa] [--complete] [--visualized VISUALIZED]

optional arguments:
  -h, --help            show this help message and exit
  --analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
                                                Provide analysis path
  --ann_score ANN_SCORE, -an ANN_SCORE
                                                Provide annotation score file
  --add_anno ADD_ANNO, -ad ADD_ANNO
                                                Provide path to additional annotation
  --selected SELECTED, -sel SELECTED
                                                Provided selected protein table
  --domain, -D          Visualizes predicted domains on the query by
                                                highlighting
  --annoscore, -A       Visualizes overview of prediction statistics
  --secstr, -S          Visualizes secondary structures predicted by RaptorX
  --localiz, -L         Visualizes subcellular localization predcited by
                                                PsortB
  --msa, -M             Visualizes Multiple Sequence Alignments of homologous
                                                sequences from PDB
  --complete, -C        Visualizes all the possible features
  --visualized VISUALIZED, -vi VISUALIZED
                                                Output path for visualization files

The resulting files are stored in the directory visualization_files in the corresponding sub-folders, as shown below:

APRICOT_analysis
    └───├output
            └───├visualization_files # Location for the output data obtained from the subcommand 'vis'
                    └───├domain_highlighting      # Visualizing the domain sites on the protein sequences
                    └───├homologous_pdb_msa       # Multiple sequence alignment of the structure homologs
                    └───├overview_and_statistics  # Visualizing the overview of the selected query proteins
                    └───├secondary_structure      # Visualizing 3-state secondary struvture of the query sequence
                    └───├subcellular_localization # Heatmap showing the probability of different localization sites

Subcommands¶

subcommand default¶

subcommand create¶

subcommand taxid¶

subcommand query¶

subcommand keywords¶

subcommand select¶

subcommand predict¶

subcommand filter¶