Subcommands
APRICOT comprises distinct modules, each designed to carry out a specific task.
Each subcommand requires the path to the analysis folder
(‘APRICOT_analysis’ by default). The available subcommands can be listed
with the help option -h (e.g. apricot -h or python3 APRICOT/bin/apricot -h).
usage: apricot [-h] [--version]
{create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
...
positional arguments:
{create,taxid,query,keywords,select,predict,filter,classify,annoscore,summary,addanno,vis,format}
APRICOT commands - Refer documentation for detail
default Analysis using all the required subcommands at their
default parameters
create Create analysis folders
taxid Download taxonomy ids from UniProt for the user
provided query species
query Map user provided comma separated queries to UniProt
ids
keywords Save user provided keywords for domain selection
(required) and analysis classification (-cl)
select Select functional domains of interest (specified by
keywords) from CDD (-C) and InterPro (-I) by default
predict Predict functional domains in the queries based on CDD
(-C) and InterPro (-I) databases by default
filter Filter queries predicted with domains of interest (and
optional parameter thresholds) and extend their
annotations
classify Optional classification of selected prediction in
smaller groups by class keywords
annoscore Score and rank predicted data by 'annotation scoring'
summary Summary analysis output
addanno Optional annotation of the selected protein by -PDB,
-PSORTB, -RAPTORX or -REFSS (see addanno -h)
vis Visualize analysis results (see vis -h) for detail
format Optional output file format as html or excel
optional arguments:
-h, --help show this help message and exit
--version, -v show version
subcommand default
Quick help: $ apricot default -h
This subcommand runs the analysis pipeline of the software with the
default parameters. By default it includes the subcommands create, keywords,
select, predict, filter, classify, annoscore, summary and format, which are
discussed in detail below. Users must supply two inputs: -i for the query
proteins (for example, UniProt ids) and -kw for the keywords indicating the
functions of interest (for example, a list of RNA-binding domains ‘RRM,KH,RNP’).
The basic syntax to call this subcommand is:
$ apricot default -i {query proteins} -kw {functions of interest}
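As a minimal sketch, the call above can also be assembled from Python lists before being handed to a shell or to subprocess. The helper name build_default_command and the two UniProt accessions are illustrative assumptions, not part of APRICOT; only the flags -i, -kw and -cl are taken from this page.

```python
import shlex


def build_default_command(uniprot_ids, domain_keywords, class_keywords=None):
    """Assemble an `apricot default` argument list (hypothetical helper)."""
    argv = [
        "apricot", "default",
        "-i", ",".join(uniprot_ids),        # comma separated UniProt IDs
        "-kw", ",".join(domain_keywords),   # comma separated domain selection keywords
    ]
    if class_keywords:
        # optional comma separated keywords for result classification
        argv += ["-cl", ",".join(class_keywords)]
    return argv


# Illustrative accessions and the keyword list used as an example in the text.
argv = build_default_command(["P0A6X3", "P00957"], ["RRM", "KH", "RNP"])
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv) on a machine where APRICOT is installed; nothing is executed in the sketch itself.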
Several optional arguments associated with other subcommands have been included in default.
Please check the usage for details:
usage: apricot default [-h]
Here are a few useful flags, which can be used with this subcommand:
--uids, -i Comma separated UniProt IDs
--kw_domain, -kw Comma separated keywords for domain selection
--classify, -cl Optional comma separated keyword for result classification
--cdd, -C Uses only CDD
--ipr, -I Uses only InterPro
--needle_dir, -nd path for the locally configured EMBOSS suite
--skip_select Skips running the subcommand 'select'
--skip_annoscore Skips running the subcommand 'annoscore'
--taxid, -tx Select taxonomy id for query species
--geneids, -gi Comma separated query genes
--proteome, -P Analyze entire proteome
--fasta, -fa Analyze fasta sequences
--force, -F force flag, removes existing files generated in the previous analysis
--db_root, -dr Used to get the absolute path of domain annotation files
--similarity, -sim Percent similarity of prediction with reference
--coverage, -cov Percent coverage of reference domain in prediction
--identity, -iden Percent identity of prediction with reference
--evalue, -eval Evalue of the domain prediction
--gap, -gap Percent gap in predicted domain
--bit, -bit Bit score in predicted domain
--xlsx, -XL create output files in excel file-format
subcommand create
Quick help: $ apricot create -h
This subcommand creates all the required directories to store input and output data acquired from APRICOT analysis. The main analysis folder can be provided by the users (default name: APRICOT_analysis).
usage: apricot create [-h] analysis_path
positional arguments:
analysis_path Creates APRICOT_analysis folder for anlysis unless other
name/path is provided
The structure and annotation of directories and the enclosing files in the ‘input’ folder in the analysis directory:
APRICOT_analysis
└───├input
    └───├query_proteins
    └───├uniprot_reference_table
    └───├mapped_query_annotation
The structure of directories and the enclosing files in the ‘output’ folder in the analysis directory:
APRICOT_analysis
└───├output
    └───├0_predicted_domains            # Location for the output data obtained from the subcommand 'predict'
    └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'
    └───├2_selected_domain_information  # Location for the classified data obtained from the subcommand 'classify'
    └───├3_annotation_scoring           # Location for the output data obtained from the subcommand 'annoscore'
    └───├4_additional_annotations       # Location for additional annotations for the selected
    |                                   # queries using subcommand 'addanno'
    └───├5_analysis_summary             # Location for the output data obtained from the subcommand 'summary'
    └───├format_output_data             # Location for the output data obtained from the subcommand 'format'
    └───├visualization_files            # Location for the output data obtained from the subcommand 'vis'
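The folder layout listed above can be reproduced with a few lines of Python, which may help when preparing or inspecting an analysis directory by hand. This is only a sketch under the assumption that the folder names are exactly those shown in the listings; the function create_layout is hypothetical and is not APRICOT's own implementation of the create subcommand.

```python
from pathlib import Path
import tempfile

# Folder names copied from the directory listings on this page.
INPUT_DIRS = [
    "query_proteins",
    "uniprot_reference_table",
    "mapped_query_annotation",
]
OUTPUT_DIRS = [
    "0_predicted_domains",
    "1_compiled_domain_information",
    "2_selected_domain_information",
    "3_annotation_scoring",
    "4_additional_annotations",
    "5_analysis_summary",
    "format_output_data",
    "visualization_files",
]


def create_layout(analysis_path="APRICOT_analysis"):
    """Create the input/output folders under analysis_path (hypothetical helper)."""
    root = Path(analysis_path)
    for name in INPUT_DIRS:
        (root / "input" / name).mkdir(parents=True, exist_ok=True)
    for name in OUTPUT_DIRS:
        (root / "output" / name).mkdir(parents=True, exist_ok=True)
    return root


if __name__ == "__main__":
    # Build the layout in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        root = create_layout(Path(tmp) / "APRICOT_analysis")
        for folder in sorted(p.relative_to(root) for p in root.rglob("*")):
            print(folder)
```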
subcommand taxid
Quick help: $ apricot taxid -h
Users can provide gene ids or protein names as queries to APRICOT, which are mapped against the UniProt Knowledgebase to extract relevant information. Since the same gene/protein ids exist across various genomes/proteomes, the search can be limited to a certain organism (rather than all the organisms in the database) by providing one or more taxonomy ids.
When the taxonomy id is not known, the subcommand taxid can be used to retrieve it by providing the species name.
usage: apricot taxid [-h] [--species SPECIES] db_path
positional arguments:
db_path
optional arguments:
-h, --help show this help message and exit
--species SPECIES, -s SPECIES
Species name (comma separated if more than one) for
taxonomy id retreival
The taxonomy ids are saved in the text file taxonomy_ids.txt in the directory reference_db_files.
source_files
└───├reference_db_files
    |   taxonomy_ids.txt
subcommand query
Quick help: $ apricot query -h
As mentioned already, APRICOT gives users multiple options to supply queries. For example, the queries can be provided as UniProt ids (--uids), gene ids or protein names (--geneids), fasta sequences (--fasta) or only the taxonomy id (--taxid) for a complete proteome analysis (using flag -P).
Paths for saving the query data and the corresponding fasta files, xml files, annotation tables etc. can optionally be provided by the users.
usage: apricot query [-h] [--analysis_path ANALYSIS_PATH] [--uids UIDS]
[--taxid TAXID] [--geneids GENEIDS] [--proteome]
[--fasta] [--query_path QUERY_PATH]
[--proteome_path PROTEOME_PATH] [--xml_path XML_PATH]
[--fasta_path FASTA_PATH] [--feature_table FEATURE_TABLE]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Main analysis path
--uids UIDS, -i UIDS Comma separated UniProt IDs
--taxid TAXID, -tx TAXID
Select taxonomy id for query species
--geneids GENEIDS, -gi GENEIDS
Comma separated query genes
--proteome, -P Analyze entire proteome
--fasta, -fa Analyze fasta sequences
--query_path QUERY_PATH, -qp QUERY_PATH
Get proteome table from UniProt
--proteome_path PROTEOME_PATH, -pp PROTEOME_PATH
Get proteome table from UniProt
--xml_path XML_PATH, -xp XML_PATH
Get proteome table from UniProt
--fasta_path FASTA_PATH, -fp FASTA_PATH
Get proteome table from UniProt
--feature_table FEATURE_TABLE, -ft FEATURE_TABLE
Get proteome table from UniProt
APRICOT saves the user provided queries and related information extracted from UniProt knowledgebase (fasta files, xml files, reference files etc.) in the directories as described below.
APRICOT_analysis
└───├input
    └───├query_proteins
    |       query_to_uids.txt               # User provided queries (gene ids/protein names/whole proteome set)
    |                                       # mapped to the UniProt Ids (flag --uids, --geneids)
    └───├uniprot_reference_table
    |       query_uids_reference.tab        # Basic annotations of the query protein IDs (flag --uids, --geneids) set
    |                                       # or the whole proteome (flag -P) from a certain taxonomy (flag --taxid)
    └───├mapped_query_annotation
        └───├fasta_path_mapped_query        # Location for protein FASTA sequences of each query
        |   |                               # query fasta sequences are also saved here (flag --fasta)
        |   |   query_id-1.fasta
        |   |   query_id-2.fasta
        |   |   ...
        |   |   query_id-n.fasta
        |
        └───├xml_path_mapped_query          # Location for UniProt XML files of each query
        |   |   query_id-1.xml
        |   |   query_id-2.xml
        |   |   ...
        |   |   query_id-n.xml
        |
        └───├mapped_protein_xml_info_tables
            |   query_feature_table.csv     # File containing all the features of the queries
                                            # obtained by parsing xml files
subcommand keywords
Quick help: $ apricot keywords -h
Since APRICOT identifies protein classes such as RNA-binding proteins by means of domains, one of the most essential inputs, besides the query proteins themselves, is a comma-separated list of terms or keywords that indicate the protein functional classes of interest (domain selection terms). Such terms can be Pfam ids, Gene Ontology terms, MeSH terms, simple biological terms like ‘RRM’ and ‘ribosome’, or a combination of all these types.
Multi-word terms can be provided by using ‘-’ as a connector, for example, ‘RNA-binding’ and ‘La-domain’.
To keep the selection of truly functional domains stringent, APRICOT by default does not select a domain entry if the domain selection term occurs in its annotation with an attached prefix or suffix. A few relevant entries may be missed this way, but it also excludes many non-relevant domains that might otherwise be included by chance. Users can, however, allow a prefix by placing the hash symbol (#) at the beginning of a term and a suffix by placing # at the end of the term. For example, ‘#RNA-binding’ allows the inclusion of ‘tRNA-binding’, ‘mtRNA-binding’ etc., and ‘RNA-bind#’ allows varying forms of bind such as binder, binding etc. Both prefixes and suffixes can be allowed at once (#RNA-bind#).
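The prefix/suffix rule can be illustrated with a small regular-expression sketch. It is not APRICOT's actual matching code: the function keyword_to_regex is hypothetical, and treating ‘-’ in a term as matching either a hyphen or whitespace in the annotation, as well as ignoring letter case, are assumptions made for the example.

```python
import re


def keyword_to_regex(term):
    """Translate a domain selection term into a regex (hypothetical helper).

    A leading '#' allows any prefix, a trailing '#' allows any suffix;
    without '#', the term must not be glued to other letters or digits.
    """
    allow_prefix = term.startswith("#")
    allow_suffix = term.endswith("#")
    core = term.strip("#")
    # Assumption: the '-' connector may appear as a hyphen or a space in annotations.
    body = r"[-\s]".join(re.escape(word) for word in core.split("-"))
    left = r"(?<![A-Za-z0-9])" + (r"[A-Za-z0-9]*" if allow_prefix else "")
    right = (r"[A-Za-z0-9]*" if allow_suffix else "") + r"(?![A-Za-z0-9])"
    return re.compile(left + body + right, re.IGNORECASE)


annotations = ["RNA-binding domain", "tRNA-binding protein", "an RNA-binder"]
for term in ["RNA-binding", "#RNA-binding", "RNA-bind#", "#RNA-bind#"]:
    pattern = keyword_to_regex(term)
    hits = [text for text in annotations if pattern.search(text)]
    print(term, "->", hits)
```

The lookbehind and lookahead assertions are what enforce the default stringency: the bare term ‘RNA-binding’ is rejected inside ‘tRNA-binding’ because a letter directly precedes it.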
Optionally, a second set of keywords for the classification of predicted domains can be provided with the flag -cl (result classification terms). This list can comprise terms associated with biological functions, enzymatic activities or specific features. For example, the predicted RNA-related domain data could be divided into the classification tags RRM, ribosome, synthetase, helicase etc. Such classification helps users navigate large datasets or select representative proteins for a certain function conferred by the domains. When users do not provide result classification terms, APRICOT uses the domain selection terms for this purpose as well.
usage: apricot keywords [-h] [--classify CLASSIFY] [--db_root DB_ROOT]
kw_domain
positional arguments:
kw_domain Comma separated keywords for domain selection
optional arguments:
-h, --help show this help message and exit
--classify CLASSIFY, -cl CLASSIFY
Optional comma separated keyword for result
classification
--db_root DB_ROOT, -dr DB_ROOT
Path for keyword files
The keywords are saved in the directory source_files in the subfolder domain_data shown below.
source_files
└───├domain_data
        keywords_for_domain_selection.txt       # All the terms for domain selection
        keywords_for_result_classification.txt  # All the terms for result classification
subcommand select
Quick help: $ apricot select -h
This subcommand selects reference domains based on the domain selection terms (see subcommand keywords). For this purpose, APRICOT by default scans each domain entry in both the CDD and InterPro domain consortiums for the occurrence of any domain selection term.
For multi-word terms (which are provided by using ‘-’ as a connector), the co-occurrence of the words is considered when they appear in the same sentence or the same context. To ensure a more complete selection of the domains, the Gene Ontology terms associated with the domains are also checked and selected accordingly.
The selection process can be limited to one of the consortiums by providing the flag -C for CDD or -I for InterPro. To cross-map the domains between the two consortiums, APRICOT uses domain ids from the databases (Pfam, SMART and TIGRFAM) that are shared by both consortiums.
usage: apricot select [-h] [--cdd] [--ipr] [--skip_select] [--dom_kw DOM_KW]
[--db_root DB_ROOT]
optional arguments:
-h, --help show this help message and exit
--cdd, -C Selects functional domains of interest from CDD
--ipr, -I Selects functional domains of interest from CDD
--skip_select, -skip_select
Skips running the subcommand 'select'
--dom_kw DOM_KW, -dk DOM_KW
Absolute path of keyword files
--db_root DB_ROOT, -dr DB_ROOT
Uses to get absolute path of domain annoation files,
keyword selected domains
The domains that are selected from CDD and InterPro are stored in the directory domain_data in the bin folder. The selected domains are compiled and saved into the file all_keyword_selected_domain_data.tab in domain_data.
bin
│   ...
└───├domain_data
    └───├cdd
    └───├interpro
    |   all_keyword_selected_domain_data.tab
subcommand predict
Quick help: $ apricot predict -h
This subcommand starts the domain prediction in the query proteins against all available functional domains, using RPSBLAST against CDD and InterProScan against InterPro. By default APRICOT carries out the domain prediction from both the CDD and InterPro consortiums, but users can choose to predict domains from only one of the databases by using the flag -C for CDD or -I for InterPro. To overwrite old predictions, the flag -F (for force) can be used.
The run time of RPSBLAST is considerably shorter, so the -C flag can be used to obtain a quick overview of the functional domains. However, we recommend the default setting because the different databases involved in the two consortiums provide a larger scope for completeness of domain predictions.
The execution of this subcommand is the basic requirement for the APRICOT analysis. The main input of this step is the fasta sequences of the query proteins. This subcommand can be executed simultaneously with, or even before, the subcommand ‘select’.
usage: apricot predict [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
[--force] [--cdd_db CDD_DB] [--ipr_db IPR_DB]
[--predicted PREDICTED] [--fasta_path FASTA_PATH]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide output path for the analysis result of the
chosen method
--cdd, -C domain prediction based on CDD only
--ipr, -I domain prediction based on InterProScan only
--force, -F force flag for the current analysis, removes already
existing predictions
--cdd_db CDD_DB, -cdb CDD_DB
Provide absolute path of CDD databases based on the
chosen method
--ipr_db IPR_DB, -idb IPR_DB
Provide absolute path of InterPro databases based on
the chosen method
--predicted PREDICTED, -pred PREDICTED
Provide output path for domain prediction files
--fasta_path FASTA_PATH, -fp FASTA_PATH
Provide absolute path of fasta files for query
proteins
The resulting files of this analysis are stored in the first analysis directory ‘0_predicted_domains’ in the output folder of the main analysis directory. As shown below, the domain predictions are stored as text files in the sub-folders corresponding to the domain consortiums. Since this subcommand is independent of the reference domains, the files containing the domain predictions can be reused or revisited for the selection of different functional classes.
APRICOT_analysis
└───├output
    └───├0_predicted_domains        # Location for the output data obtained from the subcommand 'predict'
        └───├cdd_analysis           # Details of domain predicted from CDD for each query
        |   |   query_id-1.txt
        |   |   query_id-2.txt
        |   |   ...
        |   |   query_id-n.txt
        |
        └───├ipr_analysis           # Details of domain predicted from InterPro for each query
            |   query_id-1.tsv
            |   query_id-2.tsv
            |   ...
            |   query_id-n.tsv
subcommand filter
Quick help: $ apricot filter -h
This subcommand filters the predicted domains by using the domain selection terms, hence it should be executed after the ‘select’ and ‘predict’ subcommands.
Query proteins that contain at least one of the selected domains are retained, whereas the remaining proteins are discarded from the downstream analysis. To limit the analysis to one of the consortiums, the flag -C (CDD based information) or -I (InterPro based information) can be used.
Users can choose their own cut-offs for the parameters with the flags --similarity, --coverage, --identity, --evalue, --bit (bit score) and --gap. The default parameters for the selection of predicted domains are ‘coverage > 39’ and ‘similarity > 24’, which were derived from large RNA-binding positive and negative training sets collected from the SwissProt database.
usage: apricot filter [-h] [--analysis_path ANALYSIS_PATH] [--cdd] [--ipr]
[--domain_description_file DOMAIN_DESCRIPTION_FILE]
[--similarity SIMILARITY] [--coverage COVERAGE]
[--identity IDENTITY] [--evalue EVALUE] [--gap GAP]
[--bit BIT] [--go_path GO_PATH] [--pred_path PRED_PATH]
[--up_table UP_TABLE] [--xml_info XML_INFO]
[--compile_out COMPILE_OUT] [--selected SELECTED]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--cdd, -C Filter of domain prediction based on CDD only
--ipr, -I Filter of domain prediction based on InterProScan only
--domain_description_file DOMAIN_DESCRIPTION_FILE, -dd DOMAIN_DESCRIPTION_FILE
Description table of the selected domains
--similarity SIMILARITY, -sim SIMILARITY
Percent similarity of prediction with reference
--coverage COVERAGE, -cov COVERAGE
Percent coverage of reference domain in prediction
--identity IDENTITY, -iden IDENTITY
Percent identity of prediction with reference
--evalue EVALUE, -eval EVALUE
Evalue of the domain prediction
--gap GAP, -gap GAP Percent gap in predicted domain
--bit BIT, -bit BIT Bit score in predicted domain
--go_path GO_PATH, -gp GO_PATH
Go mapping data from fixed database reference files
--pred_path PRED_PATH, -predp PRED_PATH
Raw files of domain prediction
--up_table UP_TABLE, -ref UP_TABLE
Uniprot proteome table from UniProt
--xml_info XML_INFO, -feat XML_INFO
Uniprot proteome table from UniProt
--compile_out COMPILE_OUT, -co COMPILE_OUT
Data with annotation after filtering
--selected SELECTED, -sel SELECTED
output path for the selected data with annotations
APRICOT saves all the domain data in the directory ‘1_compiled_domain_information’ of the output folder. All the predicted domains (independent of reference domains and parameter cut-offs) are saved in the sub-folder ‘unfiltered_data’ and the selected data is saved in the sub-folder ‘selected_data’ in separate files for different domain resources as shown below.
Files in the sub-folder ‘selected_data’ contain predicted domain entries based on the reference domain sets. Each entry is marked with the tag ParameterSelected when the domain prediction satisfies the defined (or default) parameter cut-offs, or ParameterDiscarded when it does not pass the parameter filters. When a certain parameter is not available for the predicted domain, the tag ParameterNotApplicable is used.
APRICOT_analysis
└───├output
    └───├1_compiled_domain_information              # Location for the output data obtained from the subcommand 'filter'
        └───├unfiltered_data                        # Information of all the domains in the query proteins predicted.
        |   |   cdd_unfiltered_all_prediction.csv   # CDD
        |   |   ipr_unfiltered_all_prediction.csv   # InterPro
        |
        └───├selected_data                          # Information of the selected reference domains in the query proteins
            |   cdd_filtered.csv                    # CDD
            |   ipr_filtered.csv                    # InterPro
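The cut-off and tagging logic described for this subcommand can be sketched as follows. This is an illustration only, not APRICOT's code: the function tag_prediction, the dictionary keys, the exact spelling of the tags and the choice to report ParameterNotApplicable as soon as any required value is missing are all assumptions; only the default cut-offs ‘coverage > 39’ and ‘similarity > 24’ come from the text.

```python
# Default cut-offs quoted in the text: coverage > 39 and similarity > 24.
DEFAULT_CUTOFFS = {"coverage": 39.0, "similarity": 24.0}


def tag_prediction(prediction, cutoffs=None):
    """Return a filter tag for one predicted domain (hypothetical helper).

    prediction maps parameter names to percent values; a missing or None
    value means the parameter is not available for that prediction.
    """
    cutoffs = DEFAULT_CUTOFFS if cutoffs is None else cutoffs
    # Assumption: a missing parameter takes precedence over a failed cut-off.
    if any(prediction.get(name) is None for name in cutoffs):
        return "ParameterNotApplicable"
    # Every parameter must be strictly greater than its cut-off.
    if all(prediction[name] > minimum for name, minimum in cutoffs.items()):
        return "ParameterSelected"
    return "ParameterDiscarded"


examples = [
    {"coverage": 80.0, "similarity": 50.0},
    {"coverage": 39.0, "similarity": 50.0},   # coverage is not strictly above 39
    {"coverage": 80.0},                       # similarity not available
]
for prediction in examples:
    print(prediction, "->", tag_prediction(prediction))
```

User-defined thresholds, as set with flags such as --coverage or --similarity, would correspond to passing a different cutoffs dictionary.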
Queries that are selected on the basis of reference domains and parameter cut-offs are compiled and stored in the directory ‘2_selected_domain_information’ in the sub-folder ‘combined_data’. These files contain the information on the selected domains along with additional annotations of the query proteins extracted from resources like UniProt and Gene Ontology.
APRICOT_analysis
└───├output
    └───├2_selected_domain_information
        └───├combined_data      # All the selected domain data extended
            |                   # with the UniProt annotation
            |   annotation_extended_for_selected.csv
Sub-commands for downstream analysis
subcommand classify
Quick help: $ apricot classify -h
This subcommand classifies the resulting domain information of the selected queries by using the result classification terms (provided in the subcommand ‘keywords’).
usage: apricot classify [-h] [--analysis_path ANALYSIS_PATH]
[--selected SELECTED] [--class_kw CLASS_KW]
[--classify CLASSIFY] [--classified CLASSIFIED]
[--db_root DB_ROOT]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--selected SELECTED, -sel SELECTED
Selected data file (from select) with annotations
--class_kw CLASS_KW, -ck CLASS_KW
Path for keyword files
--classify CLASSIFY, -cl CLASSIFY
Optional comma separated keyword for result
classification
--classified CLASSIFIED, -c CLASSIFIED
Classification of selected data based on provided
keywords
--db_root DB_ROOT, -dr DB_ROOT
Path for keyword files
The classified data are stored in the folder as shown below:
APRICOT_analysis
└───├output
    └───├2_selected_domain_information
        └───├classified_data    # Location for the output data obtained
            |                   # from the subcommand 'classify'
            |   classification_key-1_selected_data.csv   # Files containing subsets of predicted data...
            |   classification_key-2_selected_data.csv   # ... based on user provided classification keys.
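The idea of splitting the selected data into one subset per classification key can be sketched as below. This is a simplified illustration, not the classify implementation: the function classify_records, the record field name "description" and the plain case-insensitive substring match are assumptions, and the example records are made up.

```python
def classify_records(records, class_keywords, field="description"):
    """Group records by the classification keywords found in one text field.

    Hypothetical helper: a record is added to every group whose keyword
    occurs (case-insensitively) in record[field].
    """
    groups = {keyword: [] for keyword in class_keywords}
    for record in records:
        text = record.get(field, "").lower()
        for keyword in class_keywords:
            if keyword.lower() in text:
                groups[keyword].append(record)
    return groups


# Made-up records standing in for rows of the selected domain table.
records = [
    {"query": "query_id-1", "description": "RRM domain, RNA recognition motif"},
    {"query": "query_id-2", "description": "Ribosomal protein S1; ribosome component"},
    {"query": "query_id-3", "description": "ATP-dependent helicase"},
]
groups = classify_records(records, ["RRM", "ribosome", "synthetase"])
for keyword, members in groups.items():
    # One subset per key, mirroring the per-key files in classified_data.
    print(keyword, [record["query"] for record in members])
```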
subcommand annoscore
This subcommand carries out the annotation-based scoring of the selected domains in the query proteins.
In order to differentiate low-confidence domain predictions from high-confidence ones, the predicted domain sites are compared with their corresponding references and scored by methods that measure their similarity across various sequence-based features. The feature comparisons between the predicted domain sites and the references are scored based on the principle of Bayesian probability, where a score closer to 1 represents a favourable scenario.
Four groups of features are involved in the annotation-based scoring:
1. Chemical properties
2. Needleman-Wunsch global alignment scores
3. Euclidean distances of protein compositions
4. Prediction parameters of the predicted sites
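To make one of these feature groups concrete, the following sketch computes the Euclidean distance between the amino-acid compositions of a predicted site and a reference domain. It is a minimal illustration under stated assumptions: the function names are hypothetical, the two sequences are made up, compositions are taken as relative frequencies of the 20 standard residues, and the way APRICOT turns such distances into its final score is not shown here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequence):
    """Relative frequency of each standard amino acid in a sequence."""
    counts = Counter(sequence.upper())
    total = sum(counts[residue] for residue in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[residue] / total for residue in AMINO_ACIDS]


def composition_distance(predicted_site, reference_domain):
    """Euclidean distance between two composition vectors (0 means identical)."""
    return math.dist(composition(predicted_site), composition(reference_domain))


# Made-up sequences used only to exercise the functions.
predicted = "MKTLLVAGGSGRIGKAL"
reference = "MKTILVTGGAGFIGSHL"
print(round(composition_distance(predicted, reference), 3))
```

A smaller distance indicates that the predicted site and its reference have more similar residue compositions.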
Quick help: $ apricot annoscore -h
usage: apricot annoscore [-h] [--analysis_path ANALYSIS_PATH]
[--selected SELECTED] [--cdd_pred CDD_PRED]
[--scored SCORED] [--needle_dir NEEDLE_DIR]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--selected SELECTED, -sel SELECTED
Provided selected protein table
--cdd_pred CDD_PRED, -cp CDD_PRED
Raw files obtained from CDD based domain prediction
--scored SCORED, -sco SCORED
Output path for annotation scoring files
--needle_dir NEEDLE_DIR, -nd NEEDLE_DIR
path for the locally configured EMBOSS suite
The data with annotation scores are stored in the folder as shown below:
APRICOT_analysis
└───├output
    └───├3_annotation_scoring   # Location for the output data obtained
        |                       # from the subcommand 'annoscore'
        |   annotation_extended_for_selected.csv
subcommand addanno
Quick help: $ apricot addanno -h
This subcommand allows users to further annotate the query sequences that are selected based on the defined functional domains.
The following modules can be used with their respective flags for additional annotations of the selected proteins:
- Identification of the subcellular localization of the proteins (flag --psortb)
- Secondary structure calculation by RaptorX (flag --raptorx)
- Tertiary structure homologs from Protein Data Bank (flag --pdb)
- Gene Ontology (flag -go)
usage: apricot addanno [-h] [--force] [--pdb] [--psortb] [--raptorx] [--refss]
[--analysis_path ANALYSIS_PATH]
[--fasta_path FASTA_PATH] [--selected SELECTED]
[--add_out ADD_OUT] [--pdb_path PDB_PATH]
[--psortb_path PSORTB_PATH]
[--raptorx_path RAPTORX_PATH]
optional arguments:
-h, --help show this help message and exit
--force, -F force flag for the current analysis, removes already
existing predictions
--pdb, -PDB Optional annotation of the selected protein by PDB
structure homolog
--psortb, -PSORTB Optional annotation of the selected protein by
localization using PsortB
--raptorx, -RAPTORX Optional annotation of the selected protein by
secondary structure using RaptorX
--refss, -REFSS Optional annotation of the selected protein by
secondary structure using literature reference
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--fasta_path FASTA_PATH, -fp FASTA_PATH
Provide absolute path of fasta files for query
proteins
--selected SELECTED, -sel SELECTED
Provided selected protein table
--add_out ADD_OUT, -ao ADD_OUT
Output path for additional annotation data
--pdb_path PDB_PATH, -pdb_path PDB_PATH
Provide absolute path of APRICOT formatted pdb
database ~pdb/pdb_sequence/pdb_sequence.txt
--psortb_path PSORTB_PATH, -psortb_path PSORTB_PATH
Provide absolute path of APRICOT installed psortb
--raptorx_path RAPTORX_PATH, -raptorx_path RAPTORX_PATH
Provide absolute path of APRICOT installed raptorx
till the perl script run_raptorx-ss8.pl
The resulting files are stored in the directory 4_additional_annotations in the corresponding sub-folders, as shown below:
APRICOT_analysis
└───├output
    └───├4_additional_annotations           # Location for additional annotations for the
        |                                   # selected queries using subcommand 'addanno'
        └───├pdb_sequence_prediction        # PDB structure homologs of the selected
        |                                   # queries (flag --pdb, -PDB)
        └───├protein_localization           # PSORTb based localization of the selected
        |                                   # queries (flag --psortb, -PSORTB)
        └───├protein_secondary_structure    # RaptorX based structure of the selected
                                            # queries (flag --raptorx, -RAPTORX)
subcommand summary
Quick help: $ apricot summary -h
This subcommand provides an overview of the analysis carried out on a set of query proteins. It reports information such as how many queries could be mapped to UniProt IDs and how many contain the reference domains.
usage: apricot summary [-h] [--analysis_path ANALYSIS_PATH]
[--query_map QUERY_MAP] [--domains DOMAINS]
[--unfilter_path UNFILTER_PATH]
[--summarized SUMMARIZED]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--query_map QUERY_MAP, -q QUERY_MAP
query_to_uids.txt file created by APRICOT to save
query mapping information
--domains DOMAINS, -d DOMAINS
File containing all the keyword selected_domains of
interest
--unfilter_path UNFILTER_PATH, -uf UNFILTER_PATH
Directory with the unfiltered domain data from
output-1 (unfiltered_data)
--summarized SUMMARIZED, -sum SUMMARIZED
Provide output path
The resulting files are stored in the directory 5_analysis_summary in the corresponding sub-folders, as shown below:
APRICOT_analysis
└───├output
    └───├5_analysis_summary     # Location for the output data obtained from the subcommand 'summary'
        |   APRICOT_analysis_summary.csv
subcommand format
Quick help: $ apricot format -h
Formats and stores various tables as HTML tables (--html), Excel files (--xlsx) or both.
usage: apricot format [-h] [--analysis_path ANALYSIS_PATH] [--inpath INPATH]
[--html] [--xlsx] [--outpath OUTPATH]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--inpath INPATH, -i INPATH
Choose folder from analysis to be converted
--html, -HT
--xlsx, -XL
--formatted FORMATTED, -form FORMATTED
Output path for files with different file formats
The resulting files are stored in the directory format_output_data in the corresponding sub-folders, as shown below:
APRICOT_analysis
└───├output
    └───├format_output_data     # Location for the output data obtained from the subcommand 'format'
        └───├excel_files        # excel files (flag -XL)
        └───├html_files         # HTML files (flag -HT)
subcommand vis
Quick help: $ apricot vis -h
Visualizes different result data such as predicted domain sites, tertiary structures of selected proteins etc.
usage: apricot vis [-h] [--analysis_path ANALYSIS_PATH]
[--ann_score ANN_SCORE] [--add_anno ADD_ANNO]
[--selected SELECTED] [--domain] [--annoscore] [--secstr]
[--localiz] [--msa] [--complete] [--visualized VISUALIZED]
optional arguments:
-h, --help show this help message and exit
--analysis_path ANALYSIS_PATH, -ap ANALYSIS_PATH
Provide analysis path
--ann_score ANN_SCORE, -an ANN_SCORE
Provide annotation score file
--add_anno ADD_ANNO, -ad ADD_ANNO
Provide path to additional annotation
--selected SELECTED, -sel SELECTED
Provided selected protein table
--domain, -D Visualizes predicted domains on the query by
highlighting
--annoscore, -A Visualizes overview of prediction statistics
--secstr, -S Visualizes secondary structures predicted by RaptorX
--localiz, -L Visualizes subcellular localization predcited by
PsortB
--msa, -M Visualizes Multiple Sequence Alignments of homologous
sequences from PDB
--complete, -C Visualizes all the possible features
--visualized VISUALIZED, -vi VISUALIZED
Output path for visualization files
The resulting files are stored in the directory visualization_files in the corresponding sub-folders, as shown below:
APRICOT_analysis
└───├output
    └───├visualization_files            # Location for the output data obtained from the subcommand 'vis'
        └───├domain_highlighting        # Visualizing the domain sites on the protein sequences
        └───├homologous_pdb_msa         # Multiple sequence alignment of the structure homologs
        └───├overview_and_statistics    # Visualizing the overview of the selected query proteins
        └───├secondary_structure        # Visualizing 3-state secondary structure of the query sequence
        └───├subcellular_localization   # Heatmap showing the probability of different localization sites