Tutorial¶
This tutorial provides an easy way to test different modules of APRICOT and understand the basic usage of this pipeline.
Requirements:¶
- Get APRICOT software
The software can be used via Docker container, or can be installed locally as explained here.
- Get the Shell script for running the software
Use wget or curl to download the script run_example.sh inside the Docker container (when working with Docker) or locally (when using the locally installed software), which will assist you in following the tutorial.
$ wget https://raw.githubusercontent.com/malvikasharan/APRICOT/master/shell_scripts/run_example.sh
OR
$ curl -O https://raw.githubusercontent.com/malvikasharan/APRICOT/master/shell_scripts/run_example.sh
Not required for this tutorial¶
APRICOT requires supporting data in order to carry out analysis of the user provided queries. In addition, for carrying out additional annotation of the proteins with biological features, it uses locally installed Psort and RaptorX executable. These requirements have been discussed in detail in the documentation. We will NOT require those for this tutorial, instead we have provided pre-processed files that should allow you to test the functionality of the tool. The test files are avaialable at: Zenodo record.
Example analysis¶
APRICOT can be executed by python (python3 or up) on Unix-flavors (GNU/Linux, BSD, OS X).
We will go through the shell script run_example.sh step by step. This script assumes that users have installed the software globally using pip.
Check the subcommands:
apricot -h
or
If you are using locally installed git repository
Please specifiy the the path where APRICOT modules are located (see
below: APRICOT_PATH) and run the tool using
python $APRICOT_PATH/bin/apricot
by editing
APRICOT_CALL="python3 ../bin/apricot"
(which would have the
advantage that you can specify the Python version you would like to use)
or ./$APRICOT_PATH/bin/apricot
command.
python $APRICOT_PATH/bin/apricot -h
If you get an error message
system_test.sh: 164: system_test.sh: ../bin/apricot: Permission denied ...
,
please open the script run_example.sh
and and edit the
APRICOT_CALL="../bin/apricot"
to
APRICOT_CALL="python3 ../bin/apricot"
(instead of python3, you can
use python or define any version of python, or instead use ./).
Defining paths¶
Users can set the following paths:
## User defined PATHS (default paths are given)
# Path where the analysis data by APRICOT will be stored
ANALYSIS_PATH=APRICOT_analysis
# Path where the source files and databases will be stored
ROOT_DB_PATH=source_files
# Path for EMBOSS suite with locally installed needle cline
NEEDLE_EMBOSS_PATH=/home/emboss/needle # Default path
# Path where APRICOT modules are located, by-default we use APRICOT as
# the name of git library
APRICOT_PATH=APRICOT
APRICOT stores databases and source files in these paths.
## FIXED PATHS
# Source data downloaded by APRICOT
DB_PATH=$ROOT_DB_PATH/reference_db_files
## PATHS for domain databases
CDD_PATH=$DB_PATH/cdd/Cdd
INTERPRO_PATH=$DB_PATH/interpro/interproscan
Creating analysis folders¶
- We will create each of the paths that we defined above in the section above:
donwload_files(){
mkdir -p $ROOT_DB_PATH $APRICOT_PATH $ANALYSIS_PATH $APRICOT_LIBRARY
for db_subpath in cdd go_mapping interpro pfam
do
mkdir -p $DB_PATH/$db_subpath
done
[...]
}
This will create a main folder source_files
with subfolder as shown
below.
source_files
└───├domain_data # Location for the files containing keywords for domain selection and subsequently selected domains
|
└───├emboss # EMBOSS package containing needle software
|
└───├reference_db_files
└───├all_taxid # Taxonomy ids for the reference of proteome analysis
└───├blast # BLAST package containing required executables
└───├cdd # Cdd related reference files
| └───├Cdd # Cdd database (Not required for the tutorial)
| └───├cdd_annotation_data # Cdd related annotation file
└───├pdb # All pdb structures as a reference for the tertiary structures
└───├pfam # Pfam annotation data
└───├interpro # InterPro related reference files
| └───├interproscan # Interpro database and InterProScan related tools (Not required for the tutorial)
| └───├interpro_annotation_data # interPro related annotation files
└───├go_mapping # GO related data containing GO anotation for the domains obtained from CDD and InterPro
- Using the
create
subcommand from APRICOT, we will create the analysis folder and its subfolder. This generates a main folderAPRICOT_analysis
(name can be defined by users), which contains subfolders as shown below.
apricot create APRICOT_analysis
APRICOT_analysis
└───├input # Location used by subcommand 'query' to store all the related files
| └───├query_proteins # Location for the list of query proteins
| |
| └───├uniprot_reference_table # Location for storing the reference table downloaded from UniProt
| |
| └───├mapped_query_annotation # All the query proteins that are mapped to at least one UniProt annotations
| | ├fasta_path_mapped_query # Fasta files obtained for all the query proteins, can be used for query fasta files
| | ├xml_path_mapped_query # UniProt xml files obtained for all the query proteins
|
└───├output
└───├0_predicted_domains # Location for the output data obtained from the subcommand 'predict'
└───├1_compiled_domain_information # Location for the output data obtained from the subcommand 'filter'
└───├2_selected_domain_information
└───├3_annotation_scoring # Location for the output data obtained from the subcommand 'annoscore'
└───├4_additional_annotations # Location for additional annotations for the selected
| # queries using subcommand 'addanno'
└───├5_analysis_summary # Location for the output data obtained from the subcommand 'summary'
└───├format_output_data # Location for the output data obtained from the subcommand 'format'
└───├visualization_files # Location for the output data obtained from the subcommand 'vis'
Fetching required source files¶
We have provided a few pre-processed files, which are available in apricot_demo_files folder (retrieved from zenodo), if it is already available, please comment the first three lines in the script, which copies these files to the APRICOT defined paths.
Additionally, we will download domain annotation files from CDD and Pfam databases.
complete_data_for_demo(){
## Here DB_PATH=source_files/reference_db_files and $ANALYSIS_PATH=APRICOT_analysis
zenodo_link_for_demo_data=https://zenodo.org/record/51705/files/APRICOT-1.0-demo_files-MS.zip
wget $zenodo_link_for_demo_data
unzip APRICOT-1.0-demo_files-MS.zip
cp -r APRICOT-1.0-demo_files-MS/apricot_demo_files/go_mapping/* $DB_PATH
cp -r APRICOT-1.0-demo_files-MS/apricot_demo_files/interpro_annotation_data $DB_PATH/interpro
cp APRICOT-1.0-demo_files-MS/apricot_demo_files/cdd_analysis/* $ANALYSIS_PATH/output/0_predicted_domains/cdd_analysis
cp APRICOT-1.0-demo_files-MS/apricot_demo_files/ipr_analysis/* $ANALYSIS_PATH/output/0_predicted_domains/ipr_analysis
## CDD annotation table
wget -c -P $DB_PATH/cdd/cdd_annotation_data ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
gunzip $DB_PATH/cdd/cdd_annotation_data/*
## PfamA annotation table
pfam_release=Pfam30.0
wget -c -P $DB_PATH/pfam ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/$pfam_release/database_files/pfamA.txt.gz
gunzip $DB_PATH'/pfam/pfamA.txt.gz'
}
Providing query proteins using the subcommand query
¶
In this analysis we will identify protein that contains RRM/RNP, which is a family of conserved RNA-binding domains.
Users can provide query proteins using various options. Here we use the
input option as UniProt identifiers (ID) and provide 2 query proteins:
P0A6X3
,P00957
.
The protein ID P0A6X3
is used as an positive example. It represents
the Hfq protein that contains sm and RRM/RNP like domain. The protein ID
P00957
is used as an negative example which represents the alaS that
contains a domain related to tRNA-ligase therefore, it must not be
identified with domains reltaed to RRM/RNP.
## *REQUIRED* INPUT-1: provide comma separated list of UniProt ids##
query_uids='P0A6X3,P00957' # Option-1: UniProt identifier
apricot query --analysis_path APRICOT_analysis --uids $query_uids
Providing keyword inputs using the subcommand keywords
¶
There are two sets of keyword inputs: the first-set is required to selects domains from domain databases and the second set, which is an optional input, should comprise of terms that will classify our results accordingly.
In this turtorial we are interested in identifying RRM
containing
protein, hence, we will use RRM
as one of the terms. Additionally,
we can provide other terms (for example, KH
and DEAD
) to see if
there are domains associated with those terms could be identified in our
query proteins.
For the second set, we have listed few enzymes along with RRM and RNP.
### *REQUIRED* Input-2, comma separated list of keywords for domain selection
domain_kw='RRM,KH,DEAD'
### *OPTIONAL* Input-2, comma separated list of keywords for protein classification based on the predicted domains
class_kw='ribosom,helicase,synthetase,polymerase,transferase,nuclease,RRM,RNP'
apricot keywords --db_root source_files $domain_kw -cl $class_kw
These keywords are listed in the files
keywords_for_domain_selection.txt
and
keywords_for_result_classification.txt
present in the folder
source_files
as shown below.
source_files
└───├domain_data
keywords_for_domain_selection.txt
keywords_for_result_classification.txt
Selecting domains of interest using the subcommand select
¶
APRICOT uses the keywords provided in previous section and uses them to
select functional domains. Users can use -C
flag to select only CDD
related domains or -I
flag to select only InterPro related domains.
In this tutorial we will use the default option, which selects domains from both CDD and InterPro databases.
apricot select --db_root source_files
The selected domains are saved in these respective locations as ashown below.
source_files
└───├domain_data
└───├cdd # Containes individual files generated for each keyword related domains
| key1_related_cdd_domains.tab
| key2_related_cdd_domains.tab
| ...
└───├interpro # Containes individual files generated for each keyword related domains
| key1_related_interpro_domains.tab
| key2_related_interpro_domains.tab
| ...
├all_keyword_selected_domain_data.tab # All the keyword selected domains are mapped and combined
Domain predictions using the subcommand predict
¶
This subcommand uses one of the core modules of APRICOT, which predicts all the possible domains in the query domains.
For this analysis, APRICOT uses both the databases by default, however
single database can be defined by using flag -C
or -I
for CDD
and InterPro respectively. In case users provide the input as fasta
files, the query
subcommand can be skipped and the prediction of
domains can be carried out directly, however in that case the path for
fasta files (FASTA_PATH) can be provided using the option
--fasta $FASTA_PATH
.
APRICOT skips re-prediction of the domains when the output files are
present for the analysis of each query proteins, however users can use
the flag -F
to force the re-analysis.
apricot predict --analysis_path APRICOT_analysis
Please note that we are using default directory structure for this
tutorial, however, it is possible to define locations of the databases
by using options --cdd_db $CDD_PATH
and --ipr_db $INTERPRO_PATH
for CDD and interpro respectively.
In this tutorial, to avoid the use of domain databases and tools (avoid the prediction of domains), we provided the output files generated by domain prediction analysis and copied to the required subfolders as shown below.
APRICOT_analysis
└───├output
└───├0_predicted_domains # Location for storing files generated by domain prediction
└───├cdd_analysis # Domains predicted for each protein using CDD datasets
| P00957.txt
| P0A6X3.txt
|
└───├ipr_analysis # Domains predicted for each protein using InterPro datasets
P00957.tsv
P0A6X3.tsv
Filtering the predicted domains using the subcommand filter
¶
All the domains (predicted by predict
) undergoes a filtering step in
order to select proteins that are predicted with the domains of interest
(selected from databases using select
).
In this tutorial we have used the default parameters (refer
documentation) with their optimal cut-offs (--similarity 24
for 24 %
minimum similarity between the reference and predicted domain and
--coverage 39
for 39 % minimum domain coverage in the query
protein), however users can provide cut-offs for different parameters
(refer documentation).
apricot filter --analysis_path APRICOT_analysis \
--similarity 24 --coverage 39
The files obtained from this analysis are stored in the subfolder
1_compiled_domain_information
in the main analysis folder.
Additionally, the information of the proteins obtained from uniProt are
combined and store in the folder combined_data. These file location are
as shown below.
APRICOT_analysis
└───├output
└───├1_compiled_domain_information # Formatted flat files containg domain information
| └───├selected_data # Files containing proteins that contain domains of interest with the predcited domains
| | cdd_filtered.csv # Proteins containing CDD domains of interest
| | cdd_filtered_id.csv # Proteins IDs containing CDD domains IDs of interest
| | ipr_filtered.csv # Proteins containing InterPro domains of interest
| | ipr_filtered_id.csv # Proteins IDs containing InterPro domains IDs of interest
| |
| └───├unfiltered_data # All the domains predicted in the query proteins (unfiltered)
| cdd_unfiltered_all_prediction.csv
| ipr_unfiltered_all_prediction.csv
|
└───├2_selected_domain_information
└───├combined_data # Annotation extended for the selected proteins
annotation_extended_for_selected.csv
Classify all the selected domains from previous analysis using the subcommand classify
¶
All the selected proteins with their domains (selected by filter
)
are classified into smaller subsets to help navigating the output files.
This classification uses the keywords provided by users, which are
either explicitely defined (-cl
flag in keywords
) or are used
for the domain selection.
apricot classify --analysis_path APRICOT_analysis
The list of proteins that are selected based on the domains of interest
is classified when the annotations contain one of the terms. The
classified files are stored in the subfolder
2_selected_domain_information
as shown below.
APRICOT_analysis
└───├output
└───├2_selected_domain_information # Selected data classified into smaller subsets based on the keyword input
└───├classified_data
RNP_selected_data.csv
RRM_selected_data.csv
Calculating annotation scores for the selected domains using the subcommand annoscore
¶
This subcommand uses another important module of APRICOT to calculate annotation-based scores for each predicted domains in the query proteins. Please refer documentation to understand different sets of features, which have been used in APRICOT for the scoring of the predicted domains with respect to their reference consensus.
This module require python packages: numpy and scipy, and EMBOSS suit to
run Needleman Wunsch pairwise-alignment analysis. If not installed,
please install it using the function get_emboss
. For basic
demonstration, we suggest you to skip this module as the configuration
and installation of EMBOSS suite takes time. However, if installed
already, please edit the path name NEEDLE_EMBOSS_PATH
.
NEEDLE_EMBOSS_PATH=/home/emboss/needle # Default path
apricot annoscore --analysis_path APRICOT_analysis --needle_dir $NEEDLE_EMBOSS_PATH
The files generated from this analysis are stored in the subfolder
3_annotation_scoring in the analysis
folder as shown below.
APRICOT_analysis
└───├output
└───├2_selected_domain_information
Generating analysis summary using the subcommand: summary
¶
Users can summarize the analysis result using this module. The summary file contains an overview of the entire analysis that includes, for example, the query proteins mapped to UniPro, total selected domains per keyword, summary of domain predictions and their selection.
apricot summary --analysis_path APRICOT_analysis
The summary file is stored in the subfolder 5_analysis_summary
analysis folder as shown below.
APRICOT_analysis
└───├output
└───├5_analysis_summary
APRICOT_analysis_summary.csv # Summary file
Format output files using the subcommand format
¶
APRICOT by default produces output files in comma-separated values
(.csv). Users can convert these files to HTML format using -HT
flag
or excel format (.xlsx) using -XL
flag, where the later one uses the
openpyxl
python module.
In this tutorial we have used -HT
option.
apricot format --analysis_path APRICOT_analysis -HT
All the files in the format of selection are stored in the subfolder
format_output_data
in the main analysis folder as shown below.
APRICOT_analysis
└───├output
└───├format_output_data
└───├html_files # Output files for each folder in HTML format
| └───├0_predicted_domains
| └───├1_compiled_domain_information
| └───├2_selected_domain_information
| └───├3_annotation_scoring
| └───├4_additional_annotations
| └───├5_analysis_summary
|
└───├excel_files # Output files for each folder in Excel format
└───├ ...
All of the abovementioned modules can be run in an automated streamlined manner using the provided shell script as shown below.
$ sh run_example.sh
This concludes the tutorial for the analysis conducted by APRICOT.