|Latest Version| |License| |DOI| |image3| |image4|

Tutorial
========

This tutorial provides an easy way to test the different modules of APRICOT
and to understand the basic usage of the pipeline.

Requirements:
~~~~~~~~~~~~~

1) Get the APRICOT software

The software can be used via a `Docker container <./docker.html>`__ or
installed locally as explained `here <./installation.html>`__.

2) Get the shell script for running the software

Use wget or curl to download the script ``run_example.sh`` inside the Docker
container (when working with Docker) or locally (when using the locally
installed software). The script will assist you in following the tutorial.

::

    $ wget https://raw.githubusercontent.com/malvikasharan/APRICOT/master/shell_scripts/run_example.sh

OR

::

    $ curl -O https://raw.githubusercontent.com/malvikasharan/APRICOT/master/shell_scripts/run_example.sh

Not required for this tutorial
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

APRICOT requires `supporting data <./docker.html>`__ in order to analyse the
queries provided by the user. In addition, to annotate the proteins with
further biological features, it uses locally installed Psort and RaptorX
executables. These requirements are discussed in detail in the
`documentation <./software_requirements.html>`__.

We will **NOT** require those for this tutorial; instead, we have provided
pre-processed files that allow you to test the functionality of the tool.
The test files are available as a Zenodo record (the script downloads them
from there).

Example analysis
~~~~~~~~~~~~~~~~

APRICOT can be executed with Python (python3 or up) on Unix flavors
(GNU/Linux, BSD, OS X).

We will go through the shell script ``run_example.sh`` step by step. This
script assumes that users have installed the software globally using pip.
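
Because the script assumes a global pip installation, it can help to check first whether the ``apricot`` command is on the ``PATH``. The following is a minimal sketch and not part of ``run_example.sh``: the helper name ``resolve_apricot_call`` is hypothetical, and the fallback assumes a local git checkout located at ``$APRICOT_PATH`` (``APRICOT`` if the variable is unset).

```shell
# Hypothetical helper, not part of run_example.sh.
# Prints the command that should be used to invoke APRICOT.
resolve_apricot_call() {
    if command -v apricot >/dev/null 2>&1; then
        # A globally installed 'apricot' executable was found on PATH.
        echo "apricot"
    else
        # Assumption: a local git checkout lives in $APRICOT_PATH.
        echo "python3 ${APRICOT_PATH:-APRICOT}/bin/apricot"
    fi
}

APRICOT_CALL=$(resolve_apricot_call)
echo "APRICOT will be invoked as: $APRICOT_CALL"
```

The resulting ``$APRICOT_CALL`` value can then be used in place of the bare ``apricot`` command in the steps of this tutorial.
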

Check the subcommands:

::

    apricot -h

or

**If you are using a locally installed git repository**

Please specify the path where the APRICOT modules are located (see
``APRICOT_PATH`` below) and run the tool using
``python $APRICOT_PATH/bin/apricot``, by editing the script to
``APRICOT_CALL="python3 ../bin/apricot"`` (which has the advantage that you
can specify the Python version you would like to use), or using the
``./$APRICOT_PATH/bin/apricot`` command.

::

    python $APRICOT_PATH/bin/apricot -h

If you get an error message such as
``system_test.sh: 164: system_test.sh: ../bin/apricot: Permission denied ...``,
please open the script ``run_example.sh`` and change
``APRICOT_CALL="../bin/apricot"`` to ``APRICOT_CALL="python3 ../bin/apricot"``
(instead of ``python3`` you can use ``python`` or any specific version of
Python, or use ``./`` instead).

Defining paths
^^^^^^^^^^^^^^

Users can set the following paths:

::

    ## User defined PATHS (default paths are given)

    # Path where the analysis data by APRICOT will be stored
    ANALYSIS_PATH=APRICOT_analysis
    # Path where the source files and databases will be stored
    ROOT_DB_PATH=source_files
    # Path for EMBOSS suite with locally installed needle command-line tool
    NEEDLE_EMBOSS_PATH=/home/emboss/needle # Default path
    # Path where APRICOT modules are located; by default we use APRICOT as
    # the name of the git library
    APRICOT_PATH=APRICOT

APRICOT stores databases and source files in these paths.

::

    ## FIXED PATHS

    # Source data downloaded by APRICOT
    DB_PATH=$ROOT_DB_PATH/reference_db_files
    ## PATHS for domain databases
    CDD_PATH=$DB_PATH/cdd/Cdd
    INTERPRO_PATH=$DB_PATH/interpro/interproscan

Creating analysis folders
^^^^^^^^^^^^^^^^^^^^^^^^^

1) We will create each of the paths defined in the section above:

::

    donwload_files(){
        mkdir -p $ROOT_DB_PATH $APRICOT_PATH $ANALYSIS_PATH $APRICOT_LIBRARY
        for db_subpath in cdd go_mapping interpro pfam
        do
            mkdir -p $DB_PATH/$db_subpath
        done
        [...]
    }

This will create a main folder ``source_files`` with subfolders as shown below.
::

    source_files
        └───├domain_data  # Location for the files containing keywords for domain selection and subsequently selected domains
        |
        └───├emboss  # EMBOSS package containing needle software
        |
        └───├reference_db_files
            └───├all_taxid  # Taxonomy ids for the reference of proteome analysis
            └───├blast  # BLAST package containing required executables
            └───├cdd  # Cdd related reference files
            |   └───├Cdd  # Cdd database (Not required for the tutorial)
            |   └───├cdd_annotation_data  # Cdd related annotation file
            └───├pdb  # All pdb structures as a reference for the tertiary structures
            └───├pfam  # Pfam annotation data
            └───├interpro  # InterPro related reference files
            |   └───├interproscan  # InterPro database and InterProScan related tools (Not required for the tutorial)
            |   └───├interpro_annotation_data  # InterPro related annotation files
            └───├go_mapping  # GO related data containing GO annotation for the domains obtained from CDD and InterPro

2) Using the ``create`` subcommand of APRICOT, we will create the analysis
folder and its subfolders. This generates a main folder ``APRICOT_analysis``
(the name can be defined by users), which contains subfolders as shown below.
::

    apricot create APRICOT_analysis

::

    APRICOT_analysis
        └───├input  # Location used by subcommand 'query' to store all the related files
        |   └───├query_proteins  # Location for the list of query proteins
        |   |
        |   └───├uniprot_reference_table  # Location for storing the reference table downloaded from UniProt
        |   |
        |   └───├mapped_query_annotation  # All the query proteins that are mapped to at least one UniProt annotation
        |   |   ├fasta_path_mapped_query  # Fasta files obtained for all the query proteins, can be used for query fasta files
        |   |   ├xml_path_mapped_query  # UniProt xml files obtained for all the query proteins
        |
        └───├output
            └───├0_predicted_domains  # Location for the output data obtained from the subcommand 'predict'
            └───├1_compiled_domain_information  # Location for the output data obtained from the subcommand 'filter'
            └───├2_selected_domain_information
            └───├3_annotation_scoring  # Location for the output data obtained from the subcommand 'annoscore'
            └───├4_additional_annotations  # Location for additional annotations for the selected
            |                              # queries using subcommand 'addanno'
            └───├5_analysis_summary  # Location for the output data obtained from the subcommand 'summary'
            └───├format_output_data  # Location for the output data obtained from the subcommand 'format'
            └───├visualization_files  # Location for the output data obtained from the subcommand 'vis'

Fetching required source files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We have provided a few pre-processed files, which are available in the
*apricot\_demo\_files* folder (retrieved from Zenodo). If this folder is
already available, please comment out the first three lines in the script,
which copies these files to the paths defined by APRICOT. Additionally, we
will download domain annotation files from the CDD and Pfam databases.
::

    complete_data_for_demo(){
        ## Here DB_PATH=source_files/reference_db_files and $ANALYSIS_PATH=APRICOT_analysis
        zenodo_link_for_demo_data=https://zenodo.org/record/51705/files/APRICOT-1.0-demo_files-MS.zip
        wget $zenodo_link_for_demo_data
        unzip APRICOT-1.0-demo_files-MS.zip
        cp -r APRICOT-1.0-demo_files-MS/apricot_demo_files/go_mapping/* $DB_PATH
        cp -r APRICOT-1.0-demo_files-MS/apricot_demo_files/interpro_annotation_data $DB_PATH/interpro
        cp APRICOT-1.0-demo_files-MS/apricot_demo_files/cdd_analysis/* $ANALYSIS_PATH/output/0_predicted_domains/cdd_analysis
        cp APRICOT-1.0-demo_files-MS/apricot_demo_files/ipr_analysis/* $ANALYSIS_PATH/output/0_predicted_domains/ipr_analysis

        ## CDD annotation table
        wget -c -P $DB_PATH/cdd/cdd_annotation_data ftp://ftp.ncbi.nih.gov/pub/mmdb/cdd/cddid.tbl.gz
        gunzip $DB_PATH/cdd/cdd_annotation_data/*

        ## PfamA annotation table
        pfam_release=Pfam30.0
        wget -c -P $DB_PATH/pfam ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/$pfam_release/database_files/pfamA.txt.gz
        gunzip $DB_PATH'/pfam/pfamA.txt.gz'
    }

Providing query proteins using the subcommand ``query``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this analysis we will identify proteins that contain RRM/RNP, a family of
conserved RNA-binding domains. Users can provide query proteins using various
options. Here we use UniProt identifiers (IDs) as the input option and
provide two query proteins: ``P0A6X3`` and ``P00957``.

The protein ID ``P0A6X3`` is used as a positive example. It represents the
Hfq protein, which contains an Sm and RRM/RNP-like domain. The protein ID
``P00957`` is used as a negative example. It represents alaS, which contains
a domain related to tRNA ligase and therefore must not be identified with
domains related to RRM/RNP.
::

    ## *REQUIRED* INPUT-1: provide comma separated list of UniProt ids##
    query_uids='P0A6X3,P00957' # Option-1: UniProt identifier

    apricot query --analysis_path APRICOT_analysis --uids $query_uids

Providing keyword inputs using the subcommand ``keywords``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

There are two sets of keyword inputs. The first set is required and is used
to select domains from the domain databases. The second set is optional and
should comprise terms by which the results will be classified. In this
tutorial we are interested in identifying ``RRM``-containing proteins; hence,
we will use ``RRM`` as one of the terms. Additionally, we can provide other
terms (for example, ``KH`` and ``DEAD``) to see whether domains associated
with those terms can be identified in our query proteins. For the second set,
we have listed a few enzymes along with RRM and RNP.

::

    ### *REQUIRED* Input-2, comma separated list of keywords for domain selection
    domain_kw='RRM,KH,DEAD'
    ### *OPTIONAL* Input-2, comma separated list of keywords for protein classification based on the predicted domains
    class_kw='ribosom,helicase,synthetase,polymerase,transferase,nuclease,RRM,RNP'

    apricot keywords --db_root source_files $domain_kw -cl $class_kw

These keywords are listed in the files ``keywords_for_domain_selection.txt``
and ``keywords_for_result_classification.txt`` present in the folder
``source_files`` as shown below.

::

    source_files
        └───├domain_data
            keywords_for_domain_selection.txt
            keywords_for_result_classification.txt

Selecting domains of interest using the subcommand ``select``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

APRICOT uses the keywords provided in the previous section to select
functional domains. Users can use the ``-C`` flag to select only CDD-related
domains or the ``-I`` flag to select only InterPro-related domains.
In this tutorial we will use the default option, which selects domains from
both the CDD and InterPro databases.

::

    apricot select --db_root source_files

The selected domains are saved in their respective locations as shown below.

::

    source_files
        └───├domain_data
            └───├cdd  # Contains individual files generated for each keyword related domains
            |   key1_related_cdd_domains.tab
            |   key2_related_cdd_domains.tab
            |   ...
            └───├interpro  # Contains individual files generated for each keyword related domains
            |   key1_related_interpro_domains.tab
            |   key2_related_interpro_domains.tab
            |   ...
            ├all_keyword_selected_domain_data.tab  # All the keyword selected domains are mapped and combined

Domain predictions using the subcommand ``predict``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This subcommand uses one of the core modules of APRICOT, which predicts all
the possible domains in the query proteins. For this analysis, APRICOT uses
both databases by default; however, a single database can be selected by
using the flag ``-C`` or ``-I`` for CDD and InterPro respectively. If users
provide the input as fasta files, the ``query`` subcommand can be skipped and
the prediction of domains can be carried out directly. In that case, the path
to the fasta files (FASTA\_PATH) is provided using the option
``--fasta $FASTA_PATH``. APRICOT skips re-prediction of the domains when the
output files for a query protein are already present; however, users can use
the flag ``-F`` to *force* the re-analysis.

::

    apricot predict --analysis_path APRICOT_analysis

Please note that we are using the default directory structure for this
tutorial; however, it is possible to define the locations of the databases by
using the options ``--cdd_db $CDD_PATH`` and ``--ipr_db $INTERPRO_PATH`` for
CDD and InterPro respectively.
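
Since APRICOT skips re-prediction when output files are already present, it can be useful to see which query proteins already have prediction files before running ``predict``. The sketch below is only an illustration: the helper ``has_predictions`` is hypothetical and not part of APRICOT, it assumes the default directory layout of this tutorial and the file naming of the demo files (``<ID>.txt`` for CDD and ``<ID>.tsv`` for InterPro), and it does not claim to reproduce the check that APRICOT performs internally.

```shell
# Hypothetical helper, not part of APRICOT.
# $1: analysis folder, $2: UniProt ID of a query protein.
# Succeeds only if both the CDD and the InterPro prediction files exist.
has_predictions() {
    [ -f "$1/output/0_predicted_domains/cdd_analysis/$2.txt" ] &&
        [ -f "$1/output/0_predicted_domains/ipr_analysis/$2.tsv" ]
}

ANALYSIS_PATH=${ANALYSIS_PATH:-APRICOT_analysis}
for uid in P0A6X3 P00957; do
    if has_predictions "$ANALYSIS_PATH" "$uid"; then
        echo "$uid: prediction files found"
    else
        echo "$uid: prediction files missing"
    fi
done
```

If a query is reported as missing, either copy the demo files into place or run ``apricot predict`` for it.
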

In this tutorial, to avoid the use of domain databases and tools (that is, to
skip the prediction of domains), we provide the output files generated by the
domain prediction analysis, which are copied to the required subfolders as
shown below.

::

    APRICOT_analysis
        └───├output
            └───├0_predicted_domains  # Location for storing files generated by domain prediction
                └───├cdd_analysis  # Domains predicted for each protein using CDD datasets
                |   P00957.txt
                |   P0A6X3.txt
                |
                └───├ipr_analysis  # Domains predicted for each protein using InterPro datasets
                    P00957.tsv
                    P0A6X3.tsv

Filtering the predicted domains using the subcommand ``filter``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All the domains predicted by ``predict`` undergo a filtering step in order to
select proteins that are predicted to contain the domains of interest
(selected from the databases using ``select``). In this tutorial we use the
default parameters with their optimal cut-offs (``--similarity 24`` for 24 %
minimum similarity between the reference and the predicted domain, and
``--coverage 39`` for 39 % minimum domain coverage in the query protein).
Users can provide their own cut-offs for these and other parameters (refer to
the documentation).

::

    apricot filter --analysis_path APRICOT_analysis \
        --similarity 24 --coverage 39

The files obtained from this analysis are stored in the subfolder
``1_compiled_domain_information`` in the main analysis folder. Additionally,
the information on the proteins obtained from UniProt is combined and stored
in the folder ``combined_data``. These file locations are shown below.
::

    APRICOT_analysis
        └───├output
            └───├1_compiled_domain_information  # Formatted flat files containing domain information
            |   └───├selected_data  # Files containing proteins that contain domains of interest with the predicted domains
            |   |   cdd_filtered.csv  # Proteins containing CDD domains of interest
            |   |   cdd_filtered_id.csv  # Proteins IDs containing CDD domains IDs of interest
            |   |   ipr_filtered.csv  # Proteins containing InterPro domains of interest
            |   |   ipr_filtered_id.csv  # Proteins IDs containing InterPro domains IDs of interest
            |   |
            |   └───├unfiltered_data  # All the domains predicted in the query proteins (unfiltered)
            |       cdd_unfiltered_all_prediction.csv
            |       ipr_unfiltered_all_prediction.csv
            |
            └───├2_selected_domain_information
                └───├combined_data  # Annotation extended for the selected proteins
                    annotation_extended_for_selected.csv

Classify all the selected domains from previous analysis using the subcommand ``classify``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

All the selected proteins with their domains (selected by ``filter``) are
classified into smaller subsets to help in navigating the output files. This
classification uses the keywords provided by users, which are either
explicitly defined (``-cl`` flag in ``keywords``) or are the ones used for
the domain selection.

::

    apricot classify --analysis_path APRICOT_analysis

A protein selected based on the domains of interest is assigned to a class
when its annotations contain one of the terms. The classified files are
stored in the subfolder ``2_selected_domain_information`` as shown below.
::

    APRICOT_analysis
        └───├output
            └───├2_selected_domain_information  # Selected data classified into smaller subsets based on the keyword input
                └───├classified_data
                    RNP_selected_data.csv
                    RRM_selected_data.csv

Calculating annotation scores for the selected domains using the subcommand ``annoscore``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This subcommand uses another important module of APRICOT to calculate
annotation-based scores for each predicted domain in the query proteins.
Please refer to the documentation to understand the different sets of
features that APRICOT uses to score the predicted domains with respect to
their reference consensus. This module requires the Python packages numpy and
scipy, and the EMBOSS suite to run the Needleman-Wunsch pairwise alignment
analysis. If EMBOSS is not installed, please install it using the function
``get_emboss``. For a basic demonstration, we suggest that you skip this
module, as the configuration and installation of the EMBOSS suite takes time.
However, if it is already installed, please edit the path
``NEEDLE_EMBOSS_PATH``.

::

    NEEDLE_EMBOSS_PATH=/home/emboss/needle # Default path
    apricot annoscore --analysis_path APRICOT_analysis --needle_dir $NEEDLE_EMBOSS_PATH

The files generated from this analysis are stored in the subfolder
``3_annotation_scoring`` in the analysis folder as shown below.

::

    APRICOT_analysis
        └───├output
            └───├3_annotation_scoring

Generating analysis summary using the subcommand: ``summary``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Users can summarize the analysis results using this module. The summary file
contains an overview of the entire analysis, including, for example, the
query proteins mapped to UniProt, the total number of selected domains per
keyword, and a summary of the domain predictions and their selection.
::

    apricot summary --analysis_path APRICOT_analysis

The summary file is stored in the subfolder ``5_analysis_summary`` of the
analysis folder as shown below.

::

    APRICOT_analysis
        └───├output
            └───├5_analysis_summary
                APRICOT_analysis_summary.csv  # Summary file

Format output files using the subcommand ``format``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default, APRICOT produces output files in comma-separated values format
(.csv). Users can convert these files to HTML format using the ``-HT`` flag
or to Excel format (.xlsx) using the ``-XL`` flag, where the latter uses the
``openpyxl`` Python module. In this tutorial we use the ``-HT`` option.

::

    apricot format --analysis_path APRICOT_analysis -HT

All the files in the selected format are stored in the subfolder
``format_output_data`` in the main analysis folder as shown below.

::

    APRICOT_analysis
        └───├output
            └───├format_output_data
                └───├html_files  # Output files for each folder in HTML format
                |   └───├0_predicted_domains
                |   └───├1_compiled_domain_information
                |   └───├2_selected_domain_information
                |   └───├3_annotation_scoring
                |   └───├4_additional_annotations
                |   └───├5_analysis_summary
                |
                └───├excel_files  # Output files for each folder in Excel format
                    └───├ ...

All of the above-mentioned modules can be run in an automated, streamlined
manner using the provided shell script as shown below.

::

    $ sh run_example.sh

This concludes the tutorial for the analysis conducted by APRICOT.

.. |Latest Version| image:: https://img.shields.io/pypi/v/bio-apricot.svg
   :target: https://pypi.python.org/pypi/bio-apricot/
.. |License| image:: https://img.shields.io/pypi/l/bio-apricot.svg
   :target: https://pypi.python.org/pypi/bio-apricot/
.. |DOI| image:: https://zenodo.org/badge/21283/malvikasharan/APRICOT.svg
   :target: https://zenodo.org/badge/latestdoi/21283/malvikasharan/APRICOT
.. |image3| image:: https://images.microbadger.com/badges/image/malvikasharan/apricot.svg
   :target: https://microbadger.com/images/malvikasharan/apricot
.. |image4| image:: https://raw.githubusercontent.com/malvikasharan/APRICOT/master/APRICOT_logo.png
   :target: http://malvikasharan.github.io/APRICOT/