Once BLAST results are imported in the database using mdb-import-BLAST-alignments (see mdb-import-BLAST-alignments: Importing BLAST alignment results) the following examples will show you how to proceed to retrieve the best hits of any given sequence.
To understand those examples you need to know the difference between internal and external hits by reading the documentation of the mdb-import-BLAST-alignments tool.
Any alignment between a query and a hit sequence will result in the following properties being stored:
| Property | Value |
|---|---|
| type | Type of relationship (always ‘similar-to’) |
| run.date.year | Year the BLAST run was completed |
| run.date.month | Month the BLAST run was completed |
| run.date.day | Day the BLAST run was completed |
| run.algorithm.name | Name of the algorithm (e.g., ‘BLASTX’) |
| run.algorithm.version | Version of the algorithm (e.g., ‘2.2.22+’) |
| run.algorithm.parameters.expect | Expect value cutoff (BLAST -e option) |
| run.algorithm.parameters.matrix | Matrix used (BLAST -M option) |
| run.algorithm.parameters.gap_open | Gap existence cost (BLAST -G option) |
| run.algorithm.parameters.gap_extend | Gap extension cost (BLAST -E option) |
| run.algorithm.parameters.sc_match | Match score for nucleotide-nucleotide comparison (BLAST -r option) |
| run.algorithm.parameters.sc_mismatch | Mismatch penalty for nucleotide-nucleotide comparison (BLAST -r option) |
| run.algorithm.parameters.filter | Filtering options (BLAST -F option) |
| run.database.name | Database name |
| run.database.number_of_sequences | Database size, as the number of sequences |
| run.database.number_of_letters | Database size, as the number of letters |
| score.fraction_identical | Percent of identities in the HSP |
| score.fraction_conserved | Percent of positive substitutions in the HSP |
| score.e_value | Except value of the HSP |
| score.gaps | Number of gaps in the HSP |
| alignment.source_coordinates | Coordinates of the alignment in the query sequence |
| alignment.source | Alignment string for the query |
| alignment.match | Formatting middle line |
| alignment.target | Alignment string for the hit |
| alignment.target_coordinates | Coordinates of the alignment in the hit sequence |
Alignments against external hits will also include the following:
| Property | Value |
|---|---|
| hit.name | Identifier of the hit |
| hit.description | Description of the hit |
| hit.length | Length of the hit |
Alignments against internal hits are stored as properties of the relationship between the query and hit.
Alignments against external hits are stored as a list under the property alignments of the query sequence.
Alignments against internal hits are stored as annotations of the relationship between the query and each hit:
# we retrieve a sequence as our query
query = mdb.Sequence.find_one({"name": "my query sequence 1"})
# hits will be represented as related sequences; they can be accessed
# by listing all sequences 'query' relates to. As we are interested in
# alignments only, we filter those neighboring sequences to those with
# a 'similar-to' type of relationship:
for hit in query.list_related_sequences(mdb.Direction.OUTGOING, {}, {"type": "similar-to"}):
# once a hit sequence located, each HSP is represented as a
# distinct annotated relationship between 'query' and 'hit':
for hsp in query.list_relationships_with(hit):
# we print the hit name and the HSP E-value
print "\nHit: '%s' (E-value: %.2g)" % (hit["name"], hsp["score"]["e_value"])
# we retrieve the HSP sequences and coordinates
print "%s %s" % (hsp["alignment"]["source"], hsp["alignment"]["source_coordinates"])
print hsp["alignment"]["match"]
print "%s %s" % (hsp["alignment"]["target"], hsp["alignment"]["target_coordinates"])
Alignments against external hits are stored as a alignments property of the query sequence:
# we retrieve a sequence as our query, ensuring it does have a 'alignments' property
query = mdb.Sequence.find_one({"alignments": {"$exists": True}})
print "Query: '%s'" % query["name"]
# for each HSP this query have with external hits,
for hsp in query["alignments"]:
# we print the hit name and the HSP E-value
print "\nHit: '%s' (E-value: %.2g)" % (hsp["hit"]["name"], hsp["score"]["e_value"])
# we retrieve the HSP sequences and coordinates
print "%s %s" % (hsp["alignment"]["source"], hsp["alignment"]["source_coordinates"])
print hsp["alignment"]["match"]
print "%s %s" % (hsp["alignment"]["target"], hsp["alignment"]["target_coordinates"])
Output (excerpt):
Hit: 'gi|291327153|ref|ZP_06574216.1|' (E-value: 0.00018)
MSAFSLPIPPRSLATPLHQLRERSAT--------ALSSPKLRWAA*APLHFPRRIA*PVSCYAFFK [182, 355]
MSAF+L IPP L LH+L ERS T A S L APLH PRR PVS YAFFK
MSAFALLIPPACLTAHLHRLTERSPTQQYLHIAAAASVHSL-----APLHLPRRPTRPVSYYAFFK [1, 61]
Hit: 'gi|6460058|gb|AAF11800.1|AE002057_8' (E-value: 0.00018)
TWFPSTTPFGLALGAG*PCAD*LYAGTLGLSARGSLTLFVATHVSILTSHTSTESRDSASPA*GTLRYRSF----EPEASVGGLSPVTF [228, 482]
T PS PFGL LG P AD GTL L+A+ LT F+ TH I TS ST ASP R E ASV LSP F
TCCPSAAPFGLTLGPDFPWADDPSPGTLVLTAKKILTSFIVTHAGIRTSVGSTTPSGMASPRTERSPTRQLASRVESIASVDHLSPDHF [2, 90]
Hit: 'gi|6460058|gb|AAF11800.1|AE002057_8' (E-value: 0.0026)
PRSLATPLHQLRERSATALSSPKLRWAA*APLHFPRRIA*PVSCYAFFK*WLLLSQHPGC [149, 328]
PR+ +P QL R + S L +P HF R + PVS YA F+ WLLLSQ PGC
PRTERSPTRQLASRVESIASVDHL-----SPDHFRRIVTRPVSYYALFEGWLLLSQPPGC [62, 116]
Note
The major difference between iterating through internal and external hits is the way multiple HSPs between a query and a hit are represented. For internal hits HSPs are represented as multiple relationships between a query and hit object. For external hits HSPs are represented as multiple entries in the alignments property of the query object. Another difference is, for the later case, the presence of the hit.name, hit.description and hit.length properties.
To search for the best external hit of each query sequence among all BLAST runs that have been imported you can use the following code:
# first, we select the collection that contains our queries
queries = mdb.Collection.find_one({"name": "..."})
# for each query sequence in 'queries' that have a 'alignments'
# property (meaning it have been annotated with BLAST results against
# sequences external to the database),
for query in queries.list_sequences({"alignments": {"$exists": True}}):
# we set an initial best E-value to an absurd high value
# and prepare 'best_hit' to receive the best hit (and possible ties)
best_evalue = 100
best_hit = []
# then, for each HSP against external hits (regardless of the BLAST run),
for hsp in query["alignments"]:
# we retrieve the E-value
evalue = hsp["score"]["e_value"]
# if this E-value is better (i.e., lower) than the
# previously known best E-value we store the hit as
# the new best one
if (evalue < best_evalue):
# if true, we store this new best hit
best_hit = [hsp]
best_evalue = evalue
# if the E-value is the same as the best one known
# so far, we store this hit as a tie
elif (evalue == best_evalue):
best_hit.append(hsp)
# at this stage, 'best_hit' contains the best hit (or list
# of best hits in case of ties) for the sequence in 'query':
print "Best hit(s) for %s (with E-value %.2g):" % (query, best_evalue)
print best_hit["hit"]["name"]
If you want to distinguish between BLAST runs the code becomes:
import pickle
# first, we select the collection that contains our queries
queries = mdb.Collection.find_one({"name": "..."})
# for each query sequence in 'queries' with known external hits,
for query in queries.list_sequences({"alignments": {"$exists": True}}):
# we create two dictionaries which, for each BLAST
# run, will store the best hit and E-value
best_hit = {}
best_evalue = {}
# for each HSP against an external hit,
for hsp in query["alignments"]:
# we create a BLAST run signature; i.e., a value that is
# guaranteed to be uniquely associated to each run. This
# signature will be used to keep track of the best hits
# for each run
run_signature = pickle.dumps(hsp["run"])
# if we see this BLAST run for the first time, we set
# its best hit to dummy values
if (not run_signature in best_hit):
best_hit[run_signature] = []
best_evalue[run_signature] = 100
evalue = hsp["score"]["e_value"]
# if the E-value is lower than the previously known
# best one, we store the new best hit for this run
if (evalue < best_evalue[run_signature]):
best_hit[run_signature] = [hsp]
best_evalue[run_signature] = evalue
# in case of tie, we store the additional best hit
elif (evalue == best_evalue[run_signature]):
best_hit[run_signature].append(hsp)
# we can now print the best hit for each one of the BLAST
# this query sequence has be run through:
for run_signature in best_hit:
run = pickle.loads(run_signature)
print "Best hit(s) for %s according to run %s" % (query, run)
print best_hit[run_signature]["hit"]["name"], best_evalue[run_signature]
If you are interested in the n best hits (instead of just the best one) across all BLAST runs the code becomes:
# number of best hits we want to keep
N = 10
# for each query sequence with known external hits,
for query in queries.list_sequences({"alignments": {"$exists": True}}):
best_hits = [[]]
best_evalues = [100]
# for each HSP against an external hit,
for hsp in query["alignments"]:
evalue = hsp["score"]["e_value"]
# if the E-value of this HSP is lower than the worst
# E-value among the N best hits stored in 'best_hits',
if (evalue < max(best_evalues)):
# we add this hit to the list of best ones
best_hits.append([hsp])
best_evalues.append(evalue)
# if the E-value of this HSP is already part of the
# best E-values,
elif (evalue in best_evalues):
# we find the one hit among the best ones
# with the same E-value as the current hit,
p = None
for i in range(len(best_evalues)):
if (best_evalues[i] == evalue):
p = i
break
# then we store the current hit as a tie
best_hits[p].append(hsp)
# if the list of best hits is bigger than N,
if (len(best_hits) > N):
# we create a combined list of the best hits
# and their E-values
best = zip(best_hits, best_evalues)
# and sort this list by increasing E-value
best.sort(lambda x, y: cmp(x[1], y[1]))
# then we remove all hits above the Nth
best = list[:N]
# finally, we repopulate 'best_hits' and
# 'best_evalues' with the remainder
best = [list(l) for l in zip(*best)]
best_hits, best_evalues = best
print "Best %s hits for %s:" % (N, query)
for best_hit, best_evalue in zip(best_hits, best_evalues):
print best_hit["hit"]["name"], best_evalue
As above, the code has to be modified if you want to distinguish between BLAST runs:
N = 10
for query in queries.list_sequences({"alignments": {"$exists": True}}):
best_hits = {}
best_evalues = {}
for hsp in query["alignments"]:
run_signature = pickle.dumps(hsp["run"])
if (not run_signature in best_hits):
best_hits[run_signature] = [[]]
best_evalues[run_signature] = [100]
evalue = hit["score"]["e_value"]
if (evalue < max(best_evalues[run_signature])):
best_hits[run_signature].append([hsp])
best_evalues[run_signature].append(evalue)
elif (evalue in best_evalues[run_signature]):
p = None
for i in range(len(best_evalues[run_signature])):
if (best_evalues[run_signature][i] == evalue):
p = i
break
best_hits[run_signature][p].append(hsp)
if (len(best_hits[run_signature]) > N):
best = zip(best_hits[run_signature], best_evalues[run_signature])
best.sort(lambda x, y: cmp(x[1], y[1]))
best = best[:N]
best = [list(l) for l in zip(*best)]
best_hits[run_signature], best_evalues[run_signature] = best
for run_signature in best_hits:
run = pickle.loads(run_signature)
print "Best %s hits for %s according to run %s:" % (N, query, run)
for best_hit, best_evalue in zip(best_hits[run_signature], best_evalues[run_signature]):
print best_hit, best_evalue