Gene name matching (gene)
Genes can have multiple aliases. When we combine data from different sources, for example expression data with GO gene sets, we have to match gene aliases representing the same genes. All implemented matching methods are based on sets of gene aliases.
Gene matchers in this module match genes to a user-specified set of target gene names. For gene matching, initialize a gene matcher (Matcher), set the target gene names with set_targets, and then match with the match or umatch functions. The following example (genematch1.py) matches gene names to NCBI gene IDs:
import orangecontrib.bio.gene

# matching targets are NCBI gene IDs
targets = orangecontrib.bio.gene.NCBIGeneInfo("Homo sapiens").keys()

gm = orangecontrib.bio.gene.GMNCBI("9606")
gm.set_targets(targets)

for gene in [ "cct7", "pls1", "gdi1", "nfkb2", "dlg7" ]:
    print('Gene ' + gene + ' is NCBI gene ' + str(gm.umatch(gene)))
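When many names are matched at once, it can help to separate the names that matched uniquely from those that did not. The following is a minimal sketch that relies only on the umatch method described in this section. The helper match_all and the StubMatcher class are hypothetical and not part of the module; the stub only stands in for a real matcher so that the example runs without downloading any data, and its IDs are placeholder values.

```python
def match_all(matcher, genes):
    """Split genes into uniquely matched names and all the others."""
    matched = {}
    unmatched = []
    for gene in genes:
        target = matcher.umatch(gene)
        if target is None:
            # umatch returns None for no match and for multiple matches
            unmatched.append(gene)
        else:
            matched[gene] = target
    return matched, unmatched


class StubMatcher:
    """Hypothetical stand-in for a gene matcher; only umatch is provided."""

    def __init__(self, table):
        self._table = table

    def umatch(self, gene):
        return self._table.get(gene.lower())


# placeholder IDs, used for illustration only
stub = StubMatcher({"cct7": "10574", "pls1": "5357"})
matched, unmatched = match_all(stub, ["cct7", "pls1", "dlg7"])
print(matched)
print(unmatched)
```

Any object with a umatch method, such as the gm matcher from the example above, could be passed in place of the stub.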
Gene name matching
The base class for all the following gene matchers is Matcher.
class orangecontrib.bio.gene.Matcher
    Matches an input gene to some target gene (set in advance).

    explain(gene)
        Return gene matches with explanations as lists of tuples: a list of matched target genes and the corresponding set of gene aliases.

    match(gene)
        Return a list of target gene aliases which share a set of aliases with the input gene (can be empty).

    set_targets(targets)
        Set input list of gene names (a list of strings) as target genes.

    umatch(gene)
        Return a single (unique) matching target gene or None, if there are no matches or multiple matches.
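To make the semantics of match, umatch and explain concrete, here is a toy matcher built on plain Python sets. It is only a sketch of the idea of matching through shared alias sets, not the module's implementation: the class ToyAliasMatcher, its constructor arguments, and all gene names and IDs below are made up for illustration.

```python
class ToyAliasMatcher:
    """Illustrative alias-set matcher (hypothetical, not the library class)."""

    def __init__(self, alias_sets, ignore_case=True):
        self.ignore_case = ignore_case
        # each alias set holds all known names of one gene
        self.alias_sets = [frozenset(self._norm(a) for a in s)
                           for s in alias_sets]
        self.targets = {}

    def _norm(self, name):
        return name.lower() if self.ignore_case else name

    def set_targets(self, targets):
        # map the normalised name back to the original spelling
        self.targets = {self._norm(t): t for t in targets}

    def explain(self, gene):
        key = self._norm(gene)
        result = []
        for aliases in self.alias_sets:
            if key in aliases:
                hits = sorted(self.targets[a] for a in aliases
                              if a in self.targets)
                result.append((hits, set(aliases)))
        return result

    def match(self, gene):
        found = []
        for hits, _ in self.explain(gene):
            for target in hits:
                if target not in found:
                    found.append(target)
        return found

    def umatch(self, gene):
        found = self.match(gene)
        return found[0] if len(found) == 1 else None


# made-up alias sets: "shared" is an alias of two different genes
alias_sets = [
    {"geneA", "A1", "id:1"},
    {"geneB", "shared", "id:2"},
    {"geneC", "shared", "id:3"},
]
gm = ToyAliasMatcher(alias_sets)
gm.set_targets(["id:1", "id:2", "id:3"])

print(gm.match("A1"))        # ['id:1']
print(gm.umatch("genea"))    # id:1
print(gm.match("shared"))    # ['id:2', 'id:3']
print(gm.umatch("shared"))   # None
print(gm.match("unknown"))   # []
```

The ambiguous alias shows why umatch returns None for multiple matches: a caller that needs exactly one target should not receive an arbitrary one.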
This module provides the following gene matchers:
class orangecontrib.bio.gene.MatcherAliasesKEGG(organism, ignore_case=True)
    Alias: GMKEGG.

class orangecontrib.bio.gene.MatcherAliasesGO(organism, ignore_case=True)
    Alias: GMGO.

class orangecontrib.bio.gene.MatcherAliasesDictyBase(ignore_case=True)
    Alias: GMDicty.

class orangecontrib.bio.gene.MatcherAliasesNCBI(organism, ignore_case=True)
    Alias: GMNCBI.

class orangecontrib.bio.gene.MatcherAliasesEnsembl(organism, **kwargs)
    A matcher for Ensembl IDs. Alias: GMEnsemble.

class orangecontrib.bio.gene.MatcherDirect(ignore_case=True)
    Directly match target names. Can ignore case. Alias: GMDirect.
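Direct matching compares the input name with the target names themselves, without consulting any alias database. A small sketch of how a case-insensitive comparison might behave is shown below; the function direct_match is hypothetical and only mirrors the ignore_case option described above.

```python
def direct_match(gene, targets, ignore_case=True):
    """Return the targets whose name equals the input gene name."""
    if ignore_case:
        return [t for t in targets if t.lower() == gene.lower()]
    return [t for t in targets if t == gene]


targets = ["cct7", "pls1"]
print(direct_match("CCT7", targets))                     # ['cct7']
print(direct_match("CCT7", targets, ignore_case=False))  # []
```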
Gene name matchers can be applied in sequence (until the first match) or combined (overlapping sets of gene aliases of multiple gene matchers are combined) with the matcher function.
orangecontrib.bio.gene.matcher(matchers, direct=True, ignore_case=True)
    Builds a new matcher from a list of gene matchers. Apply matchers in the input list successively until a match is found. If an element of matchers is a list, combine matchers in the sublist by joining overlapping sets of aliases.

    Parameters:
        matchers (list) – Gene matchers.
        direct (bool) – If True, first try to match gene directly (a MatcherDirect is inserted in front of the gene matcher sequence).
        ignore_case (bool) – Passed to the added direct matcher.
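The difference between applying matchers in sequence and joining them can be sketched with plain sets. The code below is an illustration under simplifying assumptions, not the library's implementation: each "matcher" is reduced to a list of alias sets, and the functions join_alias_sets, match_in and match_sequence as well as all names in the data are hypothetical.

```python
def join_alias_sets(*collections):
    """Merge alias sets from several sources that share at least one alias."""
    merged = []  # invariant: the sets in merged are pairwise disjoint
    for collection in collections:
        for aliases in collection:
            current = set(aliases)
            rest = []
            for existing in merged:
                if existing & current:
                    current |= existing
                else:
                    rest.append(existing)
            rest.append(current)
            merged = rest
    return merged


def match_in(alias_sets, gene, targets):
    """Targets that share an alias set with the input gene."""
    targets = set(targets)
    hits = set()
    for aliases in alias_sets:
        if gene in aliases:
            hits |= set(aliases) & targets
    return sorted(hits)


def match_sequence(sources, gene, targets):
    """Try each source in turn and stop at the first non-empty match."""
    for alias_sets in sources:
        hits = match_in(alias_sets, gene, targets)
        if hits:
            return hits
    return []


# made-up data: source_a knows the target ID, source_b knows the queried name
source_a = [{"target:1", "symbol1"}]
source_b = [{"symbol1", "protein1"}]
targets = ["target:1"]

print(match_sequence([source_a, source_b], "protein1", targets))   # []
print(match_in(join_alias_sets(source_a, source_b), "protein1", targets))
# ['target:1']
```

Neither source alone links the queried name to the target; only the joined alias set, connected through the shared alias, contains both.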
The following example tries to match input genes onto KEGG gene aliases (genematch2.py).
import orangecontrib.bio.kegg
import orangecontrib.bio.gene

targets = orangecontrib.bio.kegg.KEGGOrganism("9606").get_genes() # KEGG gene IDs

gmkegg = orangecontrib.bio.gene.GMKEGG("9606")
gmgo = orangecontrib.bio.gene.GMGO("9606")
gmkegggo = orangecontrib.bio.gene.matcher([[gmkegg, gmgo]], direct=False) # joined matchers

gmkegg.set_targets(targets)
gmgo.set_targets(targets)
gmkegggo.set_targets(targets)

genes = [ "cct7", "pls1", "gdi1", "nfkb2", "a2a299" ]

print("%12s %12s %12s %12s" % ( "gene", "KEGG", "GO", "KEGG+GO" ))
for gene in genes:
    print("%12s %12s %12s %12s" % \
        (gene, gmkegg.umatch(gene), gmgo.umatch(gene), gmkegggo.umatch(gene)))
The results show that GO aliases alone cannot be matched onto KEGG gene IDs. For the last gene, only the joined GO and KEGG aliases produce a match:
        gene         KEGG           GO      KEGG+GO
        cct7    hsa:10574         None    hsa:10574
        pls1     hsa:5357         None     hsa:5357
        gdi1     hsa:2664         None     hsa:2664
       nfkb2     hsa:4791         None     hsa:4791
      a2a299         None         None     hsa:7052
The following example finds KEGG pathways with given genes (genematch_path.py).
import orangecontrib.bio.kegg
import orangecontrib.bio.gene

keggorg = orangecontrib.bio.kegg.KEGGOrganism("mmu")
kegg_genes = keggorg.get_genes()

query = [ "Fndc4", "Itgb8", "Cdc34", "Olfr1403" ]

gm = orangecontrib.bio.gene.GMKEGG("mmu") # use KEGG aliases for gene matching
gm.set_targets(kegg_genes) # set KEGG gene aliases as targets

for name in query:
    match = gm.umatch(name)
    if match:
        pwys = keggorg.get_pathways_by_genes([match])
        print(name + " is in")
        pathways = [ orangecontrib.bio.kegg.KEGGPathway(p).title for p in pwys ]
        if pathways:
            for a in pathways:
                print(' ' + a)
        else:
            print(' /')
Output:

Fndc4 is in
 /
Itgb8 is in
 PI3K-Akt signaling pathway
 Focal adhesion
 ECM-receptor interaction
 Cell adhesion molecules (CAMs)
 Regulation of actin cytoskeleton
 Hypertrophic cardiomyopathy (HCM)
 Arrhythmogenic right ventricular cardiomyopathy (ARVC)
 Dilated cardiomyopathy
Cdc34 is in
 Ubiquitin mediated proteolysis
 Herpes simplex infection
Olfr1403 is in
 Olfactory transduction