Gene name matching (gene)

Genes can have multiple aliases. When we combine data from different sources, for example expression data with GO gene sets, we have to match gene aliases representing the same genes. All implemented matching methods are based on sets of gene aliases.

Gene matchers in this module match genes to a user-specified set of target gene names. To match genes, initialize a gene matcher (a subclass of Matcher), set the target gene names with set_targets, and then call the match or umatch method. The following example (genematch1.py) matches gene names to NCBI gene IDs:

import orangecontrib.bio.gene

#matching targets are NCBI gene IDs
targets = orangecontrib.bio.gene.NCBIGeneInfo("Homo sapiens").keys()

gm = orangecontrib.bio.gene.GMNCBI("9606")
gm.set_targets(targets)

for gene in [ "cct7", "pls1", "gdi1", "nfkb2", "dlg7" ]:
    print('Gene ' + gene + ' is NCBI gene ' + str(gm.umatch(gene)))

Gene name matching

The base class for all the following gene matchers is Matcher.

class orangecontrib.bio.gene.Matcher

Matches an input gene to some target gene (set in advance).

explain(gene)

Return gene matches with explanations as a list of tuples. Each tuple holds a list of matched target genes and the corresponding set of gene aliases.

match(gene)

Return a list of target genes that share a set of aliases with the input gene (the list can be empty).

set_targets(targets)

Set input list of gene names (a list of strings) as target genes.

umatch(gene)

Return the single (unique) matching target gene, or None if there are no matches or more than one match.
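The relationship between explain, match and umatch can be illustrated with a small stand-alone sketch. This is not the orangecontrib.bio implementation: the ToyMatcher class, the alias sets and the identifiers such as "id:1" are hypothetical and exist only to show the semantics described above.

```python
# Toy illustration only: hypothetical alias sets, not the library's code.
class ToyMatcher:
    def __init__(self, alias_sets):
        # Each element is a set of names assumed to denote the same gene.
        self.alias_sets = [frozenset(s) for s in alias_sets]
        self.targets = set()

    def set_targets(self, targets):
        self.targets = set(targets)

    def explain(self, gene):
        # For every alias set containing the input gene, report the
        # targets found in that set together with the set itself.
        result = []
        for aliases in self.alias_sets:
            if gene in aliases:
                hits = sorted(aliases & self.targets)
                if hits:
                    result.append((hits, set(aliases)))
        return result

    def match(self, gene):
        # All targets that share at least one alias set with the gene.
        found = set()
        for hits, _ in self.explain(gene):
            found.update(hits)
        return sorted(found)

    def umatch(self, gene):
        # A single target only when the match is unambiguous.
        matches = self.match(gene)
        return matches[0] if len(matches) == 1 else None


alias_sets = [
    {"id:1", "abc1", "ABC1-old"},
    {"id:2", "xyz"},
    {"id:3", "xyz"},  # "xyz" is ambiguous: it appears in two alias sets
]
tm = ToyMatcher(alias_sets)
tm.set_targets(["id:1", "id:2", "id:3"])

print(tm.match("abc1"))      # ['id:1']
print(tm.umatch("abc1"))     # id:1
print(tm.match("xyz"))       # ['id:2', 'id:3']
print(tm.umatch("xyz"))      # None (more than one match)
print(tm.umatch("unknown"))  # None (no match)
```

In this sketch umatch returns None both for an unknown name and for an ambiguous one, so a caller that needs to tell the two cases apart would call match instead.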

This module provides the following gene matchers:

class orangecontrib.bio.gene.MatcherAliasesKEGG(organism, ignore_case=True)

Alias: GMKEGG.

class orangecontrib.bio.gene.MatcherAliasesGO(organism, ignore_case=True)

Alias: GMGO.

class orangecontrib.bio.gene.MatcherAliasesDictyBase(ignore_case=True)

Alias: GMDicty.

class orangecontrib.bio.gene.MatcherAliasesNCBI(organism, ignore_case=True)

Alias: GMNCBI.

class orangecontrib.bio.gene.MatcherAliasesEnsembl(organism, **kwargs)

A matcher for Ensembl IDs. Alias: GMEnsemble.

class orangecontrib.bio.gene.MatcherDirect(ignore_case=True)

Directly match target names. Can ignore case. Alias: GMDirect.
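As a rough sketch of what direct matching with optional case-insensitivity amounts to, consider the following stand-alone function. The name direct_match is hypothetical and is not part of orangecontrib.bio; it only mimics the behaviour described for MatcherDirect.

```python
def direct_match(gene, targets, ignore_case=True):
    """Return the targets equal to `gene`, optionally ignoring letter case."""
    if ignore_case:
        return [t for t in targets if t.lower() == gene.lower()]
    return [t for t in targets if t == gene]


targets = ["Cdc34", "Itgb8"]
print(direct_match("CDC34", targets))                     # ['Cdc34']
print(direct_match("CDC34", targets, ignore_case=False))  # []
```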

Gene name matchers can be applied in sequence (until the first match) or combined (overlapping sets of gene aliases of multiple gene matchers are combined) with the matcher function.

orangecontrib.bio.gene.matcher(matchers, direct=True, ignore_case=True)

Build a new matcher from a list of gene matchers. The matchers in the input list are applied successively until a match is found. If an element of matchers is itself a list, the matchers in that sublist are combined by joining their overlapping sets of aliases.

Parameters:
  • matchers (list) – Gene matchers.
  • direct (bool) – If True, first try to match gene directly (a MatcherDirect is inserted in front of the gene matcher sequence).
  • ignore_case (bool) – passed to the added direct matcher.
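The difference between applying matchers in sequence and joining them can be sketched in plain Python. The code below is a simplified model under stated assumptions, not the library's implementation: the helper names (join_alias_sets, match_in, match_sequence) and all gene names are hypothetical. It shows how a name known to only one alias source can reach a target known to only the other source once overlapping alias sets are merged.

```python
def join_alias_sets(*sources):
    """Merge alias sets from several sources whenever they overlap."""
    merged = []  # invariant: the groups in `merged` are pairwise disjoint
    for source in sources:
        for aliases in source:
            current = set(aliases)
            rest = []
            for group in merged:
                if group & current:
                    current |= group  # absorb every overlapping group
                else:
                    rest.append(group)
            merged = rest + [current]
    return merged


def match_in(alias_sets, gene, targets):
    """Return the targets that share an alias set with `gene`."""
    found = set()
    for aliases in alias_sets:
        if gene in aliases:
            found |= aliases & targets
    return sorted(found)


def match_sequence(stages, gene, targets):
    """Try each stage in order and stop at the first non-empty result."""
    for alias_sets in stages:
        found = match_in(alias_sets, gene, targets)
        if found:
            return found
    return []


# Hypothetical data: source_a knows the target ID and a symbol,
# source_b knows the symbol and an accession but no target ID.
source_a = [{"t:1", "sym1"}]
source_b = [{"sym1", "acc1"}]
targets = {"t:1"}

# In sequence, neither source alone links "acc1" to a target.
print(match_sequence([source_a, source_b], "acc1", targets))  # []

# Joined, the shared alias "sym1" connects "acc1" to "t:1".
joined = join_alias_sets(source_a, source_b)
print(match_sequence([joined], "acc1", targets))  # ['t:1']
```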

The following example tries to match input genes onto KEGG gene aliases (genematch2.py).

import orangecontrib.bio.kegg
import orangecontrib.bio.gene

targets = orangecontrib.bio.kegg.KEGGOrganism("9606").get_genes() #KEGG gene IDs

gmkegg = orangecontrib.bio.gene.GMKEGG("9606")
gmgo = orangecontrib.bio.gene.GMGO("9606")
gmkegggo = orangecontrib.bio.gene.matcher([[gmkegg, gmgo]], direct=False) #joined matchers

gmkegg.set_targets(targets)
gmgo.set_targets(targets)
gmkegggo.set_targets(targets)

genes = [ "cct7", "pls1", "gdi1", "nfkb2", "a2a299" ]

print("%12s %12s %12s %12s" % ( "gene", "KEGG", "GO", "KEGG+GO" ))
for gene in genes:
    print("%12s %12s %12s %12s" % \
        (gene, gmkegg.umatch(gene), gmgo.umatch(gene), gmkegggo.umatch(gene)))

The results show that GO aliases alone cannot be matched onto KEGG gene IDs. For the last gene, only the joined GO and KEGG aliases produce a match:

  gene         KEGG           GO      KEGG+GO
  cct7    hsa:10574         None    hsa:10574
  pls1     hsa:5357         None     hsa:5357
  gdi1     hsa:2664         None     hsa:2664
 nfkb2     hsa:4791         None     hsa:4791
a2a299         None         None     hsa:7052

The following example finds KEGG pathways with given genes (genematch_path.py).

import orangecontrib.bio.kegg
import orangecontrib.bio.gene

keggorg = orangecontrib.bio.kegg.KEGGOrganism("mmu")
kegg_genes = keggorg.get_genes() 

query = [ "Fndc4", "Itgb8", "Cdc34", "Olfr1403" ] 

gm = orangecontrib.bio.gene.GMKEGG("mmu") #use KEGG aliases for gene matching
gm.set_targets(kegg_genes) #set KEGG gene aliases as targets

for name in query:
    match = gm.umatch(name)
    if match:
        pwys = keggorg.get_pathways_by_genes([match])
        print(name + " is in")
        pathways = [ orangecontrib.bio.kegg.KEGGPathway(p).title for p in pwys ]
        if pathways:
            for a in pathways:
                print('  ' + a)
        else:
            print('  /')

Output:

Fndc4 is in
  /
Itgb8 is in
  PI3K-Akt signaling pathway
  Focal adhesion
  ECM-receptor interaction
  Cell adhesion molecules (CAMs)
  Regulation of actin cytoskeleton
  Hypertrophic cardiomyopathy (HCM)
  Arrhythmogenic right ventricular cardiomyopathy (ARVC)
  Dilated cardiomyopathy
Cdc34 is in
  Ubiquitin mediated proteolysis
  Herpes simplex infection
Olfr1403 is in
  Olfactory transduction