Gene Set Enrichment Analysis (GSEA, gsea)

Gene Set Enrichment Analysis (GSEA) [GSEA] aims to identify enriched gene sets given gene expression data for multiple samples with their phenotypes.

Examples: gene expression data

The following examples use a gene expression data set from the GEO database. We show the same analysis on two formats of data.

With samples as instances (in rows):

import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene, geo

gds = geo.GDS("GDS10")
data = gds.getdata(transpose=True) 

matcher = gene.matcher([gene.GMKEGG("Homo sapiens")])
genesets = geneset.collections((("KEGG",), "Homo sapiens"))

#the number of permutations (n) should be much higher
res = gsea.run(data, gene_sets=genesets, matcher=matcher, 
    min_part=0.05, permutation="phenotype", n=10, 
    phen_desc=data.domain["tissue"], gene_desc=True) 

print
print "GSEA results (descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED") 
for gs, resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]: 
    print "%-40s %6.3f %6.3f %6d %7d" % (gs.name[:30],
        resu["nes"], resu["fdr"], resu["size"], resu["matched_size"]) 

With samples as features (in columns):

import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene, geo

gds = geo.GDS("GDS10")
data = gds.getdata() 

matcher = gene.matcher([gene.GMKEGG("Homo sapiens")])
genesets = geneset.collections((("KEGG",), "Homo sapiens"))

#the number of permutations (n) should be much higher
res = gsea.run(data, gene_sets=genesets, matcher=matcher, 
    min_part=0.05, permutation="phenotype", n=10, 
    phen_desc="tissue", gene_desc="gene") 

print
print "GSEA results (descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED") 
for gs, resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]: 
    print "%-40s %6.3f %6.3f %6d %7d" % (gs.name[:30],
        resu["nes"], resu["fdr"], resu["size"], resu["matched_size"]) 

Both scripts output:

GSEA results (descriptor: tissue)
LABEL                                       NES    FDR   SIZE MATCHED
Porphyrin and chlorophyll meta           -1.817  0.000     43      23
Staphylococcus aureus infectio           -1.998  0.000     59      28
Non-homologous end-joining                1.812  0.000     13      12
Fanconi anemia pathway                    1.911  0.000     53      27
Cell cycle                                1.777  0.000    124     106
Glycine, serine and threonine            -2.068  0.000     39      29
HIF-1 signaling pathway                  -1.746  0.000    106      90
Ether lipid metabolism                   -1.788  0.000     42      27
Fc epsilon RI signaling pathwa           -1.743  0.000     70      53
B cell receptor signaling path           -1.782  0.000     72      62

Example: our own gene sets

We present a simple example on iris data set. Because data set is not a gene expression data set, we had to specify our own sets of features that belong together.

import Orange
import orangecontrib.bio.gsea
import orangecontrib.bio.gene
import orangecontrib.bio.geneset

data = Orange.data.Table("iris")

gen1 = orangecontrib.bio.geneset.GeneSets(dict([
    ("sepal", ["sepal length", "sepal width"]), 
    ("petal", ["petal length", "petal width", "petal color"])
    ]))

res = orangecontrib.bio.gsea.run(data, gene_sets=gen1, matcher=orangecontrib.bio.gene.matcher([]), min_size=2)
print "%5s  %6s %6s %s" % ("LABEL", "NES", "P-VAL", "GENES")
for gs,resu in res.items():
    print "%5s  %6.3f %6.3f %s" % (gs.id, resu["nes"], resu["p"], str(resu["genes"]))

The output:

LABEL     NES  P-VAL GENES
sepal   1.087  0.630 ['sepal width', 'sepal length']
petal  -1.117  0.771 ['petal width', 'petal length']

Example: directly passing correlation data

GSEA can also directly use correlation data between individual genes and a phenotype. If (1) input data with only one example (attribute names are gene names) or (2) there is only one continuous feature in the given data set (gene names are in the first Orange.feature.String. The following outputs ten pathways with smallest p-values.

import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene

dbc = dicty.DictyExpress()
data = dbc.get_data(sample='pkaC-', time="8")

#select the first chip (the first attribute)
data = data.translate([data.domain.attributes[0]], True)

matcher = gene.matcher([[gene.GMKEGG("dicty"), gene.GMDicty()]])
genesets =  geneset.collections((("KEGG",), "dicty"))

res = gsea.direct(data, matcher=matcher, min_part=0.05, 
    gene_sets=genesets)

print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "P-VAL", "SIZE", "MATCHED")
for name,resu in sorted(res.items()[:10], key=lambda x: x[1]["p"]): 
    print "%-40s %6.3f %6.3f %6d %7d" % (name.name[:35], resu["nes"],
        resu["p"], resu["size"], resu["matched_size"]) 

The output:

LABEL                                       NES  P-VAL   SIZE MATCHED
Biosynthesis of amino acids               1.407  0.056     58      40
beta-Alanine metabolism                   1.165  0.232     13      10
Taurine and hypotaurine metabolism        1.160  0.413      4       3
Porphyrin and chlorophyll metabolis      -0.990  0.517     14       5
Valine, leucine and isoleucine degr       0.897  0.585     29      21
Ether lipid metabolism                    0.713  0.857     10       6
Biosynthesis of unsaturated fatty a       0.659  0.922     10       6
Protein processing in endoplasmic r       0.647  0.941     71      40
RNA polymerase                            0.550  0.943     24       7
Glycosylphosphatidylinositol(GPI)-a      -0.540  0.946     19       4
[GSEA]Subramanian, Aravind et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005.