Gene Set Enrichment Analysis (GSEA, gsea
)¶
Gene Set Enrichment Analysis (GSEA) [GSEA] aims to identify enriched gene sets given gene expression data for multiple samples with their phenotypes.
Examples: gene expression data¶
The following examples use a gene expression data set from the GEO database. We show the same analysis on two formats of data.
With samples as instances (in rows):
import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene, geo
gds = geo.GDS("GDS10")
data = gds.getdata(transpose=True)
matcher = gene.matcher([gene.GMKEGG("Homo sapiens")])
genesets = geneset.collections((("KEGG",), "Homo sapiens"))
#the number of permutations (n) should be much higher
res = gsea.run(data, gene_sets=genesets, matcher=matcher,
min_part=0.05, permutation="phenotype", n=10,
phen_desc=data.domain["tissue"], gene_desc=True)
print
print "GSEA results (descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED")
for gs, resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]:
print "%-40s %6.3f %6.3f %6d %7d" % (gs.name[:30],
resu["nes"], resu["fdr"], resu["size"], resu["matched_size"])
With samples as features (in columns):
import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene, geo
gds = geo.GDS("GDS10")
data = gds.getdata()
matcher = gene.matcher([gene.GMKEGG("Homo sapiens")])
genesets = geneset.collections((("KEGG",), "Homo sapiens"))
#the number of permutations (n) should be much higher
res = gsea.run(data, gene_sets=genesets, matcher=matcher,
min_part=0.05, permutation="phenotype", n=10,
phen_desc="tissue", gene_desc="gene")
print
print "GSEA results (descriptor: tissue)"
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "FDR", "SIZE", "MATCHED")
for gs, resu in sorted(res.items(), key=lambda x: x[1]["fdr"])[:10]:
print "%-40s %6.3f %6.3f %6d %7d" % (gs.name[:30],
resu["nes"], resu["fdr"], resu["size"], resu["matched_size"])
Both scripts output:
GSEA results (descriptor: tissue)
LABEL NES FDR SIZE MATCHED
Porphyrin and chlorophyll meta -1.817 0.000 43 23
Staphylococcus aureus infectio -1.998 0.000 59 28
Non-homologous end-joining 1.812 0.000 13 12
Fanconi anemia pathway 1.911 0.000 53 27
Cell cycle 1.777 0.000 124 106
Glycine, serine and threonine -2.068 0.000 39 29
HIF-1 signaling pathway -1.746 0.000 106 90
Ether lipid metabolism -1.788 0.000 42 27
Fc epsilon RI signaling pathwa -1.743 0.000 70 53
B cell receptor signaling path -1.782 0.000 72 62
Example: our own gene sets¶
We present a simple example on iris data set. Because data set is not a gene expression data set, we had to specify our own sets of features that belong together.
import Orange
import orangecontrib.bio.gsea
import orangecontrib.bio.gene
import orangecontrib.bio.geneset
data = Orange.data.Table("iris")
gen1 = orangecontrib.bio.geneset.GeneSets(dict([
("sepal", ["sepal length", "sepal width"]),
("petal", ["petal length", "petal width", "petal color"])
]))
res = orangecontrib.bio.gsea.run(data, gene_sets=gen1, matcher=orangecontrib.bio.gene.matcher([]), min_size=2)
print "%5s %6s %6s %s" % ("LABEL", "NES", "P-VAL", "GENES")
for gs,resu in res.items():
print "%5s %6.3f %6.3f %s" % (gs.id, resu["nes"], resu["p"], str(resu["genes"]))
The output:
LABEL NES P-VAL GENES
sepal 1.087 0.630 ['sepal width', 'sepal length']
petal -1.117 0.771 ['petal width', 'petal length']
Example: directly passing correlation data¶
GSEA can also directly use correlation data between individual genes
and a phenotype. If (1) input data with only one example (attribute names
are gene names) or (2) there is only one continuous feature in the given
data set (gene names are in the first Orange.feature.String
.
The following outputs ten pathways with smallest p-values.
import Orange
from orangecontrib.bio import dicty, geneset, gsea, gene
dbc = dicty.DictyExpress()
data = dbc.get_data(sample='pkaC-', time="8")
#select the first chip (the first attribute)
data = data.translate([data.domain.attributes[0]], True)
matcher = gene.matcher([[gene.GMKEGG("dicty"), gene.GMDicty()]])
genesets = geneset.collections((("KEGG",), "dicty"))
res = gsea.direct(data, matcher=matcher, min_part=0.05,
gene_sets=genesets)
print "%-40s %6s %6s %6s %7s" % ("LABEL", "NES", "P-VAL", "SIZE", "MATCHED")
for name,resu in sorted(res.items()[:10], key=lambda x: x[1]["p"]):
print "%-40s %6.3f %6.3f %6d %7d" % (name.name[:35], resu["nes"],
resu["p"], resu["size"], resu["matched_size"])
The output:
LABEL NES P-VAL SIZE MATCHED
Biosynthesis of amino acids 1.407 0.056 58 40
beta-Alanine metabolism 1.165 0.232 13 10
Taurine and hypotaurine metabolism 1.160 0.413 4 3
Porphyrin and chlorophyll metabolis -0.990 0.517 14 5
Valine, leucine and isoleucine degr 0.897 0.585 29 21
Ether lipid metabolism 0.713 0.857 10 6
Biosynthesis of unsaturated fatty a 0.659 0.922 10 6
Protein processing in endoplasmic r 0.647 0.941 71 40
RNA polymerase 0.550 0.943 24 7
Glycosylphosphatidylinositol(GPI)-a -0.540 0.946 19 4
[GSEA] | Subramanian, Aravind et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 2005. |