NCBI's Gene Expression Omnibus interface (geo)
This module provides an interface to NCBI's Gene Expression Omnibus repository. It supports GEO DataSets query and retrieval.
In the following example, GDS.getdata() constructs a data set with genes in rows and samples in columns. Notice that the annotation about each sample is retained in .attributes.
>>> import orangecontrib.bio.geo
>>> gds = orangecontrib.bio.geo.GDS("GDS1676")
>>> data = gds.getdata()
>>> len(data)
717
>>> data[200]
[1.407, 1.268, 1.969, 1.516, 1.697, 1.609, 0.760, 0.073], {"gene":'CD70'}
>>> data.domain.attributes[0]
Orange.feature.Continuous 'GSM63816'
>>> data.domain.attributes[0].attributes
{'dose': '20 U/ml IL-2', 'infection': 'acute ', 'time': '1 d'}
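Because each attribute keeps the annotation of its sample, the annotations can be used directly in code, for instance to group the sample columns by one of the annotation types. A minimal sketch continuing the session above (the "infection" key is taken from the GDS1676 annotations shown; other data sets use different annotation types):
>>> from collections import defaultdict
>>> groups = defaultdict(list)
>>> for var in data.domain.attributes:
...     groups[var.attributes["infection"]].append(var.name)
...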
GDS classes
class orangecontrib.bio.geo.GDSInfo(force_update=False)

    Retrieve information about GEO DataSets. The class accesses the Orange server file that either resides on the local computer or is automatically retrieved from the Orange server. Calls to this class do not access any of NCBI's servers.

    Constructor returning the object with GEO DataSets information. If force_update is True, the constructor will download the GEO DataSets information file (gds_info.pickled) from the Orange server; otherwise it will first check for a local copy.

    An instance behaves like a dictionary: the keys are GEO DataSets IDs, and each value is a dictionary providing various information about the particular data set.

An example with GDSInfo:
>>> import orangecontrib.bio.geo
>>> info = orangecontrib.bio.geo.GDSInfo()
>>> list(info.keys())[:5]
['GDS2526', 'GDS2524', 'GDS2525', 'GDS2522', 'GDS1618']
>>> info['GDS2526']['title']
'c-MYC depletion effect on carcinoma cell lines'
>>> info['GDS2526']['platform_organism']
'Homo sapiens'
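Since GDSInfo behaves like a dictionary, it can also be scanned programmatically, for example to list all data sets for a given organism. A small sketch that relies only on the keys shown above (the exact count depends on the current GEO DataSets information file):
import orangecontrib.bio.geo

info = orangecontrib.bio.geo.GDSInfo()
# collect the IDs of data sets whose platform organism is human
human = [gds_id for gds_id, entry in info.items()
         if entry.get("platform_organism") == "Homo sapiens"]
print("%d human data sets" % len(human))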
class orangecontrib.bio.geo.GDS(gdsname, verbose=False, force_download=False)

    Retrieval of a specific GEO DataSet as an Orange.data.Table.

    Constructor returns an object that can retrieve the GEO DataSet (samples and gene expressions). It first checks the local cache directory to see whether the particular data file is already available locally; otherwise it downloads it from NCBI's GEO FTP site.

    Parameters:
        - gdsname – NCBI's ID for the data set, in the form "GDSn" where "n" is a GDS ID number.
        - force_download – Force the download.
    getdata(report_genes=True, merge_function=<function spots_mean>, sample_type=None, transpose=False, remove_unknown=None)

        Returns the GEO DataSet as an Orange.data.Table.

        Parameters:
            - report_genes – Microarray spots reported in the GEO data set can either be merged according to their gene ids (if True) or can be left as spots.
            - transpose – The output table can have spots/genes in rows and samples in columns (False, default) or samples in rows and spots/genes in columns (True).
            - sample_type – The type of annotation, or (if transpose is True) the type of class labels to be included in the data set. The entire annotation of samples will be included either in the class value or in the .attributes field of each data set attribute.
            - remove_unknown – Remove spots with sample profiles that include unknown values. They are removed if the proportion of samples with unknown values is above the threshold set by remove_unknown. If None, nothing is removed.
    sample_annotations(sample_type=None)

        Return a dictionary with sample annotation.

    sample_to_class(sample_type=None)

        Return class values for GDS samples.

    sample_types()

        Return a set of sample types.
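A minimal sketch combining getdata and the sample helpers (it reuses the GDS1676 data set and the "infection" annotation type from the first example; the exact values depend on the data set):
import orangecontrib.bio.geo

gds = orangecontrib.bio.geo.GDS("GDS1676")
print(gds.sample_types())                    # set of annotation types, e.g. dose, infection, time
print(gds.sample_annotations("infection"))   # sample annotation for the chosen type
print(gds.sample_to_class("infection"))      # class values assigned to the samples

# samples in rows, genes in columns, "infection" used as the class variable
data = gds.getdata(sample_type="infection", transpose=True)
print("%d samples, %d genes" % (len(data), len(data.domain.attributes)))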
Examples
The following script prints out information about a specific data set. It does not download the data set, just uses the (local) GEO data sets information file (geo_gds1.py).
import orangecontrib.bio.geo
import textwrap

gdsinfo = orangecontrib.bio.geo.GDSInfo()
gds = gdsinfo["GDS10"]

print("ID:")
print(gds["dataset_id"])
print("Features: ")
print(gds["feature_count"])
print("Genes:")
print(gds["gene_count"])
print("Organism:")
print(gds["platform_organism"])
print("PubMed ID:")
print(gds["pubmed_id"])
print("Sample types:")
for sampletype in set([sinfo["type"] for sinfo in gds["subsets"]]):
    ss = [sinfo["description"] for sinfo in gds["subsets"] if sinfo["type"] == sampletype]
    print(" %s (%s)" % (sampletype, ", ".join(ss)))
print("")
print("Description:")
print("\n".join(textwrap.wrap(gds["description"], 70)))
The output of this script is:
ID:
GDS10
Features:
39114
Genes:
29822
Organism:
Mus musculus
PubMed ID:
11827943
Sample types:
tissue (spleen, thymus)
disease state (diabetic, diabetic-resistant, nondiabetic)
strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)
Description:
Examination of spleen and thymus of type 1 diabetes nonobese diabetic
(NOD) mouse, four NOD-derived diabetes-resistant congenic strains and
two nondiabetic control strains.
Samples in GEO data sets belong to sample subsets, which in turn belong to specific types. The above GDS10 has three sample types, of which the subsets for the tissue type are spleen and thymus. For supervised data mining it would be useful to find out which data sets provide enough samples for each label. It is (semantically) convenient to perform classification within sample subsets of the same type. The following script goes through all data sets and finds those with enough samples within each of the subsets for a specific type. The function valid determines which subset types (if any) satisfy our criteria (geo_gds5.py).
import orangecontrib.bio.geo

def valid(info, n=40):
    """Return the set of subset types that have at least n samples in every subset."""
    invalid = set()
    subsets = set([sinfo["type"] for sinfo in info["subsets"]])
    for sampleinfo in info["subsets"]:
        if len(sampleinfo["sample_id"]) < n:
            invalid.add(sampleinfo["type"])
    return subsets.difference(invalid)

def report(stypes, info):
    """Pretty-print GDS and valid subset types."""
    for id, sts in stypes:
        print(id)
        for st in sts:
            gds = info[id]
            print(" %s:" % st +
                  ", ".join(["%s/%d" % (sinfo["description"], len(sinfo["sample_id"]))
                             for sinfo in gds["subsets"] if sinfo["type"] == st]))

gdsinfo = orangecontrib.bio.geo.GDSInfo()
valid_subset_types = [(id, valid(info)) for id, info in sorted(gdsinfo.items()) if valid(info)]
report(valid_subset_types, gdsinfo)
print('datasets = ' + str(len(valid_subset_types)))
print('type subsets = ' + str(sum(len(b) for _, b in valid_subset_types)))
The requested number of samples, n=40, is quite a stringent criterion, met (at the time of writing) by 40 data sets with 48 sample subsets. The output starts with:
GDS1292
tissue:raphe magnus/40, somatomotor cortex/43
GDS1293
tissue:raphe magnus/40, somatomotor cortex/41
GDS1412
protocol:no treatment/47, hormone replacement therapy/42
GDS1490
other:non-neural/50, neural/100
GDS1611
genotype/variation:wild type/48, upf1 null mutant/48
GDS2373
gender:male/82, female/48
GDS2808
protocol:training set/44, validation set/50
Let us now pick data set GDS2960 and see if we can predict the disease state. We will use logistic regression and measure AUC, the area under the ROC curve, within 10-fold cross-validation. AUC is the probability of correctly distinguishing the two classes (e.g., disease and control). From (geo_gds6.py):
import Orange
import orangecontrib.bio.geo

gds = orangecontrib.bio.geo.GDS("GDS2960")
data = gds.getdata(sample_type="disease state", transpose=True, report_genes=True)
print("Samples: %d, Genes: %d" % (len(data), len(data.domain.attributes)))

if Orange.__version__ > "3":
    learners = [Orange.classification.LogisticRegressionLearner()]
    results = Orange.evaluation.testing.CrossValidation(data, learners, k=10)
else:
    learners = [Orange.classification.logreg.LibLinearLogRegLearner()]
    results = Orange.evaluation.testing.cross_validation(learners, data, folds=10)

print("AUC = %.3f" % Orange.evaluation.scoring.AUC(results)[0])
The output of this script is:
Samples: 101, Genes: 4068
AUC = 0.983
The AUC for this data set is very high, indicating that it is almost trivial to separate the two classes using these gene expression data.