NCBI’s Gene Expression Omnibus interface (geo)

This module provides an interface to NCBI's Gene Expression Omnibus (GEO) repository. It supports querying and retrieval of GEO DataSets.

In the following example, GDS.getdata constructs a data set with genes in rows and samples in columns. Notice that the annotation about each sample is retained in .attributes.

>>> import orangecontrib.bio.geo
>>> gds = orangecontrib.bio.geo.GDS("GDS1676")
>>> data = gds.getdata()
>>> len(data)
717
>>> data[200]
[1.407, 1.268, 1.969, 1.516, 1.697, 1.609, 0.760, 0.073], {"gene":'CD70'}
>>> data.domain.attributes[0]
Orange.feature.Continuous 'GSM63816'
>>> data.domain.attributes[0].attributes
{'dose': '20 U/ml IL-2', 'infection': 'acute ', 'time': '1 d'}

GDS classes

class orangecontrib.bio.geo.GDSInfo(force_update=False)

Retrieve information about GEO DataSets. The class accesses the GEO DataSets information file that either resides on the local computer or is automatically retrieved from the Orange server. Calls to this class do not access NCBI's servers.

Constructor returning the object with GEO DataSets information. If force_update is True, the constructor will download the GEO DataSets information file (gds_info.pickled) from the Orange server; otherwise it will first check for a local copy.

An instance behaves like a dictionary: the keys are GEO DataSets IDs, and each value is a dictionary providing various information about the particular data set.

An example with GDSInfo:

>>> import orangecontrib.bio.geo
>>> info = orangecontrib.bio.geo.GDSInfo()
>>> list(info.keys())[:5]
['GDS2526', 'GDS2524', 'GDS2525', 'GDS2522', 'GDS1618']
>>> info['GDS2526']['title']
'c-MYC depletion effect on carcinoma cell lines'
>>> info['GDS2526']['platform_organism']
'Homo sapiens'

class orangecontrib.bio.geo.GDS(gdsname, verbose=False, force_download=False)

Retrieval of a specific GEO DataSet as an Orange.data.Table.

Constructor returns an object that can retrieve a GEO DataSet (samples and gene expressions). It first checks the local cache directory to see whether the particular data file is already stored locally; otherwise it downloads it from NCBI's GEO FTP site.

Parameters:
  • gdsname – An NCBI ID for the data set, in the form “GDSn” where “n” is a GDS ID number.
  • force_download – Force a fresh download even if a local copy exists.
getdata(report_genes=True, merge_function=<function spots_mean>, sample_type=None, transpose=False, remove_unknown=None)

Returns the GEO DataSet as an Orange.data.Table.

Parameters:
  • report_genes – Microarray spots reported in the GEO data set can either be merged according to their gene IDs (if True) or left as spots.
  • transpose – The output table can have spots/genes in rows and samples in columns (False, default) or samples in rows and spots/genes in columns (True).
  • sample_type – The type of annotation, or (if transpose is True) the type of class labels to be included in the data set. The entire annotation of samples will be included either in the class value or in the .attributes field of each data set attribute.
  • remove_unknown – Remove spots whose sample profiles include unknown values. A spot is removed if the proportion of samples with unknown values exceeds the threshold set by remove_unknown. If None, nothing is removed.
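The effect of report_genes can be sketched in plain Python. The following is a conceptual illustration only, not the library's internal implementation; the function and variable names are made up. With the default merge_function (spots_mean), spots mapping to the same gene ID are merged by averaging their expression profiles:

```python
from collections import defaultdict

def merge_spots_by_gene(spot_profiles, spot_to_gene):
    """Average the expression profiles of spots that map to the same gene,
    mimicking the default merge_function (spots_mean)."""
    grouped = defaultdict(list)
    for spot, profile in spot_profiles.items():
        if spot in spot_to_gene:  # spots without a known gene are left out
            grouped[spot_to_gene[spot]].append(profile)
    return {gene: [sum(values) / len(values) for values in zip(*profiles)]
            for gene, profiles in grouped.items()}

profiles = {"spot1": [1.0, 2.0], "spot2": [3.0, 4.0], "spot3": [5.0, 6.0]}
gene_map = {"spot1": "CD70", "spot2": "CD70", "spot3": "MYC"}
print(merge_spots_by_gene(profiles, gene_map))
# {'CD70': [2.0, 3.0], 'MYC': [5.0, 6.0]}
```

Here the two CD70 spots are averaged position-wise across samples, while the single MYC spot passes through unchanged.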
sample_annotations(sample_type=None)

Return a dictionary with sample annotation.

sample_to_class(sample_type=None)

Return class values for GDS samples.

sample_types()

Return a set of sample types.
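Conceptually, sample_to_class assigns each sample the description of the subset (of the chosen type) it belongs to. The sketch below is illustrative only, not the library's internals; it uses the subset record layout ("type", "description", "sample_id") that also appears in the example scripts below:

```python
def sample_to_class_sketch(subsets, sample_type):
    """Map each sample ID to the description of its subset of the given type
    (a conceptual stand-in for GDS.sample_to_class)."""
    mapping = {}
    for sinfo in subsets:
        if sinfo["type"] == sample_type:
            for sid in sinfo["sample_id"]:
                mapping[sid] = sinfo["description"]
    return mapping

subsets = [
    {"type": "tissue", "description": "spleen", "sample_id": ["GSM1", "GSM2"]},
    {"type": "tissue", "description": "thymus", "sample_id": ["GSM3"]},
]
print(sample_to_class_sketch(subsets, "tissue"))
# {'GSM1': 'spleen', 'GSM2': 'spleen', 'GSM3': 'thymus'}
```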

Examples

The following script prints out information about a specific data set. It does not download the data set; it only uses the (local) GEO DataSets information file (geo_gds1.py).

import orangecontrib.bio.geo
import textwrap

gdsinfo = orangecontrib.bio.geo.GDSInfo()
gds = gdsinfo["GDS10"]

print("ID:")
print(gds["dataset_id"])
print("Features: ")
print(gds["feature_count"])
print("Genes:")
print(gds["gene_count"])
print("Organism:")
print(gds["platform_organism"])
print("PubMed ID:")
print(gds["pubmed_id"])
print("Sample types:")
for sampletype in set([sinfo["type"] for sinfo in gds["subsets"]]):
    ss = [sinfo["description"] for sinfo in gds["subsets"] if sinfo["type"]==sampletype]
    print("  %s (%s)" % (sampletype, ", ".join(ss)))
print("")
print("Description:")
print("\n".join(textwrap.wrap(gds["description"], 70)))

The output of this script is:

ID:
GDS10
Features:
39114
Genes:
29822
Organism:
Mus musculus
PubMed ID:
11827943
Sample types:
  tissue (spleen, thymus)
  disease state (diabetic, diabetic-resistant, nondiabetic)
  strain (NOD, Idd3, Idd5, Idd3+Idd5, Idd9, B10.H2g7, B10.H2g7 Idd3)

Description:
Examination of spleen and thymus of type 1 diabetes nonobese diabetic
(NOD) mouse, four NOD-derived diabetes-resistant congenic strains and
two nondiabetic control strains.

Samples in GEO data sets belong to sample subsets, which in turn belong to specific types. The above GDS10 has three sample types, of which the subsets for the tissue type are spleen and thymus. For supervised data mining it would be useful to find out which data sets provide enough samples for each label. It is (semantically) convenient to perform classification within sample subsets of the same type. The following script goes through all data sets and finds those with enough samples within each of the subsets for a specific type. The function valid determines which subset types (if any) satisfy our criteria (geo_gds5.py).

import orangecontrib.bio.geo

def valid(info, n=40):
    """Return a set of subset types containing more than n samples in every subset"""
    invalid = set()
    subsets = set([sinfo["type"] for sinfo in info["subsets"]])
    for sampleinfo in info["subsets"]:
        if len(sampleinfo["sample_id"]) < n:
            invalid.add(sampleinfo["type"])
    return subsets.difference(invalid)

def report(stypes, info):
    """Pretty-print GDS and valid susbset types"""
    for id, sts in stypes:
        print(id)
        for st in sts:
            gds = info[id]
            print("  %s:" % st + \
                    ", ".join(["%s/%d" % (sinfo["description"], len(sinfo["sample_id"])) \
                    for sinfo in gds["subsets"] if sinfo["type"]==st]))

gdsinfo = orangecontrib.bio.geo.GDSInfo()
valid_subset_types = [(id, valid(info)) for id, info in sorted(gdsinfo.items()) if valid(info)]
report(valid_subset_types, gdsinfo)

print('datasets = ' + str(len(valid_subset_types)))
print('type subsets = ' + str(sum(len(b) for _,b in valid_subset_types)))

The requested number of samples, n=40, is quite a stringent criterion, met - at the time of writing - by 40 data sets with 48 sample subsets. The output starts with:

GDS1292
  tissue:raphe magnus/40, somatomotor cortex/43
GDS1293
  tissue:raphe magnus/40, somatomotor cortex/41
GDS1412
  protocol:no treatment/47, hormone replacement therapy/42
GDS1490
  other:non-neural/50, neural/100
GDS1611
  genotype/variation:wild type/48, upf1 null mutant/48
GDS2373
  gender:male/82, female/48
GDS2808
  protocol:training set/44, validation set/50

Let us now pick data set GDS2960 and see if we can predict the disease state. We will use logistic regression and, within 10-fold cross validation, measure AUC, the area under the ROC curve. AUC is the probability of correctly distinguishing the two classes (e.g., disease and control). From (geo_gds6.py):

import Orange
import orangecontrib.bio.geo

gds = orangecontrib.bio.geo.GDS("GDS2960")
data = gds.getdata(sample_type="disease state", transpose=True, report_genes=True)
print("Samples: %d, Genes: %d" % (len(data), len(data.domain.attributes)))

if Orange.__version__ > "3":
    learners = [ Orange.classification.LogisticRegressionLearner() ]
    results = Orange.evaluation.testing.CrossValidation(data, learners, k=10)
else:
    learners = [ Orange.classification.logreg.LibLinearLogRegLearner() ]
    results = Orange.evaluation.testing.cross_validation(learners, data, folds=10)


print("AUC = %.3f" % Orange.evaluation.scoring.AUC(results)[0])

The output of this script is:

Samples: 101, Genes: 4068
AUC = 0.983

The AUC for this data set is very high, indicating that using these gene expression data it is almost trivial to separate the two classes.
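The AUC reported above has a direct interpretation: it is the probability that a randomly chosen sample from one class is scored higher than a randomly chosen sample from the other. A minimal pure-Python sketch of this pairwise definition (illustrative only, not what Orange.evaluation.scoring.AUC runs internally):

```python
def pairwise_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half a correct ranking."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

# Three "disease" scores against three "control" scores: 8 of 9 pairs
# are ranked correctly, so AUC = 8/9.
print(pairwise_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2]))
```

An AUC of 0.983, as above, thus means that for almost every disease/control pair the classifier scores the disease sample higher.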