Chromosome distribution ======================= In this example, we will analyze the genomic locations of transcription factor targets, to determine if transription factors favor specific chromosomes. Also, we will compare transcription factors on the chromosomes they target, to see if there are transcription factors that target similar sets of chromosomes. Likewise, we will compare chromosomes, to see if there are chromosomes that attract common transcription factors. Importing the data ~~~~~~~~~~~~~~~~~~ We will use the transcription factor data that has been imported in the previous example. The same data can be obtained directly using:: >>> yeastract = Get.yeast.yeastract() Next to this transcription factor data, we also need the location of all genes on the chromosomes. This information can be found in the ``SGD_features.tab``, which can be obtained from yeastgenome.org. Unfortunately, similar to yeastract, also this file comes without fieldnames, so we specify those through the type:: rtype = """[feats:*]<(sgdid=bytes, feat_type=bytes, feat_qual=bytes, feat_name=bytes, gene_name=bytes, gene_aliases=bytes, feat_parent_name=bytes, sgdid_alias=bytes, chromosome=bytes, start=bytes, stop=bytes, strand=bytes[1], genetic_pos=bytes, coordinate_version=bytes[10], sequence_version=bytes, description=bytes)""" res = Read(Fetch("http://downloads.yeastgenome.org/curation/chromosomal_feature/SGD_features.tab"),dtype=rtype) Note that, instead of specifying the type, we could also just have named the slicees that were needed, for example using:: res = res/{'f3': 'feat_name', 'f8':'chromosome', 'f9':'start'} This would rename field 3, 8 and 9 (starting from 0!). Type casting ^^^^^^^^^^^^ While not all fields will be used in this example, for the purpose of the tutorial we will attempt to prepare the whole dataset for easy use. First, when reading a file like this one, all input data is in string(bytes) format. For some slices this is not the ideal format. Therefore, we change the types of certain slices from ``bytes`` to ``int`` and ``real`` types. This is an operation that is known as casting:: res = res.To(_.start, _.stop, Do=_.Cast("int?")) res = res.To(_.genetic_pos, Do=_.Cast("real64?")) ``To`` is a utility function, which allow one to apply other operations to a subselection of the slices in a data set. In this case, we cast the ``start`` and ``stop`` slice to an integer type, and the ``genetic_pos`` slice to a double floating point type. Note that we do specify ``int?``, i.e. with a question mark sign. The indicates that missing values (empty fields) are allowed. .. note:: Maybe you ask yourself why we do not use the following approach:: >>> res.genetic_pos = res.genetic_pos.Cast("real64?") The reason is that res could have been used in another query before executing this command. Changing res by performing this operation would therefore lead to some problems because of the lazy nature of query execution in Ibidas. It might be possible to allow this in the future, however it would require some trickery/overhead. So, for now, we use the approach with the ``To`` operation. Applying a regular Python function and filtering ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Next, we take a look at the ``gene_aliases`` slice, which contains multiple gene aliases separated by the '|' symbol. We would like to split these strings into individual names, and remove the empty names. For the split operation, we use the standard Python split function. The whole expression becomes:: >>> splitfunc = _.split('|') >>> res.gene_aliases.Each(splitfunc).Elems()[_ != ""] `splitfunc` is here a context operator based expression, which can be applied to a string in order to split it. ``Each`` applies a regular python function or a context object to each element in a slice. The slice returning from this has in this case as type lists of strings, as that is the output of the splitfunc operation. ``Elems`` `unpacks` this resulting list of names, such that subsequent operations will be performed on the list elements instead of the list itself. ``Filter``, denoted by the `[]`, only keeps elements (denoted by the context operator) that are unequal to the empty string. .. note:: Note that Ibidas cannot know what type will result from the function used in the ``Each`` operation. For that reason it will automatically perform type detection when necessary for subsequent operations. It is possible to prevent this by specifying the type at forehand. Also, instead of the context operation one can use regular python functions, which (at the moment) execute slightly faster:: >>> splitfunc = lambda x: x.split('|') >>> dtype = "[aliases:~]>> res.gene_aliases.Each(splitfunc, dtype=dtype).Elems()[_ != ""] (lambda allows one to define anonymous functions in Python) To make these modified gene_aliases slice part of the dataset, we apply them again using the ``To`` function, and store the results using ``Copy``:: splitfilter = _.Each(splitfunc, dtype=dtype).Elems()[_ != ""] yeast_feats = res.To(_.gene_aliases, Do=splitfilter).Copy() Short version ^^^^^^^^^^^^^ To obtain both datasets directly, use:: yeast_feats = Get.yeast.genomic_feats() yeastract = Get.yeast.yeastract() Linking the datasets ~~~~~~~~~~~~~~~~~~~~ Now, we have to link both the yeastract dataset and the genomic features dataset. This is done by matching the ``targets`` in the Yeastract dataset with the ``feat_name`` slice in the genomic features dataset. This can be accomplished using the ``Match`` operation, which links rows in two datasets based on equality of the entries in two slices. For example, we could use:: >>> tf_feat = yeastract |Match(_.target, _.feat_name)| yeast_feats to match both datasets on their target and feat_name slice. However, there is the small problem that both datasets have different upper/lowercase usage, due to which most target and feat_name names do not match with each other. So, instead, we convert each target and feat_name to upper case before matching:: >>> tf_feat = yeastract |Match(_.target.Each(str.upper), _.feat_name.Each(str.upper))| yeast_feats >>> tf_feat #only showing a few slices... Slices: | trans_factor | target | sgdid | feat_type | feat_qual ----------------------------------------------------------------------------------------------------------- Type: | bytes | bytes | bytes | bytes | bytes Dims: | yeastract_feats:* | yeastract_feats:* | yeastract_feats:* | yeastract_feats:* | yeastract_feats:* Data: | | | | | | Gcr2 | YAL008w | S000000006 | ORF | Verified | Met4 | YAL008w | S000000006 | ORF | Verified | Otu1 | YAL008w | S000000006 | ORF | Verified | ... | ... | ... | ... | ... When using a regular ``Match`` operation, any ``target`` row for which no entry can be found in ``feat_name`` will be left out, and vice versa (there are options to prevent this). Sidestep: Checking what is linked ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The linking of both datasets is now complete. In this section, we will determine which elements could not be linked, and see if we can do better. These steps are performed just to introduce some commands and concepts, and are not necessary to complete the example. First, we do a quick check to determine how many rows in the yeastract dataset could not be matched. A naive approach to this would be:: >>> yeastract.target.Count() - tf_feat.target.Count() Slices: | target ---------------- Type: | int64 Dims: | Data: | | 72 On a total of 48010 pairs, it appears thus that we lost only a few transcription factor-target pairs. This assumes however that `yeast_feats` did not have any non-unique names in `feat_name`, as repeated names will match multiple times to the same entry in yeastract, and thus increases the number of entries. As an illustration, say we have:: >>> d1 = Rep([1,2,3,3]) >>> d2 = Rep([1,3,3]) >>> d1 |Match| d2 Slices: | data --------------- Type: | int64 Dims: | d1:* Data: | | 1 | 3 | 3 | 3 | 3 Thus, two rows with 3's match in ``d1`` match each to two rows of 3's in ``d2``, resulting in 2 * 2 rows of 3's in the output. It is easy to determine that `yeast_feats` does not have such non-unique names, using:: >>> yeast_feats.feat_name[_ != ""].Get(_.Count() == _.Unique().Count()) Slices: | feat_name ------------------- Type: | bool Dims: | Data: | | True This command removes the empty feat_names (which do not occur in `yeastract`), and then counts the remaining feat_names, and compares this to a count of the remaining unique feat_names. However, even a better approach is to circumvent this extra assumption, by checking if the rows in yeastract do actually occur in tf_feat:: >>> (yeastract |Except| tf_feat.Get(_.trans_factor, _.target)).Count() Slices: | trans_factor | target ------------------------------- Type: | int64 | int64 Dims: | | Data: | | | 72 | 72 This introduces the ``Except`` command. This command only keeps rows of yeastract that do not occur in tf_feat. These rows are subsequently counted. Note that this gives the same answer as we had before. A shorter version of this command, that also scales to cases in which `yeastract` has many slices, is the following:: >>> (yeastract |Except| tf_feat.Get(*yeastract.Names)).Count() Next, we determine which target names could not matched:: >>> nonmatched = yeastract.target |Except| tf_feat.target >>> nonmatched.Show() Slices: | target --------------------------------------- Type: | bytes Dims: | syeastract_syeastract_feats:* Data: | | YLR157w-c | A1 | YJL012c-a | MALT | MALS | snR20 | A2 | YAR044w | RDN5 | YJL017w | ALD1 | YGR272c | YBL101w-b | YBL101w-c | YDL038c | YBL101w-a | TER1 | SUC6 | YDR524w-a | YDR474c | YBR075w | DEX2 Using ``Except``, we keep only the targets in yeastract that do not occur in ``tf_feat.target``. Another lower level way to accomplish the same result would be:: >>> non_matched = (yeastract.target.Set() - tf_feat.target.Set()).Elem() ``Set`` is used to pack the elements of the (by default last) dimension into a set. A set is a collection of objects in which each element is unique. That is, adding the string "YLR157W-C" multiple times to a set will result in a set with just one occurence of "YLR157W-C". Sets have some special operations defined on them. One of them is set substraction, which was used here. It removes all elements in the set of the first operand that also occur in the set of the second operand, leaving only the elements that do not occur in the second operand. In this case thus the elements that were not matched by the Match operation. Next, we use the ``Elem`` operation to unpack the resulting set. The names in the list suggest that we might find matching rows by looking either at the ``gene_name`` or ``gene_aliases`` column of the `yeast_feats` dataset Before we do this, we first convert each name in nonmatched to uppercase:: >>> nonmatched = nonmatched.Each(str.upper) First, we check the ``gene_name`` column. This does not give any matches however:: >>> nonmatched |In| yeast_feats.gene_name.Each(str.upper) Slices: | result ----------------------------- Type: | bool Dims: | stftargets_sfeats:* Data: | | False | False | False | ... (Use Show() to see the whole result). This introduces the ``In`` operation, which determines if elements in the left operand occur in the (by default last) dimension of the right operand. Next we look at the gene_aliases column. As you might remember this slice does contain nested arrays of aliases. So what will ``|In|`` return here?:: >>> nonmatched.Each(str.upper) |In| yeast_feats.gene_aliases.Each(str.upper) Slices: | result ---------------------------------------------------- Type: | bool Dims: | stftargets_sfeats:*>> Any(nonmatched |In| yeast_feats.gene_aliases.Each(str.upper)) Slices: | result ----------------------------- Type: | bool Dims: | stftargets_sfeats:* Data: | | True | True | True | ... This aggregates across the ``feats`` dimension, to determine if any of the features had any alias that matched something in our list. Indeed, we found matches for the targets. We will use the Match function to find which genes match to these non-matched targets (we could also have done this directly of course, but that would have prevented us from introducing some operations). Using Flat, we flatten the gene alias list, and then apply Match as we did before:: >>> nonmatched_feats = nonmatched |Match(_.target, _.gene_aliases.Each(str.upper))| yeast_feats.Flat() >>> nonmatched_feats Slices: | target | sgdid | feat_type | feat_qual | feat_name --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Type: | bytes[11] | bytes | bytes | bytes | bytes[11] Dims: | stftargets_sfeats_feats_falias~ | stftargets_sfeats_feats_falias~ | stftargets_sfeats_feats_falias~ | stftargets_sfeats_feats_falias~ | stftargets_sfeats_feats_falias~ Data: | | | | | | YLR157W-C | S000028678 | ORF | Uncharacterized | YLR157W-E | YAR044W | S000000081 | ORF | Verified | YAR042W | YBL101W-C | S000028598 | ORF | Uncharacterized | YBL100W-C | YBL101W-A | S000002148 | transposable_element_gene | | YBL100W-A | YJL017W | S000003553 | ORF | Uncharacterized | YJL016W | A1 | S000029660 | not in systematic sequence of ~ | | MATA1 | YJL012C-A | S000003549 | ORF | Verified | YJL012C | MALT | S000000502 | ORF | Verified | YBR298C | MALT | S000003521 | ORF | Verified | YGR289C | MALT | S000029681 | not in systematic sequence of ~ | | MAL21 | MALT | S000029686 | not in systematic sequence of ~ | | MAL41 | MALT | S000029658 | not in systematic sequence of ~ | | MAL61 | MALS | S000000503 | ORF | Verified | YBR299W | MALS | S000003524 | ORF | Verified | YGR292W | ... | ... | ... | ... | ... This shows a possible reason why some of these targets do not have an offical name, as they match to multiple genomic features. However, other targets only have a single corresonding genomic feature, and could have been linked. To improve our mapping, we decide to redo our match, and include rows that have a unique ``gene_alias`` match. Our strategy is as follows: 1. Filter out gene_aliases that occur multiple times, as we only want unique matches 2. Convert yeastract targets names that match to gene_aliases to the corresponding feat_names 3. Rematch the data. First, we determine what names need to be filtered, and filter these from the gene_aliases:: >>> unique_gene_aliases = yeast_feats.Flat().GroupBy(_.gene_aliases)[Count(_.feat_name) == 1].gene_aliases >>> name_alias_list = yeast_feats[_.gene_aliases |In| unique_gene_aliases] The first command flattens the nested gene_alias lists, to get a flat table (If there were would have been more than one nested list dimension, we would have had to specify `yeast_feats.Flat(_.gene_aliases)`). Next, we group the data on common gene_aliases, and then remove those gene_aliases that have more than more than one associated feat_name. Subsequently, we filter the yeast_feats table, such that we only keep the gene_aliases that are in the list of unique gene aliases. In the second step, we convert the yeastract names that occur in the gene_aliases. This can be done using the ``TakeFrom`` command:: >>> convert_table = name_alias_list.Get(_.gene_aliases.Each(str.upper), _.feat_name).Flat() >>> yeastract = yeastract.To(_.target, Do=_.Each(str.upper).TakeFrom(convert_table, keep_missing=True)) The TakeFrom command takes a two-slice table (convert_table), and converts the target names that occur in the first slice of the table to the names of the second slice of the table. We set keep_missing to true, to also keep the names that do not occur in the gene_aliases. Now we can redo our match, as we did before:: >>> tf_feat = yeastract |Match(_.target.Each(str.upper), _.feat_name.Each(str.upper))| yeast_feats Counting again the number of yeastract rows that could be matched, we find:: >>> (yeastract |Except| tf_feat.Get(*yeastract.Names)).Count() Slices: | trans_factor | target ------------------------------- Type: | int64 | int64 Dims: | | Data: | | | 6 | 6 Thus, 72 - 6 = 66 additional rows in yeastract have been matched. Short version ^^^^^^^^^^^^^ To obtain directly the results of the last section, do:: #remove non-unique gene_aliases >>> name_alias_list = yeast_feats[_.gene_aliases |In| _.Flat().GroupBy(_.gene_aliases)[Count(_.feat_name) == 1].gene_aliases] #convert yeastract target names that match to gene_aliases, to the corresponding feat_names >>> convert_table = name_alias_list.Get(_.gene_aliases.Each(str.upper), _.feat_name).Flat() >>> yeastract = yeastract.To(_.target, Do=_.Each(str.upper).TakeFrom(convert_table, keep_missing=True)) >>> tf_feat = yeastract |Match(_.target.Each(str.upper), _.feat_name.Each(str.upper))| yeast_feats Save dataset ~~~~~~~~~~~~ First, we save the current dataset. This can be done using:: >>> Save(tf_feat, 'tf_feat.dat') The data can be loaded again using:: >>> tf_feat = Load('tf_feat.dat') Chromosome distribution ~~~~~~~~~~~~~~~~~~~~~~~ We start with determining for each transcription factor the number of targets per chromosome. To do this, we use a two-dimensional group, grouping both on transcription factor and chromosome, and counting the number of targets per transcription_factor / chromosome pair:: >>> tf_feat = tf_feat.GroupBy(_.trans_factor, _.chromosome) >>> res = tf_feat.Get(_.trans_factor, _.chromosome, _.target.Count()/"count", _.start).Copy() >>> res Slices: | trans_factor | chromosome | count | start ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Type: | bytes | bytes | int64 | int64? Dims: | gtrans_factor:* | gchromosome:* | gtrans_factor:*>> Corr(res.count) However, the resulting correlations are positively biased as we did not control for the different numbers of genes on each chromosome. To normalize the count data, we divide by the total number of targets per chromosome:: >>> Corr(res.count.Cast("real64") / res.count.Sum("gtrans_factor").count) Slices: | count ----------------------------------------------------------------------------------------------------------------------------------------------------------------- Type: | real64 Dims: | gtrans_factor:*>> chr_normtf = res.To(_.count, Do=_.Cast("real64") / _.count.Sum("gchromosome")) >>> Corr(chr_normtf.count.Transpose()) Slices: | count ------------------------------------------------------------------------------------------------------------------------------------ Type: | real64 Dims: | gchromosome:*>> chr_normtf.Sort(_.chromosome.Cast("int?")).Get(_.chromosome, Corr(_.count.Transpose()/"chromo_corr")).Show() Slices: | chromosome | chromo_corr -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Type: | int64? | real64? Dims: | gchromosome:* | gchromosome:*>> from matplotlib.pylab import * res = chr_normtf.Sort(_.chromosome.Cast("int?")).Get(_.chromosome, Corr(_.count.Transpose()/"chromo_corr")) imshow(res.chromo_corr(), interpolation='nearest') xticks(Pos(res.chromosome)(), res.chromosome()) yticks(Pos(res.chromosome)(), res.chromosome()) colorbar() show() .. image:: chromo_corr.png Transcription factor specificity ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ As last step, we like to calculate to what extent transcription factors target specific chromosomes. First, we obtain a dataset that is normalized for counts per chromosome:: >>> chr_normchr = res.To(_.count, Do=_.Cast("real64") / _.count.Sum("gtrans_factor")) Next, we group for each TF the chromosome counts from low to high. Subsequently, we sum across the rows, for all transcription factors, to get the following result:: >>> chr_normtf.count.Sort().Sum("gtrans_factor") Slices: | count ------------------------- Type: | real64 Dims: | gchromosome:* Data: | | 0.0300120048019 | 0.19843303089 | 0.55413076386 | 0.653791379541 | 0.718104362671 | 0.776878372718 | 0.829423749173 | 0.885342987674 | 0.927864695259 | 0.962609282328 | 1.00208988772 | 1.05152541613 | 1.1006070164 | 1.15928017163 | 1.22155081093 | 1.30419432548 | 1.50241732864 | 3.12174441416 It seems that indeed there is some chromosome specificness for transcription factors (although making this a hard conclusion would probably require a permutation analysis). Try for yourself to see if the effect persists if you remove all transcription factors with less than 20 targets from the data. We plot the results using matplotlib:: >>> from matplotlib.pylab import * >>> plot(normalized_counts.Sort().Sum("gtrans_factor")()) >>> title("Chromosome specificness of transcription factors") >>> ylabel("Normalized target counts") >>> xlabel("Less visited --> Most visited chromosome") >>> show() .. image:: chromo_spec.png Summary ~~~~~~~ To directly get the results, do:: #data import >>> yeast_feats = Get.yeast.genomic_feats() >>> yeastract = Get.yeast.yeastract() #structurize data >>> res = yeastract |Match(_.target.Each(str.upper), _.feat_name.Each(str.upper))| yeast_feats >>> res = res.GroupBy(_.trans_factor, _.chromosome) >>> res = res.Get(_.trans_factor, _.chromosome, _.target.Count()/"count", _.start).Copy() #tf similarity >>> chr_normchr = res.To(_.count, Do=_.Cast("real64") / _.count.Sum("gtrans_factor")) >>> chr_normchr.Get(_.trans_factor, Corr(_.count)) #chromosome similarity, sorted on chromosome >>> chr_normtf = res.To(_.count, Do=_.Cast("real64") / _.count.Sum("gchromosome")) >>> chr_normtf.Sort(_.chromosome.Cast("int?")).Get(_.chromosome, Corr(_.count.Transpose()/"chromo_corr")).Show() #tf specificity >>> chr_normtf.count.Sort().Sum("gtrans_factor")