Experiment class¶

class pytadbit.experiment.Experiment(name, resolution, hic_data=None, tad_def=None, parser=None, no_warn=False, weights=None, conditions=None, filter_columns=True)[source]¶

Hi-C experiment.

Parameters:

name – name of the experiment
resolution – the resolution of the experiment (size of a bin in bases)
hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
tad_def (None) – a file or a dict with precomputed TADs for this experiment
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
```
chrT_001       chrT_002        chrT_003        chrT_004
chrT_001       629     164     88      105
chrT_002       164     612     175     110
chrT_003       88      175     437     100
chrT_004       105     110     100     278
```
the output of parser(‘example.tsv’) would be be: [([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]
conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme... (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs
filter_columns (True) – filter the columns with unexpectedly high content of low values

TODO: doc conditions TODO: normalization

get_hic_matrix()[source]¶

Return the Hi-C matrix.

Returns:	list of lists representing the Hi-C data matrix of the current experiment

get_hic_zscores(normalized=True, zscored=True, remove_zeros=True)[source]¶

Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.

Parameters:	normalized (True) – whether to normalize the result using the weights (see `normalize_hic()`) zscored (True) – calculate the z-score of the data remove_zeros (True) – remove null interactions

load_hic_data(hic_data, parser=None, wanted_resolution=None, data_resolution=None, filter_columns=True)[source]¶

Add a Hi-C experiment to the Chromosome object.

Parameters:

hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
name – name of the experiment
force (False) – overwrite the experiments loaded under the same name
parser (None) –
a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:
```
chrT_001   chrT_002        chrT_003        chrT_004
chrT_001   629     164     88      105
chrT_002   86      612     175     110
chrT_003   159     216     437     105
chrT_004   100     111     146     278
```
the output of parser(‘example.tsv’) would be: [([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278]), 4]
resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made
filter_columns (True) – filter the columns with unexpectedly high content of low values

load_tad_def(tad_def, weights=None)[source]¶

Add the Topologically Associated Domains definition detection to Slice

Parameters:	tad_def (None) – a file or a dict with precomputed TADs for this experiment name (None) – name of the experiment, if None f_name will be used weights (None) – Store information about the weights, corresponding to the normalization of the Hi-C data (see tadbit function documentation)

model_region(start, end, n_models=5000, n_keep=1000, n_cpus=1, verbose=0, keep_all=False, close_bins=1, outfile=None, config={'maxdist': 600, 'upfreq': 0.3, 'reference': 'victor corces dataset 2013', 'kforce': 5, 'lowfreq': -0.7, 'scale': 0.005})[source]¶

Parameters:

start – first bin to model (bin number)
end – last bin to model (bin number)
n_models (5000) – number of modes to generate
n_keep (1000) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
keep_all (False) – whether or not to keep the discarded models (if True, models will be stored under tructuralModels.bad_models)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (0) – the information printed can be: nothing (0), the objective function value the selected models (1), the objective function value of all the models (2), all the modeling information (3)

:param CONFIG[‘dmel_01’] a dictionary containing the standard

parameters used to generate the models. The dictionary should contain the keys kforce, maxdist, upfreq and lowfreq. Examples can be seen by doing:

from pytadbit.imp.CONFIG import CONFIG

where CONFIG is a dictionarry of dictionnaries to be passed to this function:

::

CONFIG = {
 'dmel_01': {
     # use these paramaters with the Hi-C data from:
     'reference' : 'victor corces dataset 2013',

     # Force applied to the restraints inferred to neighbor particles
     'kforce'    : 5,

     # Maximum experimental contact distance
     'maxdist'   : 600, # OPTIMIZATION: 500-1200

     # Minimum and maximum thresholds used to decide which experimental values have to be
     # included in the computation of restraints. Z-score values bigger than upfreq
     # and less that lowfreq will be include, whereas all the others will be rejected
     'upfreq'    : 0.3, # OPTIMIZATION: min/max Z-score

     'lowfreq'   : -0.7 # OPTIMIZATION: min/max Z-score

     }
 }

normalize_hic(method='visibility')[source]¶

Normalize the Hi-C data. This normalization step does the same of the pytadbit.tadbit.tadbit() function (default parameters),

It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.

The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j.

Parameters:

method (visibility) –

either ‘sqrt’ or ‘visibility’. Depending on this parameter, the weight of the Hi-C count in row I, column J of the Hi-C matrix will be, under ‘sqrt’:

                       _________________________________________
             \        / N                    N                  |
              \      / ___                  ___             
weight(I,J) =  \    /  \                    \           
                \  /   /__ (matrix(i, J)) * /__  (matrix(I, j))
                 \/    i=0                  j=0

and under ‘visibility’:

                 N                    N                 
                ___                  ___                
                \                    \                  
                /__ (matrix(i, J)) * /__  (matrix(I, j))
                i=0                  j=0                
weight(I,J) =  -----------------------------------------         
                         N     N                                 
                        ___   ___                                
                        \     \                                  
                        /__   /__ (matrix(i, j))
                        j=0   i=0

with N being the number or rows/columns of the Hi-C matrix in both cases.

Note that the default behavior (also used in pytadbit.tadbit.tadbit()) corresponds to method=’sqrt’.

optimal_imp_parameters(start, end, n_models=500, n_keep=100, n_cpus=1, upfreq_range=(0, 1, 0.1), close_bins=1, lowfreq_range=(-1, 0, 0.1), scale_range=[0.005], maxdist_range=(400, 1400), cutoff=300, outfile=None, verbose=True)[source]¶

Find the optimal set of parameters to be used for the 3D modeling in IMP.

Parameters:

start – first bin to model (bin number)
end – last bin to model (bin number)
n_models (500) – number of modes to generate
n_keep (100) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
n_cpus – number of CPUs to use
verbose (True) – if set to True, information about the distance, force and Z-score between particles will be printed
lowfreq_range ((-1,0,0.1)) – range of lowfreq values to be optimized. The last value of the input tuple is the incremental step for the lowfreq values
upfreq_range ((0,1,0.1,0.1)) – range of upfreq values to be optimized. The last value of the input tuple is the incremental step for the upfreq values
maxdist_range ((400,1400,100)) – upper and lower bounds used to search for the optimal maximum experimental distance. The last value of the input tuple is the incremental step for maxdist values
scale_range ([0.01]) – upper and lower bounds used to search for the optimal scale parameter (nm per nucleotide). The last value of the input tuple is the incremental step for scale parameter values
verbose – print the results to the standard output

Note

Each of the _range parameters accept tuples in the form

(start, end, step), or a list with the list of values to test.

E.g.:

scale_range=[0.001, 0.005, 0.006] will test these three values.
scale_range=(0.001, 0.005, 0.001) will test the values 0.001, 0.002, 0.003, 0.004 and 0.005

Returns:	a tuple containing: a 3D numpy array with the values of the correlations calculated the range of scale used the range of maxdist used the range of upfreq used the range of lowfreq used

print_hic_matrix(print_it=True)[source]¶

Return the Hi-C matrix as string

Returns:	list of lists representing the Hi-C data matrix of the current experiment

set_resolution(resolution, keep_original=True)[source]¶

Set a new value for the resolution. Copy the original data into Experiment._ori_hic and replace the Experiment.hic_data with the data corresponding to new data (pytadbit.Chromosome.compare_condition()).

Parameters:	resolution – an integer representing the resolution. This number must be a multiple of the original resolution, and higher than it keep_original (True) – either to keep or not the original data

write_interaction_pairs(fname, normalized=True, zscored=True, diagonal=False, cutoff=None, header=False, true_position=False, uniq=True, remove_zeros=False, focus=None)[source]¶

Creates a tab separated file with all the pairwise interactions.

Parameters:

fname – file name where to write the pairwise interactions
zscored (True) – computes the z-score of the log10(data)
normalized (True) – use the weights to normalize the data
cutoff (None) – if defined, only the zscores above the cutoff will be writen to the file
uniq (False) – only writes one representent per interacting pair
true_position (False) – if, true writes genomic coordinates, otherwise, genomic bin.
focus (None) – writes interactions between the start and stop bin passed to this parameter.

Experiment class¶

Table Of Contents

Previous topic

Next topic

Navigation

Experiment class¶

Table Of Contents

Previous topic

Next topic

Quick search

Navigation