Experiment class

class pytadbit.experiment.Experiment(name, resolution, hic_data=None, tad_def=None, parser=None, no_warn=False, weights=None, conditions=None, filter_columns=True)[source]

Hi-C experiment.

Parameters:
  • name – name of the experiment
  • resolution – the resolution of the experiment (size of a bin in bases)
  • hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
  • tad_def (None) – a file or a dict with precomputed TADs for this experiment
  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

    chrT_001       chrT_002        chrT_003        chrT_004
    chrT_001       629     164     88      105
    chrT_002       164     612     175     110
    chrT_003       88      175     437     100
    chrT_004       105     110     100     278

    the output of parser(‘example.tsv’) would be be: [([629, 164, 88, 105, 164, 612, 175, 110, 88, 175, 437, 100, 105, 110, 100, 278]), 4]

  • conditions (None) – list() of experimental conditions, e.g. the cell type, the enzyme... (i.e.: [‘HindIII’, ‘cortex’, ‘treatment’]). This parameter may be used to compare the effect of this conditions on the TADs
  • filter_columns (True) – filter the columns with unexpectedly high content of low values

TODO: doc conditions TODO: normalization

get_hic_matrix()[source]

Return the Hi-C matrix.

Returns:list of lists representing the Hi-C data matrix of the current experiment
get_hic_zscores(normalized=True, zscored=True, remove_zeros=True)[source]

Normalize the Hi-C raw data. The result will be stored into the private Experiment._zscore list.

Parameters:
  • normalized (True) – whether to normalize the result using the weights (see normalize_hic())
  • zscored (True) – calculate the z-score of the data
  • remove_zeros (True) – remove null interactions
load_hic_data(hic_data, parser=None, wanted_resolution=None, data_resolution=None, filter_columns=True)[source]

Add a Hi-C experiment to the Chromosome object.

Parameters:
  • hic_data (None) – whether a file or a list of lists corresponding to the Hi-C data
  • name – name of the experiment
  • force (False) – overwrite the experiments loaded under the same name
  • parser (None) –

    a parser function that returns a tuple of lists representing the data matrix and the length of a row/column. With the file example.tsv:

    chrT_001   chrT_002        chrT_003        chrT_004
    chrT_001   629     164     88      105
    chrT_002   86      612     175     110
    chrT_003   159     216     437     105
    chrT_004   100     111     146     278

    the output of parser(‘example.tsv’) would be: [([629, 86, 159, 100, 164, 612, 216, 111, 88, 175, 437, 146, 105, 110, 105, 278]), 4]

  • resolution (None) – resolution of the experiment in the file; it will be adjusted to the resolution of the experiment. By default the file is expected to contain a Hi-C experiment with the same resolution as the pytadbit.Experiment created, and no change is made
  • filter_columns (True) – filter the columns with unexpectedly high content of low values
load_tad_def(tad_def, weights=None)[source]

Add the Topologically Associated Domains definition detection to Slice

Parameters:
  • tad_def (None) – a file or a dict with precomputed TADs for this experiment
  • name (None) – name of the experiment, if None f_name will be used
  • weights (None) – Store information about the weights, corresponding to the normalization of the Hi-C data (see tadbit function documentation)
model_region(start, end, n_models=5000, n_keep=1000, n_cpus=1, verbose=0, keep_all=False, close_bins=1, outfile=None, config={'maxdist': 600, 'upfreq': 0.3, 'reference': 'victor corces dataset 2013', 'kforce': 5, 'lowfreq': -0.7, 'scale': 0.005})[source]
Parameters:
  • start – first bin to model (bin number)
  • end – last bin to model (bin number)
  • n_models (5000) – number of modes to generate
  • n_keep (1000) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
  • keep_all (False) – whether or not to keep the discarded models (if True, models will be stored under tructuralModels.bad_models)
  • close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
  • n_cpus – number of CPUs to use
  • verbose (0) – the information printed can be: nothing (0), the objective function value the selected models (1), the objective function value of all the models (2), all the modeling information (3)
:param CONFIG[‘dmel_01’] a dictionary containing the standard

parameters used to generate the models. The dictionary should contain the keys kforce, maxdist, upfreq and lowfreq. Examples can be seen by doing:

from pytadbit.imp.CONFIG import CONFIG

where CONFIG is a dictionarry of dictionnaries to be passed to this function:

::

CONFIG = {
 'dmel_01': {
     # use these paramaters with the Hi-C data from:
     'reference' : 'victor corces dataset 2013',

     # Force applied to the restraints inferred to neighbor particles
     'kforce'    : 5,

     # Maximum experimental contact distance
     'maxdist'   : 600, # OPTIMIZATION: 500-1200

     # Minimum and maximum thresholds used to decide which experimental values have to be
     # included in the computation of restraints. Z-score values bigger than upfreq
     # and less that lowfreq will be include, whereas all the others will be rejected
     'upfreq'    : 0.3, # OPTIMIZATION: min/max Z-score

     'lowfreq'   : -0.7 # OPTIMIZATION: min/max Z-score

     }
 }
normalize_hic(method='visibility')[source]

Normalize the Hi-C data. This normalization step does the same of the pytadbit.tadbit.tadbit() function (default parameters),

It fills the Experiment.norm variable with the Hi-C values divided by the calculated weight.

The weight of a given cell in column i and row j corresponds to the square root of the product of the sum of column i by the sum of row j.

Parameters:method (visibility) –

either ‘sqrt’ or ‘visibility’. Depending on this parameter, the weight of the Hi-C count in row I, column J of the Hi-C matrix will be, under ‘sqrt’:

                       _________________________________________
             \        / N                    N                  |
              \      / ___                  ___             
weight(I,J) =  \    /  \                    \           
                \  /   /__ (matrix(i, J)) * /__  (matrix(I, j))
                 \/    i=0                  j=0

and under ‘visibility’:

                 N                    N                 
                ___                  ___                
                \                    \                  
                /__ (matrix(i, J)) * /__  (matrix(I, j))
                i=0                  j=0                
weight(I,J) =  -----------------------------------------         
                         N     N                                 
                        ___   ___                                
                        \     \                                  
                        /__   /__ (matrix(i, j))
                        j=0   i=0                                

with N being the number or rows/columns of the Hi-C matrix in both cases.

Note that the default behavior (also used in pytadbit.tadbit.tadbit()) corresponds to method=’sqrt’.

optimal_imp_parameters(start, end, n_models=500, n_keep=100, n_cpus=1, upfreq_range=(0, 1, 0.1), close_bins=1, lowfreq_range=(-1, 0, 0.1), scale_range=[0.005], maxdist_range=(400, 1400), cutoff=300, outfile=None, verbose=True)[source]

Find the optimal set of parameters to be used for the 3D modeling in IMP.

Parameters:
  • start – first bin to model (bin number)
  • end – last bin to model (bin number)
  • n_models (500) – number of modes to generate
  • n_keep (100) – number of models used in the final analysis (usually the top 20% of the generated models). The models are ranked according to their objective function value (the lower the better)
  • close_bins (1) – number of particles away (i.e. the bin number difference) a particle pair must be in order to be considered as neighbors (e.g. 1 means consecutive particles)
  • n_cpus – number of CPUs to use
  • verbose (True) – if set to True, information about the distance, force and Z-score between particles will be printed
  • lowfreq_range ((-1,0,0.1)) – range of lowfreq values to be optimized. The last value of the input tuple is the incremental step for the lowfreq values
  • upfreq_range ((0,1,0.1,0.1)) – range of upfreq values to be optimized. The last value of the input tuple is the incremental step for the upfreq values
  • maxdist_range ((400,1400,100)) – upper and lower bounds used to search for the optimal maximum experimental distance. The last value of the input tuple is the incremental step for maxdist values
  • scale_range ([0.01]) – upper and lower bounds used to search for the optimal scale parameter (nm per nucleotide). The last value of the input tuple is the incremental step for scale parameter values
  • verbose – print the results to the standard output

Note

Each of the _range parameters accept tuples in the form

(start, end, step), or a list with the list of values to test.

E.g.:
  • scale_range=[0.001, 0.005, 0.006] will test these three values.
  • scale_range=(0.001, 0.005, 0.001) will test the values 0.001, 0.002, 0.003, 0.004 and 0.005
Returns:a tuple containing:
  • a 3D numpy array with the values of the correlations calculated
  • the range of scale used
  • the range of maxdist used
  • the range of upfreq used
  • the range of lowfreq used
print_hic_matrix(print_it=True)[source]

Return the Hi-C matrix as string

Returns:list of lists representing the Hi-C data matrix of the current experiment
set_resolution(resolution, keep_original=True)[source]

Set a new value for the resolution. Copy the original data into Experiment._ori_hic and replace the Experiment.hic_data with the data corresponding to new data (pytadbit.Chromosome.compare_condition()).

Parameters:
  • resolution – an integer representing the resolution. This number must be a multiple of the original resolution, and higher than it
  • keep_original (True) – either to keep or not the original data
write_interaction_pairs(fname, normalized=True, zscored=True, diagonal=False, cutoff=None, header=False, true_position=False, uniq=True, remove_zeros=False, focus=None)[source]

Creates a tab separated file with all the pairwise interactions.

Parameters:
  • fname – file name where to write the pairwise interactions
  • zscored (True) – computes the z-score of the log10(data)
  • normalized (True) – use the weights to normalize the data
  • cutoff (None) – if defined, only the zscores above the cutoff will be writen to the file
  • uniq (False) – only writes one representent per interacting pair
  • true_position (False) – if, true writes genomic coordinates, otherwise, genomic bin.
  • focus (None) – writes interactions between the start and stop bin passed to this parameter.