genomfart.utils package¶
Subpackages¶
Submodules¶
genomfart.utils.bigDataFrame module¶
-
class
genomfart.utils.bigDataFrame.BigDataFrame(filename, header=True, sep='t', detect_type=True, assume_uniform_types=True, ignore_chars=None, maxsize=10000, byte_record_interval=10000)[source]¶ Bases:
objectClass acting as a DataFrame for large datasets
Not all of the dataset is loaded into memory. Rather, a cache of rows is maintained along with a handle and a position in the file
Methods
get_current_row_ind()Gets the index of the current row iterrows([cache, colspec])Iterates through rows in the dataframe, with the option of whether make_numpy_array([rows, cols])Create a 2-dimensional numpy array of data in the data frame. -
__init__(filename, header=True, sep='\t', detect_type=True, assume_uniform_types=True, ignore_chars=None, maxsize=10000, byte_record_interval=10000)[source]¶ Instantiates the big data frame
Parameters: filename : str
The file containing delimited data
header : boolean, optional
Whether the file has a header
rowLabs : int, optional
0-based column containing headers for the rows
sep : str, optional
The separateor of columns in the file (Can be a compiled regular expression)
detect_type : boolean, optional
Whether type should be guessed on loading. If false, everythin returned as str
assume_uniform_types : boolean, optional
Whether it should be assumed that the types in the first row apply to all rows. Only matters if detect_type is True
ignore_chars : Set
Characters that signal a line in the file should be skipped if it starts with one of those characters
maxsize : int, optional
The maximum number of rows to cache
byte_record_interval : int, optional
How often to record the starting byte of a row
-
get_current_row_ind()[source]¶ Gets the index of the current row
Returns: The zero-based index of the current row
-
iterrows(cache=False, colspec=None)[source]¶ Iterates through rows in the dataframe, with the option of whether or not to cache rows while moving through
Parameters: cache : boolean
Whether or not to cache rows while iterating. Choosing False will result in a performance increase while iterrating, but O(1) access to iterrated rows will not be available afterward
colspec : iterable
Particular columns (index, or name if there is a header) that should be included. If given, all other columns will be discarded
Returns: Generator of dictionaries for each row
-
make_numpy_array(rows=None, cols=None)[source]¶ Create a 2-dimensional numpy array of data in the data frame.
Parameters: rows : list of ints or Ranger.RangeSet
Either a list of rows to include in the array or a RangeSet that includes all row indices to be included in the RangeSet. If None, all rows will be included
cols : list of ints or Ranger.RangeSet
Either a list of columns to include in the array or a RangeSet that includes all columns to be included in the RangeSet. If None, all columns will be included
Returns: A numpy array containing the data
-
genomfart.utils.caching module¶
-
class
genomfart.utils.caching.LRUcache(retrieveFunc, delFunc=None, maxsize=1000, *args, **kwargs)[source]¶ Bases:
dictA least recently used cache
Examples
>>> from genomfart.utils.caching import LRUcache >>> >>> retreiveFunc = lambda k: k.upper() >>> def disposeFunc(k,v): ... print('%s Test' % v) ... return >>> >>> a = LRUcache(retreiveFunc, delFunc=disposeFunc, maxsize=3)
Methods
clear(() -> None. Remove all items from D.)copy(() -> a shallow copy of D)fromkeys(...)v defaults to None. get((k[,d]) -> D[k] if k in D, ...)has_key((k) -> True if D has a key k, else False)items(() -> list of D’s (key, value) pairs, ...)iteritems(() -> an iterator over the (key, ...)iterkeys(() -> an iterator over the keys of D)itervalues(...)keys(() -> list of D’s keys)pop((k[,d]) -> v, ...)If key is not found, d is returned if given, otherwise KeyError is raised popitem(() -> (k, v), ...)2-tuple; but raise KeyError if D is empty. setdefault((k[,d]) -> D.get(k,d), ...)update(([E, ...)If E present and has a .keys() method, does: for k in E: D[k] = E[k] values(() -> list of D’s values)viewitems(...)viewkeys(...)viewvalues(...)-
__init__(retrieveFunc, delFunc=None, maxsize=1000, *args, **kwargs)[source]¶ Instantiates the LRUcache
Parameters: retrieveFunc : callable
Function that takes a key as an argument and returns the value corresponding to that key
delFunc : callable
Function that takes key, value as an arguments and does something before deletion
maxsize : int
The maximum size the cache can keep before kicking keys out
-
genomfart.utils.genomeAnnotationGraph module¶
-
class
genomfart.utils.genomeAnnotationGraph.genomeAnnotationGraph[source]¶ Bases:
objectRepresentation of a genome, in which annotations in the genome are Ranges in a RangeBucketMap, and annotations can be hierarchical (e.g. transcripts are the children of genes). The hierarchy is stored as a directed graph
Methods
add_annotation(element_id, seqid, start, ...)Adds an annotation to the genome add_node_annotations(element_id, **annots)Adds annotations to all annotation dictionaries under a node. get_aa_indices(seqid, pos[, cds_type])Gets the indices (base-1) of the amino acid position in any get_cds_indices(seqid, pos[, cds_type])Gets the indices (base-1) of the nucleotide position in any get_closest_element_id(seqid, rangeStart, ...)Gets the element id(s) of the whatever element is closest to a range get_closest_element_id_of_type(seqid, ...[, ...])Gets the element id(s) of the whatever element is closest to a range get_codon_position(seqid, pos[, cds_type])Gets the indices (base-1) of the codon position (i.e. get_element_children_ids(element_id)Gets the ids of the children of an element get_element_ids_of_type(seqid, element_type)Gets element ids of some type along a coordinate system get_element_info(element_id)Gets information on a particular element get_element_parent_ids(element_id)Gets the ids of the parents of an element get_overlapping_element_ids(seqid, start, end)Gets the ids for any elements that overlap a given range get_overlapping_element_ids_of_type(seqid, ...)Gets the ids for any elements that overlap a given range and are overlaps_type(seqid, start, end, element_type)Returns True if the given range overlaps an element of the given type -
add_annotation(element_id, seqid, start, end, element_type, strand='?', parents=None, children=None, **attr)[source]¶ Adds an annotation to the genome
Note that you can add more than one annotation under an element id, so this can be preexisting. If it is preexisting, the Range and attribute dictionaries will be appended
Parameters: element_id : hashable
The id of the element you’re adding.
seqid : hashable
The name of the coordinate system (e.g. the chromosome)
start : int
The start of the element (inclusive)
end : int
The end of the element (inclusive)
element_type : str
The type of the element (e.g. ‘gene’ or ‘mRNA’)
strand : str, optional
The element strand (‘+’,’-‘,’?’)
parents : iterable of strings, optional
The parent element_ids of this element_id
children : iterable of strings, optional
The child element_ids of this element_id
attr : Further keyword arguments that will be stored as attributes
of the element
-
add_node_annotations(element_id, **annots)[source]¶ Adds annotations to all annotation dictionaries under a node. Note that if the key for an annotation is the same as a previously existing one, that annotation will be overwritten
Parameters: element_id : hashable
The id of the element you’re adding.
annots : keyword arguments
The elements you want to add
Examples
>>> my_genome.add_node_annotations('myNode1', expression1 = 0.2, expression2 = 10.)
-
get_aa_indices(seqid, pos, cds_type='CDS')[source]¶ Gets the indices (base-1) of the amino acid position in any CDS overlapping it
Parameters: seqid : str
The name of the coordinate system to check
pos : int
The position to check (1-based)
cds_type : str, optional
The element type containing the CDS ranges
Returns: Dictionary of CDS_ID -> AA position in protein (1-based)
-
get_cds_indices(seqid, pos, cds_type='CDS')[source]¶ Gets the indices (base-1) of the nucleotide position in any CDS overlapping it
Parameters: seqid : str
The name of the coordinate system to check
pos : int
The position to check (1-based)
cds_type : str, optional
The element type containing the CDS ranges
Returns: Dictionary of CDS_ID -> position in CDS (1-based)
-
get_closest_element_id(seqid, rangeStart, rangeEnd, radius=10000)[source]¶ Gets the element id(s) of the whatever element is closest to a range
Parameters: seqid : str
The name of the coordinate system to check
rangeStart : int
The position (inclusive) beginning the range for which you want the closest element
rangeEnd : int
The position (inclusive) ending the range for which you want the closest element
radius : int
How far on either side of the search range you want to search for the closest element
Returns: Set that will contain all element ids overlapping the search range
if there are any. Otherwise, it will contain the element(s) with the
shortest distance to the search range (only more than 1 if some elements
are equidistant)
-
get_closest_element_id_of_type(seqid, rangeStart, rangeEnd, element_type, radius=10000)[source]¶ Gets the element id(s) of the whatever element is closest to a range
Parameters: seqid : str
The name of the coordinate system to check
rangeStart : int
The position (inclusive) beginning the range for which you want the closest element
rangeEnd : int
The position (inclusive) ending the range for which you want the closest element
radius : int
How far on either side of the search range you want to search for the closest element
element_type : str
The type of the elements you want (e.g. ‘gene’ or ‘mRNA’)
Returns: Set that will contain all element ids overlapping the search range
if there are any. Otherwise, it will contain the element(s) with the
shortest distance to the search range (only more than 1 if some elements
are equidistant)
-
get_codon_position(seqid, pos, cds_type='CDS')[source]¶ Gets the indices (base-1) of the codon position (i.e. 1,2, or 3) within any CDS
Parameters: seqid : str
The name of the coordinate system to check
pos : int
The position to check (1-based)
cds_type : str, optional
The element type containing the CDS ranges
Returns: Dictionary of CDS_ID -> codon position
-
get_element_children_ids(element_id)[source]¶ Gets the ids of the children of an element
Parameters: element_id : str
The element for which you want the children
Returns: List of the children IDs of the element
Raises: KeyError
If the element is not present
-
get_element_ids_of_type(seqid, element_type, start=None, end=None)[source]¶ Gets element ids of some type along a coordinate system
Parameters: seqid : str
The name of the coordinate system to check
element_type : str
The type of the elements you want (e.g. ‘gene’ or ‘mRNA’)
start : int, optional
The start point for getting the elements (inclusive, 1-based)
end : int, optional
The end point for getting the elements (inclusive, 1-based)
Returns: Generator of element_ids
Raises: KeyError
If the seqid isn’t included
-
get_element_info(element_id)[source]¶ Gets information on a particular element
Parameters: element_id : str
The id of the element
Returns: Dictionary of {‘seqid’ -> seqid, type’->type, ‘strand’->’strand’, ‘Ranges’->[Ranges],
‘attributes’->[attribute_dicts]}
-
get_element_parent_ids(element_id)[source]¶ Gets the ids of the parents of an element
Parameters: element_id : str
The element for which you want the parents
Returns: List of the parent IDs of the element
Raises: KeyError
If the element is not present
-
get_overlapping_element_ids(seqid, start, end)[source]¶ Gets the ids for any elements that overlap a given range
Parameters: seqid : str
The name of the coordinate system to check
start : int
The start of the range to check (inclusive, 1-based)
end : int
The end of the range to check (inclusive, 1-based)
Returns: Set of ids for elements overlapping the range
Raises: KeyError
If the seqid is not present
-
get_overlapping_element_ids_of_type(seqid, start, end, element_type)[source]¶ Gets the ids for any elements that overlap a given range and are of a given type
Parameters: seqid : str
The name of the coordinate system to check
start : int
The start of the range to check (inclusive, 1-based)
end : int
The end of the range to check (inclusive, 1-based)
element_type : str
The element type that you want (e.g. ‘gene’, ‘mRNA’)
Returns: Set of ids for elements overlapping the range that are of the
element type
Raises: KeyError
If the seqid is not present
-
overlaps_type(seqid, start, end, element_type)[source]¶ Returns True if the given range overlaps an element of the given type
Parameters: seqid : str
The name of the coordinate system to check
start : int
The start of the range to check (inclusive, 1-based)
end : int
The end of the range to check (inclusive, 1-based)
element_type : str
The element type that you want (e.g. ‘gene’, ‘mRNA’)
Returns: True if the range overlaps at least 1 element of the give type, else False
Raises: KeyError
If the seqid is not present
-
genomfart.utils.polymorphism_formatter module¶
-
class
genomfart.utils.polymorphism_formatter.plink[source]¶ Static class used to work with PLINK-formatted data
Methods
write_geno_mat_to_ped(write_base, geno_mat, ...)Writes MAP and PED files for a given set of genotypes -
static
write_geno_mat_to_ped(write_base, geno_mat, samples, positions, chrom, families=None)[source]¶ Writes MAP and PED files for a given set of genotypes
Parameters: write_base : str
The base name of the resulting plink files
geno_mat : np.ndarray
A numpy matrix of integer genotypes (0,1,2). All markers are assumed to be biallelic. Rows are samples, and columns are positions
samples : iterable
The sample names, in the same order as matrix rows
positions : iterable
The positions of the markers, in the same order as matrix columns
chrom : str
The chromosome name
families : iterable, optional
Family name for each individual, if they are grouped by family
-
static
genomfart.utils.snp_projector module¶
-
class
genomfart.utils.snp_projector.snp_projector(chromosome, chrom_length, mapFile, rilFile, useAgpV2=False)[source]¶ Class used to project SNPs from a set of founders onto descendants
Methods
importMarkersForMap(rilFile)Reads the data file for a chromosome with the sample allele states projectAllSnps(chrom_length, popIndex, ...)Projects all SNPs on a chromsome onto descendants projectSnpBoolean(parents, pos, ...)Projects a SNP onto descendants if parent values are boolean -
__init__(chromosome, chrom_length, mapFile, rilFile, useAgpV2=False)[source]¶ Instantiates the projector
Parameters: chromosome : int
Chromosome to project onto
chrom_length : int
Length of the chromosome
mapFile : str
The filename for the map
rilFile : str
The filename for the RIL file
useAgpV2 : boolean
Whether this is version 2 of the AGPmap format
-
importMarkersForMap(rilFile)[source]¶ Reads the data file for a chromosome with the sample allele states
Parameters: rilFile : str
The filename for the RIL file
-
projectAllSnps(chrom_length, popIndex, founder_data, boolean=True, positions=None, enumerate_snps=False)[source]¶ Projects all SNPs on a chromsome onto descendants
Parameters: chrom_length : int
The length of the chromosome
popIndex : np.ndarray, int
Indices of the population for each sample
founder_data : genomfart.parsers.SNPdata
Object containing the founder SNP data
boolean : boolean
Whether the parents returned have boolean values for SNPs
positions : set of ints
A set of specific positions to project. If None, all SNPs will be projected
enumerate_snps : boolean
Whether to return both the absolute value of the SNP index along with the positions
Returns: Generator non-reference allele count for each RIL, as an array of
doubles
Examples
>>> from genomfart.utils.snp_projector import snp_projector >>> from genomfart.parsers.SNPdata import SNPdata >>> import numpy as np >>> import re >>> import os >>> chrom = 10 >>> chrom_length = 149686046 >>> home_dir = os.path.expanduser('~') >>> map_dir = os.path.join(home_dir,'Documents','Zea','Data','Sequence', ... 'NAM_imputed_maps','imputedMarkers.0.1cm') >>> founder_dir = os.path.join(home_dir,'Documents','Zea','Data','Sequence', ... 'NAM_founder_genos') >>> founder_file = os.path.join(founder_dir, ... 'gwas_snps_union_chr%d.h1.h2_minors_indels.cnv_genic_2kwin_500bpbin.20130605.txt' % chrom) >>> map_file = os.path.join(map_dir,'Imputed_0.1cm_master.txt') >>> ril_file = os.path.join(map_dir,'Imputed_0.1cm_chr%d_clean.txt' % chrom) >>> projector = snp_projector(chrom,chrom_length,map_file,ril_file) >>> popIndex = np.array(map(lambda x: (17 if x[0] == 'M' else int(re.search('(?<=Z)\d+(?=E)',x).group()))-1,) ... projector.samp_names), dtype=np.int64) >>> founder_data = SNPdata(chrom, founder_file, 'B73') >>> projection = projector.projectAllSnps(chrom_length, popIndex, founder_data)
-
genomfart.utils.version_mapper module¶
-
class
genomfart.utils.version_mapper.version_mapper(map_file)[source]¶ Attributes
graph Methods
v1_to_v2_map(v1_chr, v1_pos)Gets the position of the version 2 genome if it exists (base 1 assumed) v1_to_v2_seg_map(v1_chr, v1_start, v1_end)Gets the ranges of positions in the version 2 genome if they exist (base 1 assumed) v2_to_v1_map(v2_chr, v2_pos)Gets the position of the version 1 genome if it exists (base 1 assumed) v2_to_v1_seg_map(v2_chr, v2_start, v2_end)Gets the ranges of positions in the version 1 genome if they exist (base 1 assumed) -
__init__(map_file)[source]¶ Instantiates the version mapper
Parameters: map_file : str
A file that maps between the two assemblies. It should have columns v1_chrom, v1_start, v1_end, v2_chrom, v2_start, v2_end, orientation. It is also assumed to be base-1
-
graph= None¶
-
v1_end_dict= {}¶
-
v1_to_v2_map(v1_chr, v1_pos)[source]¶ Gets the position of the version 2 genome if it exists (base 1 assumed)
Parameters: v1_chr : hashable
The version 1 chromosome
v1_pos : int
The version 1 position
Returns: (v2_chrom,v2_pos,orientation relative to v1) if the position exists in version 2
or None if it doesn’t exist in version 2
-
v1_to_v2_seg_map(v1_chr, v1_start, v1_end)[source]¶ Gets the ranges of positions in the version 2 genome if they exist (base 1 assumed)
Parameters: v1_chr : hashable
The version 1 chromosome
v1_start : int
The version 1 start position (inclusive)
v1_end : int
The version 1 end position (inclusive)
Returns: Dictionary of (v1_chrom,v1_start,v1_end)->(v2_chrom,v2_start,v2_end,orientatoin relative to v1) for
ranges of positions existing in version 2
-
v2_end_dict= {}¶
-
v2_to_v1_map(v2_chr, v2_pos)[source]¶ Gets the position of the version 1 genome if it exists (base 1 assumed)
Parameters: v2_chr : hashable
The version 2 chromosome
v2_pos : int
The version 2 position
Returns: (v1_chrom,v1_pos,orientation relative to v2) if the position exists in version 1
or None if it doesn’t exist in version 1
-
v2_to_v1_seg_map(v2_chr, v2_start, v2_end)[source]¶ Gets the ranges of positions in the version 1 genome if they exist (base 1 assumed)
Parameters: v2_chr : hashable
The version 2 chromosome
v2_start : int
The version 2 start position (inclusive)
v2_end : int
The version 2 end position (inclusive)
Returns: Dictionary of (v2_chrom,v2_start,v2_end)->(v1_chrom,v1_start,v1_end,orientation relative to v2) for
ranges of positions existing in version 1
-