genomfart.utils package

Submodules

genomfart.utils.bigDataFrame module

class genomfart.utils.bigDataFrame.BigDataFrame(filename, header=True, sep='t', detect_type=True, assume_uniform_types=True, ignore_chars=None, maxsize=10000, byte_record_interval=10000)[source]

Bases: object

Class acting as a DataFrame for large datasets

Not all of the dataset is loaded into memory. Rather, a cache of rows is maintained along with a handle and a position in the file

Methods

get_current_row_ind() Gets the index of the current row
iterrows([cache, colspec]) Iterates through rows in the dataframe, with the option of whether
make_numpy_array([rows, cols]) Create a 2-dimensional numpy array of data in the data frame.
__init__(filename, header=True, sep='\t', detect_type=True, assume_uniform_types=True, ignore_chars=None, maxsize=10000, byte_record_interval=10000)[source]

Instantiates the big data frame

Parameters:

filename : str

The file containing delimited data

header : boolean, optional

Whether the file has a header

rowLabs : int, optional

0-based column containing headers for the rows

sep : str, optional

The separateor of columns in the file (Can be a compiled regular expression)

detect_type : boolean, optional

Whether type should be guessed on loading. If false, everythin returned as str

assume_uniform_types : boolean, optional

Whether it should be assumed that the types in the first row apply to all rows. Only matters if detect_type is True

ignore_chars : Set

Characters that signal a line in the file should be skipped if it starts with one of those characters

maxsize : int, optional

The maximum number of rows to cache

byte_record_interval : int, optional

How often to record the starting byte of a row

get_current_row_ind()[source]

Gets the index of the current row

Returns:The zero-based index of the current row
iterrows(cache=False, colspec=None)[source]

Iterates through rows in the dataframe, with the option of whether or not to cache rows while moving through

Parameters:

cache : boolean

Whether or not to cache rows while iterating. Choosing False will result in a performance increase while iterrating, but O(1) access to iterrated rows will not be available afterward

colspec : iterable

Particular columns (index, or name if there is a header) that should be included. If given, all other columns will be discarded

Returns:

Generator of dictionaries for each row

make_numpy_array(rows=None, cols=None)[source]

Create a 2-dimensional numpy array of data in the data frame.

Parameters:

rows : list of ints or Ranger.RangeSet

Either a list of rows to include in the array or a RangeSet that includes all row indices to be included in the RangeSet. If None, all rows will be included

cols : list of ints or Ranger.RangeSet

Either a list of columns to include in the array or a RangeSet that includes all columns to be included in the RangeSet. If None, all columns will be included

Returns:

A numpy array containing the data

genomfart.utils.caching module

class genomfart.utils.caching.LRUcache(retrieveFunc, delFunc=None, maxsize=1000, *args, **kwargs)[source]

Bases: dict

A least recently used cache

Examples

>>> from genomfart.utils.caching import LRUcache
>>>
>>> retreiveFunc = lambda k: k.upper()
>>> def disposeFunc(k,v):
...     print('%s Test' % v)
...     return
>>>
>>> a = LRUcache(retreiveFunc, delFunc=disposeFunc, maxsize=3)

Methods

clear(() -> None.  Remove all items from D.)
copy(() -> a shallow copy of D)
fromkeys(...) v defaults to None.
get((k[,d]) -> D[k] if k in D, ...)
has_key((k) -> True if D has a key k, else False)
items(() -> list of D’s (key, value) pairs, ...)
iteritems(() -> an iterator over the (key, ...)
iterkeys(() -> an iterator over the keys of D)
itervalues(...)
keys(() -> list of D’s keys)
pop((k[,d]) -> v, ...) If key is not found, d is returned if given, otherwise KeyError is raised
popitem(() -> (k, v), ...) 2-tuple; but raise KeyError if D is empty.
setdefault((k[,d]) -> D.get(k,d), ...)
update(([E, ...) If E present and has a .keys() method, does: for k in E: D[k] = E[k]
values(() -> list of D’s values)
viewitems(...)
viewkeys(...)
viewvalues(...)
__init__(retrieveFunc, delFunc=None, maxsize=1000, *args, **kwargs)[source]

Instantiates the LRUcache

Parameters:

retrieveFunc : callable

Function that takes a key as an argument and returns the value corresponding to that key

delFunc : callable

Function that takes key, value as an arguments and does something before deletion

maxsize : int

The maximum size the cache can keep before kicking keys out

genomfart.utils.genomeAnnotationGraph module

class genomfart.utils.genomeAnnotationGraph.genomeAnnotationGraph[source]

Bases: object

Representation of a genome, in which annotations in the genome are Ranges in a RangeBucketMap, and annotations can be hierarchical (e.g. transcripts are the children of genes). The hierarchy is stored as a directed graph

Methods

add_annotation(element_id, seqid, start, ...) Adds an annotation to the genome
add_node_annotations(element_id, **annots) Adds annotations to all annotation dictionaries under a node.
get_aa_indices(seqid, pos[, cds_type]) Gets the indices (base-1) of the amino acid position in any
get_cds_indices(seqid, pos[, cds_type]) Gets the indices (base-1) of the nucleotide position in any
get_closest_element_id(seqid, rangeStart, ...) Gets the element id(s) of the whatever element is closest to a range
get_closest_element_id_of_type(seqid, ...[, ...]) Gets the element id(s) of the whatever element is closest to a range
get_codon_position(seqid, pos[, cds_type]) Gets the indices (base-1) of the codon position (i.e.
get_element_children_ids(element_id) Gets the ids of the children of an element
get_element_ids_of_type(seqid, element_type) Gets element ids of some type along a coordinate system
get_element_info(element_id) Gets information on a particular element
get_element_parent_ids(element_id) Gets the ids of the parents of an element
get_overlapping_element_ids(seqid, start, end) Gets the ids for any elements that overlap a given range
get_overlapping_element_ids_of_type(seqid, ...) Gets the ids for any elements that overlap a given range and are
overlaps_type(seqid, start, end, element_type) Returns True if the given range overlaps an element of the given type
__init__()[source]

Instantiates the genomeAnnotationGraph

add_annotation(element_id, seqid, start, end, element_type, strand='?', parents=None, children=None, **attr)[source]

Adds an annotation to the genome

Note that you can add more than one annotation under an element id, so this can be preexisting. If it is preexisting, the Range and attribute dictionaries will be appended

Parameters:

element_id : hashable

The id of the element you’re adding.

seqid : hashable

The name of the coordinate system (e.g. the chromosome)

start : int

The start of the element (inclusive)

end : int

The end of the element (inclusive)

element_type : str

The type of the element (e.g. ‘gene’ or ‘mRNA’)

strand : str, optional

The element strand (‘+’,’-‘,’?’)

parents : iterable of strings, optional

The parent element_ids of this element_id

children : iterable of strings, optional

The child element_ids of this element_id

attr : Further keyword arguments that will be stored as attributes

of the element

add_node_annotations(element_id, **annots)[source]

Adds annotations to all annotation dictionaries under a node. Note that if the key for an annotation is the same as a previously existing one, that annotation will be overwritten

Parameters:

element_id : hashable

The id of the element you’re adding.

annots : keyword arguments

The elements you want to add

Examples

>>> my_genome.add_node_annotations('myNode1', expression1 = 0.2, expression2 = 10.)
get_aa_indices(seqid, pos, cds_type='CDS')[source]

Gets the indices (base-1) of the amino acid position in any CDS overlapping it

Parameters:

seqid : str

The name of the coordinate system to check

pos : int

The position to check (1-based)

cds_type : str, optional

The element type containing the CDS ranges

Returns:

Dictionary of CDS_ID -> AA position in protein (1-based)

get_cds_indices(seqid, pos, cds_type='CDS')[source]

Gets the indices (base-1) of the nucleotide position in any CDS overlapping it

Parameters:

seqid : str

The name of the coordinate system to check

pos : int

The position to check (1-based)

cds_type : str, optional

The element type containing the CDS ranges

Returns:

Dictionary of CDS_ID -> position in CDS (1-based)

get_closest_element_id(seqid, rangeStart, rangeEnd, radius=10000)[source]

Gets the element id(s) of the whatever element is closest to a range

Parameters:

seqid : str

The name of the coordinate system to check

rangeStart : int

The position (inclusive) beginning the range for which you want the closest element

rangeEnd : int

The position (inclusive) ending the range for which you want the closest element

radius : int

How far on either side of the search range you want to search for the closest element

Returns:

Set that will contain all element ids overlapping the search range

if there are any. Otherwise, it will contain the element(s) with the

shortest distance to the search range (only more than 1 if some elements

are equidistant)

get_closest_element_id_of_type(seqid, rangeStart, rangeEnd, element_type, radius=10000)[source]

Gets the element id(s) of the whatever element is closest to a range

Parameters:

seqid : str

The name of the coordinate system to check

rangeStart : int

The position (inclusive) beginning the range for which you want the closest element

rangeEnd : int

The position (inclusive) ending the range for which you want the closest element

radius : int

How far on either side of the search range you want to search for the closest element

element_type : str

The type of the elements you want (e.g. ‘gene’ or ‘mRNA’)

Returns:

Set that will contain all element ids overlapping the search range

if there are any. Otherwise, it will contain the element(s) with the

shortest distance to the search range (only more than 1 if some elements

are equidistant)

get_codon_position(seqid, pos, cds_type='CDS')[source]

Gets the indices (base-1) of the codon position (i.e. 1,2, or 3) within any CDS

Parameters:

seqid : str

The name of the coordinate system to check

pos : int

The position to check (1-based)

cds_type : str, optional

The element type containing the CDS ranges

Returns:

Dictionary of CDS_ID -> codon position

get_element_children_ids(element_id)[source]

Gets the ids of the children of an element

Parameters:

element_id : str

The element for which you want the children

Returns:

List of the children IDs of the element

Raises:

KeyError

If the element is not present

get_element_ids_of_type(seqid, element_type, start=None, end=None)[source]

Gets element ids of some type along a coordinate system

Parameters:

seqid : str

The name of the coordinate system to check

element_type : str

The type of the elements you want (e.g. ‘gene’ or ‘mRNA’)

start : int, optional

The start point for getting the elements (inclusive, 1-based)

end : int, optional

The end point for getting the elements (inclusive, 1-based)

Returns:

Generator of element_ids

Raises:

KeyError

If the seqid isn’t included

get_element_info(element_id)[source]

Gets information on a particular element

Parameters:

element_id : str

The id of the element

Returns:

Dictionary of {‘seqid’ -> seqid, type’->type, ‘strand’->’strand’, ‘Ranges’->[Ranges],

‘attributes’->[attribute_dicts]}

get_element_parent_ids(element_id)[source]

Gets the ids of the parents of an element

Parameters:

element_id : str

The element for which you want the parents

Returns:

List of the parent IDs of the element

Raises:

KeyError

If the element is not present

get_overlapping_element_ids(seqid, start, end)[source]

Gets the ids for any elements that overlap a given range

Parameters:

seqid : str

The name of the coordinate system to check

start : int

The start of the range to check (inclusive, 1-based)

end : int

The end of the range to check (inclusive, 1-based)

Returns:

Set of ids for elements overlapping the range

Raises:

KeyError

If the seqid is not present

get_overlapping_element_ids_of_type(seqid, start, end, element_type)[source]

Gets the ids for any elements that overlap a given range and are of a given type

Parameters:

seqid : str

The name of the coordinate system to check

start : int

The start of the range to check (inclusive, 1-based)

end : int

The end of the range to check (inclusive, 1-based)

element_type : str

The element type that you want (e.g. ‘gene’, ‘mRNA’)

Returns:

Set of ids for elements overlapping the range that are of the

element type

Raises:

KeyError

If the seqid is not present

overlaps_type(seqid, start, end, element_type)[source]

Returns True if the given range overlaps an element of the given type

Parameters:

seqid : str

The name of the coordinate system to check

start : int

The start of the range to check (inclusive, 1-based)

end : int

The end of the range to check (inclusive, 1-based)

element_type : str

The element type that you want (e.g. ‘gene’, ‘mRNA’)

Returns:

True if the range overlaps at least 1 element of the give type, else False

Raises:

KeyError

If the seqid is not present

genomfart.utils.polymorphism_formatter module

Static class used to work with PLINK-formatted data

Methods

write_geno_mat_to_ped(write_base, geno_mat, ...) Writes MAP and PED files for a given set of genotypes
static write_geno_mat_to_ped(write_base, geno_mat, samples, positions, chrom, families=None)[source]

Writes MAP and PED files for a given set of genotypes

Parameters:

write_base : str

The base name of the resulting plink files

geno_mat : np.ndarray

A numpy matrix of integer genotypes (0,1,2). All markers are assumed to be biallelic. Rows are samples, and columns are positions

samples : iterable

The sample names, in the same order as matrix rows

positions : iterable

The positions of the markers, in the same order as matrix columns

chrom : str

The chromosome name

families : iterable, optional

Family name for each individual, if they are grouped by family

genomfart.utils.snp_projector module

class genomfart.utils.snp_projector.snp_projector(chromosome, chrom_length, mapFile, rilFile, useAgpV2=False)[source]

Class used to project SNPs from a set of founders onto descendants

Methods

importMarkersForMap(rilFile) Reads the data file for a chromosome with the sample allele states
projectAllSnps(chrom_length, popIndex, ...) Projects all SNPs on a chromsome onto descendants
projectSnpBoolean(parents, pos, ...) Projects a SNP onto descendants if parent values are boolean
__init__(chromosome, chrom_length, mapFile, rilFile, useAgpV2=False)[source]

Instantiates the projector

Parameters:

chromosome : int

Chromosome to project onto

chrom_length : int

Length of the chromosome

mapFile : str

The filename for the map

rilFile : str

The filename for the RIL file

useAgpV2 : boolean

Whether this is version 2 of the AGPmap format

importMarkersForMap(rilFile)[source]

Reads the data file for a chromosome with the sample allele states

Parameters:

rilFile : str

The filename for the RIL file

projectAllSnps(chrom_length, popIndex, founder_data, boolean=True, positions=None, enumerate_snps=False)[source]

Projects all SNPs on a chromsome onto descendants

Parameters:

chrom_length : int

The length of the chromosome

popIndex : np.ndarray, int

Indices of the population for each sample

founder_data : genomfart.parsers.SNPdata

Object containing the founder SNP data

boolean : boolean

Whether the parents returned have boolean values for SNPs

positions : set of ints

A set of specific positions to project. If None, all SNPs will be projected

enumerate_snps : boolean

Whether to return both the absolute value of the SNP index along with the positions

Returns:

Generator non-reference allele count for each RIL, as an array of

doubles

Examples

>>> from genomfart.utils.snp_projector import snp_projector
>>> from genomfart.parsers.SNPdata import SNPdata
>>> import numpy as np
>>> import re
>>> import os
>>> chrom = 10
>>> chrom_length = 149686046
>>> home_dir = os.path.expanduser('~')
>>> map_dir = os.path.join(home_dir,'Documents','Zea','Data','Sequence',
... 'NAM_imputed_maps','imputedMarkers.0.1cm')
>>> founder_dir = os.path.join(home_dir,'Documents','Zea','Data','Sequence',
... 'NAM_founder_genos')
>>> founder_file = os.path.join(founder_dir,
... 'gwas_snps_union_chr%d.h1.h2_minors_indels.cnv_genic_2kwin_500bpbin.20130605.txt' % chrom)
>>> map_file = os.path.join(map_dir,'Imputed_0.1cm_master.txt')
>>> ril_file = os.path.join(map_dir,'Imputed_0.1cm_chr%d_clean.txt' % chrom)
>>> projector = snp_projector(chrom,chrom_length,map_file,ril_file)
>>> popIndex = np.array(map(lambda x: (17 if x[0] == 'M' else int(re.search('(?<=Z)\d+(?=E)',x).group()))-1,)
... projector.samp_names), dtype=np.int64)
>>> founder_data = SNPdata(chrom, founder_file, 'B73')
>>> projection = projector.projectAllSnps(chrom_length, popIndex, founder_data)
projectSnpBoolean(parents, pos, chrom_length, popIndex)[source]

Projects a SNP onto descendants if parent values are boolean

Parameters:

parents : np.ndarray, boolean

Array of parent genotypes

pos : int

Position of the SNP

popIndex : np.ndarray, int

Indices of the population for each sample

genomfart.utils.version_mapper module

class genomfart.utils.version_mapper.version_mapper(map_file)[source]

Attributes

graph  

Methods

v1_to_v2_map(v1_chr, v1_pos) Gets the position of the version 2 genome if it exists (base 1 assumed)
v1_to_v2_seg_map(v1_chr, v1_start, v1_end) Gets the ranges of positions in the version 2 genome if they exist (base 1 assumed)
v2_to_v1_map(v2_chr, v2_pos) Gets the position of the version 1 genome if it exists (base 1 assumed)
v2_to_v1_seg_map(v2_chr, v2_start, v2_end) Gets the ranges of positions in the version 1 genome if they exist (base 1 assumed)
__init__(map_file)[source]

Instantiates the version mapper

Parameters:

map_file : str

A file that maps between the two assemblies. It should have columns v1_chrom, v1_start, v1_end, v2_chrom, v2_start, v2_end, orientation. It is also assumed to be base-1

graph = None
v1_end_dict = {}
v1_to_v2_map(v1_chr, v1_pos)[source]

Gets the position of the version 2 genome if it exists (base 1 assumed)

Parameters:

v1_chr : hashable

The version 1 chromosome

v1_pos : int

The version 1 position

Returns:

(v2_chrom,v2_pos,orientation relative to v1) if the position exists in version 2

or None if it doesn’t exist in version 2

v1_to_v2_seg_map(v1_chr, v1_start, v1_end)[source]

Gets the ranges of positions in the version 2 genome if they exist (base 1 assumed)

Parameters:

v1_chr : hashable

The version 1 chromosome

v1_start : int

The version 1 start position (inclusive)

v1_end : int

The version 1 end position (inclusive)

Returns:

Dictionary of (v1_chrom,v1_start,v1_end)->(v2_chrom,v2_start,v2_end,orientatoin relative to v1) for

ranges of positions existing in version 2

v2_end_dict = {}
v2_to_v1_map(v2_chr, v2_pos)[source]

Gets the position of the version 1 genome if it exists (base 1 assumed)

Parameters:

v2_chr : hashable

The version 2 chromosome

v2_pos : int

The version 2 position

Returns:

(v1_chrom,v1_pos,orientation relative to v2) if the position exists in version 1

or None if it doesn’t exist in version 1

v2_to_v1_seg_map(v2_chr, v2_start, v2_end)[source]

Gets the ranges of positions in the version 1 genome if they exist (base 1 assumed)

Parameters:

v2_chr : hashable

The version 2 chromosome

v2_start : int

The version 2 start position (inclusive)

v2_end : int

The version 2 end position (inclusive)

Returns:

Dictionary of (v2_chrom,v2_start,v2_end)->(v1_chrom,v1_start,v1_end,orientation relative to v2) for

ranges of positions existing in version 1

Module contents