genomfart.parsers package

Submodules

genomfart.parsers.AGPmap module

class genomfart.parsers.AGPmap.AGPMap(mapFile, useAgpV2=False)[source]

Class used to parse and manipulate data from an AGPmap with cM positions.

The map can be in one of two formats, both tab-delimited.

The first has the chromosome in column 1, the marker in column 3, the cM position in column 4, and the AGP position in column 5.

The second has the marker in column 1, the chromosome in column 2, the cM position in column 3, and the AGP position in column 6.
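
The two column layouts can be sketched as a small reader. This is an illustration only, not the AGPMap implementation: `LAYOUTS`, `read_map_rows`, and the layout labels `first` and `second` are hypothetical names, and the sketch assumes the file has no header line.

```python
# Zero-based column indices for each layout described above.
# The labels "first" and "second" are illustrative only.
LAYOUTS = {
    "first": {"chromosome": 0, "marker": 2, "cM": 3, "agp": 4},
    "second": {"marker": 0, "chromosome": 1, "cM": 2, "agp": 5},
}


def read_map_rows(lines, layout):
    """Yield (chromosome, marker, cM, agp_position) from tab-delimited lines."""
    cols = LAYOUTS[layout]
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        yield (
            fields[cols["chromosome"]],
            fields[cols["marker"]],
            float(fields[cols["cM"]]),
            int(fields[cols["agp"]]),
        )


# Made-up rows; "x" and "y" stand in for columns the sketch ignores.
rows_first = list(read_map_rows(["1\tx\tm1\t0.5\t1200\n"], "first"))
rows_second = list(read_map_rows(["m1\t1\t0.5\tx\ty\t1200\n"], "second"))
```

Both calls yield the same tuple, which shows that only the column positions differ between the layouts.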

Methods

getCmFromPosition(chromosome, position) Gets cM position from chromosome bp
getFirstGeneticPosition(chromosome) Gets the first genetic position on a chromosome
getFirstMarkerName(chromosome) Gets the name of the first marker
getFlankingMarkerIndices(chromosome, ...) Gets the indices of the markers flanking a given genetic position
getGeneticPos(markerIndex) Gets the genetic position of a marker
getInterval(chromosome, position) Gets the markers flanking a position, with each marker's position in bp
getLastGeneticPosition(chromosome) Gets the last genetic position on a chromosome
getLastMarkerName(chromosome) Gets the name of the last marker
getMarkerNumber(marker_name) Gets the number of the marker
getPhysPos(markerIndex) Gets the physical position of a marker
getPositionFromCm(chromosome, cM) Gets chromosome bp from cM position
__init__(mapFile, useAgpV2=False)[source]

Instantiates the AGPMap parser

Parameters:

mapFile : str

The filename for the map

useAgpV2 : boolean

Whether this is version 2 of the AGPmap format

getCmFromPosition(chromosome, position)[source]

Gets cM position from chromosome bp

Parameters:

chromosome : int

The chromosome

position : int

The position

Returns:

cM position
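
One common way to convert a bp position to cM is linear interpolation between the two flanking markers. The sketch below shows that idea under stated assumptions: it is not confirmed here that AGPMap interpolates this way, and `interpolate_cm`, the clamping at the ends, and the marker lists are all hypothetical.

```python
from bisect import bisect_right


def interpolate_cm(phys_positions, genetic_positions, position):
    """Linearly interpolate a cM value for a bp position.

    phys_positions must be sorted ascending and parallel to
    genetic_positions. Positions outside the mapped range are clamped
    to the end markers (an assumption of this sketch).
    """
    if position <= phys_positions[0]:
        return genetic_positions[0]
    if position >= phys_positions[-1]:
        return genetic_positions[-1]
    right = bisect_right(phys_positions, position)  # first marker beyond position
    left = right - 1                                # last marker at or before it
    span_bp = phys_positions[right] - phys_positions[left]
    fraction = (position - phys_positions[left]) / span_bp
    return genetic_positions[left] + fraction * (
        genetic_positions[right] - genetic_positions[left]
    )


# Made-up map: markers at 1000, 2000, 4000 bp with 0.0, 1.0, 3.0 cM.
cm = interpolate_cm([1000, 2000, 4000], [0.0, 1.0, 3.0], 3000)
```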

getFirstGeneticPosition(chromosome)[source]

Gets the first genetic position on a chromosome

Parameters:

chromosome : int

The chromosome

Returns:

The first genetic position (in cM)

getFirstMarkerName(chromosome)[source]

Gets the name of the first marker

Parameters:

chromosome : int

The chromosome to get the first marker from

Returns:

The name of the first marker

getFlankingMarkerIndices(chromosome, geneticPosition)[source]

Gets the indices of the markers flanking a given genetic position

Parameters:

chromosome : int

The chromosome

geneticPosition : float

genetic position in cM

Returns:

left flank index, right flank index
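
A lookup of flanking indices can be sketched with a binary search over sorted genetic positions. This is a minimal sketch, not the library's code: `flanking_indices` is a hypothetical helper, and its handling of exact hits and out-of-range queries is an assumption.

```python
from bisect import bisect_left, bisect_right


def flanking_indices(genetic_positions, genetic_position):
    """Return (left, right) indices of the markers around a cM value.

    genetic_positions must be sorted ascending. An index is None when
    the query lies beyond the first or last marker. A query that hits a
    marker exactly returns that marker's index on both sides.
    """
    # First marker at or after the query.
    right = bisect_left(genetic_positions, genetic_position)
    # Last marker at or before the query.
    left = bisect_right(genetic_positions, genetic_position) - 1
    if right == len(genetic_positions):
        right = None
    if left < 0:
        left = None
    return left, right


flanks = flanking_indices([0.0, 1.0, 3.0], 2.0)
```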

getGeneticPos(markerIndex)[source]

Gets the genetic position of a marker

Parameters:

markerIndex : int

Index of the marker

Returns:

Genetic position of the marker

getInterval(chromosome, position)[source]

Gets the markers flanking a position on the chromosome, along with each marker's position in bp

Parameters:

chromosome : int

The chromosome

position : int

The position

Returns:

(left marker, left marker position, right marker, right marker position)

getLastGeneticPosition(chromosome)[source]

Gets the last genetic position on a chromosome

Parameters:

chromosome : int

The chromosome

Returns:

The last genetic position (in cM)

getLastMarkerName(chromosome)[source]

Gets the name of the last marker

Parameters:

chromosome : int

The chromosome to get the last marker from

Returns:

The name of the last marker

getMarkerNumber(marker_name)[source]

Gets the number of the marker

Parameters:

marker_name : str

The name of the marker

Returns:

Number of the marker

getPhysPos(markerIndex)[source]

Gets the physical position of a marker

Parameters:

markerIndex : int

Index of the marker

Returns:

Physical position of marker

getPositionFromCm(chromosome, cM)[source]

Gets chromosome bp from cM position

Parameters:

chromosome : int

The chromosome

cM : float

The position in cM

Returns:

The chromosome bp position

genomfart.parsers.SNPdata module

class genomfart.parsers.SNPdata.SNPdata(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)[source]

Methods

findTotalSnpNumber() Finds the total number of SNPs in the file
getAllele() Gets the allele configuration
getGenotype([skip_ref]) Gets the genotype of the founders at the current SNP
getNumberOfSnps() Gets the number of SNPs in the file
getPosition() Gets the current position on the chromosome
next() Reads the next line of the SNPs file
reset() Resets the reader to the first data line
__init__(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)[source]

Instantiates a reader of SNP data on founders

Parameters:

chromosome : int

The chromosome

snp_file : str

Name of file containing SNP data for the founders

ref_samp : str

The name of the reference sample (one of the columns, the one containing the allele state for 0)

findTotalSnpNumber()[source]

Finds the total number of SNPs in the file

getAllele()[source]

Gets the allele configuration

Returns:

The allele configuration

getGenotype(skip_ref=True)[source]

Gets the genotype of the founders at the current SNP

Parameters:

skip_ref : boolean

Whether to skip the reference sample

Returns:

Array of genotypes for each of the founders

getNumberOfSnps()[source]

Gets the number of SNPs in the file

Returns:

The number of SNPs in the file

getPosition()[source]

Gets the current position on the chromosome

Returns:

Position on the chromosome

next()[source]

Reads the next line of the SNPs file

reset()[source]

Resets the reader to the first data line

genomfart.parsers.gff module

class genomfart.parsers.gff.gff_parser(gff_file, exclude_types=None)[source]

Bases: genomfart.utils.genomeAnnotationGraph.genomeAnnotationGraph

Class used to parse and analyze GFF (version 3) files. The class represents any hierarchical structure as a directed graph and puts the individual pieces into a RangeBucketMap

All coordinates are 1-based

Examples

>>> from genomfart.parsers.gff import gff_parser
>>> from genomfart.data.data_constants import GFF_TEST_FILE
>>> parser = gff_parser(GFF_TEST_FILE)
>>> parser.get_overlapping_element_ids('Pt',100,4000)
set(['three_prime_UTR:Pt_3363_4490:-', 'exon:Pt_1674_3308:-',
'CDS:GRMZM5G836994_P01', 'exon:Pt_3363_4708:-',
'transcript:GRMZM5G811749_T01', 'transcript:GRMZM5G836994_T01',
'repeat_region:Pt_3550_3560:?', 'repeat_region:Pt_3683_3696:?',
'gene:GRMZM5G811749', 'gene:GRMZM5G836994', 'repeat_region:Pt_320_1262:+',
'repeat_region:Pt_3764_3775:?'])
>>> gene_info = parser.get_element_info('gene:GRMZM5G811749')
>>> gene_info
{'Ranges': [[3363 , 5604]], 'attributes': [{'logic_name': 'genebuilder',
'external_name': 'RPS16',
'description': '30S ribosomal protein S16%2C chloroplastic  [Source:UniProtKB/Swiss-Prot%3BAcc:P27723]', 'ID': 'gene:GRMZM5G811749',
'biotype': 'protein_coding'}], 'seqid': 'Pt', 'type': 'gene', 'strand': '-'}
>>> [x for x in parser.get_element_ids_of_type('Pt','gene',start=100,end=4000)]
['gene:GRMZM5G836994', 'gene:GRMZM5G811749']
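
The overlap test behind a query such as `get_overlapping_element_ids` can be sketched for 1-based ranges. The sketch assumes both ends are inclusive, as in the GFF3 format, and uses a linear scan rather than the RangeBucketMap the class uses; `ranges_overlap`, `overlapping_ids`, and the `elements` dictionary are hypothetical.

```python
def ranges_overlap(start_a, end_a, start_b, end_b):
    """True if two 1-based, inclusive ranges share at least one base."""
    return start_a <= end_b and start_b <= end_a


def overlapping_ids(elements, seqid, start, end):
    """Return the ids of elements on seqid that overlap [start, end]."""
    return {
        element_id
        for element_id, (elem_seqid, elem_start, elem_end) in elements.items()
        if elem_seqid == seqid and ranges_overlap(elem_start, elem_end, start, end)
    }


elements = {
    "gene:GRMZM5G811749": ("Pt", 3363, 5604),  # range from the example above
    "gene:hypothetical": ("Pt", 5000, 6000),   # made-up element for contrast
}
hits = overlapping_ids(elements, "Pt", 100, 4000)
```

Only the first gene is returned, because the made-up element starts after position 4000.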

Methods

add_annotation(element_id, seqid, start, ...) Adds an annotation to the genome
add_node_annotations(element_id, **annots) Adds annotations to all annotation dictionaries under a node.
get_aa_indices(seqid, pos[, cds_type]) Gets the indices (base-1) of the amino acid position in any
get_cds_indices(seqid, pos[, cds_type]) Gets the indices (base-1) of the nucleotide position in any
get_closest_element_id(seqid, rangeStart, ...) Gets the element id(s) of whatever element is closest to a range
get_closest_element_id_of_type(seqid, ...[, ...]) Gets the element id(s) of whatever element is closest to a range
get_codon_position(seqid, pos[, cds_type]) Gets the indices (base-1) of the codon position (i.e.
get_element_children_ids(element_id) Gets the ids of the children of an element
get_element_ids_of_type(seqid, element_type) Gets element ids of some type along a coordinate system
get_element_info(element_id) Gets information on a particular element
get_element_parent_ids(element_id) Gets the ids of the parents of an element
get_overlapping_element_ids(seqid, start, end) Gets the ids for any elements that overlap a given range
get_overlapping_element_ids_of_type(seqid, ...) Gets the ids for any elements that overlap a given range and are
overlaps_type(seqid, start, end, element_type) Returns True if the given range overlaps an element of the given type
__init__(gff_file, exclude_types=None)[source]

Instantiates the gff file

Parameters:

gff_file : str

The filename of a gff file

exclude_types : set

The names of types (e.g. ‘repeat’) that you don’t want to store

Raises:

IOError

If the file isn’t correctly formatted

genomfart.parsers.vcf module

class genomfart.parsers.vcf.VCF_parser(vcf_file)[source]

Parser for VCF files

Methods

get_affected_ref_bases(vcf_pos, ref_allele, ...) Gets the reference positions that are modified through the alternative allele, either through substitution or deletion.
get_nw_aligned_alleles(ref_allele, alt_allele) Aligns alleles using the Needleman-Wunsch global alignment algorithm
get_substituted_ref_bases(vcf_pos, ...) Gets the reference positions that are modified through the alternative allele through substitution.
get_substituted_ref_bases_nw(vcf_pos, ...[, ...]) Gets the reference positions that are modified through the alternative allele through substitution.
parse_geno_depths() Iterates through the VCF file, getting the genotypes at each position
parse_select_geno_depths(genos[, info_dict, ...]) Iterates through the VCF file, getting selected genotypes at each position.
parse_select_geno_generic(genos[, ...]) Iterates through the VCF file, getting selected genotypes at each position.
parse_site_infos([filter_excludes, ...]) Iterates through the VCF file, getting the info for each site
__init__(vcf_file)[source]

Instantiates a parser for the VCF file

Parameters:

vcf_file : str

The path to a VCF file (May or may not be gzipped)
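
One way to accept a file that may or may not be gzipped is to check the two-byte gzip magic number before opening it. This is a sketch of that general technique, not necessarily how VCF_parser does it; `open_maybe_gzipped` is a hypothetical helper.

```python
import gzip


def open_maybe_gzipped(path):
    """Open a text file for reading, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":  # gzip files always start with these two bytes
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking the content rather than the `.gz` extension also handles compressed files that were renamed.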

static get_affected_ref_bases(vcf_pos, ref_allele, alt_allele)[source]

Gets the reference positions that are modified through the alternative allele, either through substitution or deletion. Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

Returns:

Set of positions that are either substituted or deleted in the alternative allele

Examples

>>> VCF_parser.get_affected_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_affected_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_affected_ref_bases(20,'TCG','T')
set([21, 22])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCG')
set([24, 23])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCGCGCG')
set([])
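
The examples above are consistent with a simple left-anchored comparison of the two alleles. The sketch below is a simplified stand-in, not the library's implementation: `affected_ref_positions` is hypothetical, and it always places any length difference at the right end of the allele, so it may disagree with `get_affected_ref_bases` on inputs not shown here.

```python
def affected_ref_positions(vcf_pos, ref_allele, alt_allele):
    """Reference positions substituted or deleted, comparing left-anchored."""
    affected = set()
    for offset, ref_base in enumerate(ref_allele):
        if offset >= len(alt_allele):
            # The ref base has no alt counterpart: treat it as deleted.
            affected.add(vcf_pos + offset)
        elif ref_base != alt_allele[offset]:
            # The bases differ at this offset: treat it as substituted.
            affected.add(vcf_pos + offset)
    return affected


deleted = affected_ref_positions(20, "TCG", "T")
```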
static get_nw_aligned_alleles(ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)[source]

Aligns alleles using the Needleman-Wunsch global alignment algorithm

Note that this will only return 1 of the alignments with the given score. This also assumes that the VCF always has the first bases of the allele aligned.

Parameters:

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

match : number

Score for matching a base

mismatch : number

Score for mismatching a base

gapopen : number

Score for opening a gap

gapextend : number

Score for extending a gap

Returns:

aligned_seq1, aligned_seq2, score

Examples

>>> VCF_parser.get_nw_aligned_alleles('C','T')
('C', 'T', -2)
>>> VCF_parser.get_nw_aligned_alleles('C','CTAG')
('C---', 'CTAG', -5)
>>> VCF_parser.get_nw_aligned_alleles('TCG','T')
('TCG', 'T--', -4)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCG')
('TCGCG', 'TC--G', -2.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGGCGCG')
('TCG---CG', 'TCGGCGCG', -1.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGCGGCG')
('TCGCG---', 'TCGCGGCG', -1.0)
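
The scores in these examples are consistent with affine gap penalties in which the first gapped base costs `gapopen` and each further base in the same gap costs `gapextend`. The sketch below computes only the optimal global score under that assumption, using the standard three-matrix (Gotoh) recurrence. `nw_affine_score` is a hypothetical function, not the library's code, and it does no traceback.

```python
def nw_affine_score(ref, alt, match=1, mismatch=-2, gapopen=-4, gapextend=-1):
    """Optimal global alignment score with affine gap penalties."""
    neg_inf = float("-inf")
    n, m = len(ref), len(alt)
    # M: alignment ends with ref[i-1] paired with alt[j-1].
    # X: alignment ends with ref[i-1] paired with a gap in alt.
    # Y: alignment ends with alt[j-1] paired with a gap in ref.
    M = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    X = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    Y = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gapopen + (i - 1) * gapextend
    for j in range(1, m + 1):
        Y[0][j] = gapopen + (j - 1) * gapextend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if ref[i - 1] == alt[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + pair
            X[i][j] = max(
                M[i - 1][j] + gapopen,    # open a new gap in alt
                X[i - 1][j] + gapextend,  # extend the current gap in alt
                Y[i - 1][j] + gapopen,    # switch from a gap in ref
            )
            Y[i][j] = max(
                M[i][j - 1] + gapopen,    # open a new gap in ref
                Y[i][j - 1] + gapextend,  # extend the current gap in ref
                X[i][j - 1] + gapopen,    # switch from a gap in alt
            )
    return max(M[n][m], X[n][m], Y[n][m])


score = nw_affine_score("TCGCG", "TCGGCGCG")
```

With the default scores this gives five matches and one gap of three bases, 5 - 4 - 1 - 1 = -1, matching the score in the example above.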
static get_substituted_ref_bases(vcf_pos, ref_allele, alt_allele)[source]

Gets the reference positions that are modified through the alternative allele through substitution.

Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

Returns:

Set of positions that are substituted in the alternative allele

Examples

>>> VCF_parser.get_substituted_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGGCGCG')
set([24, 23])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGCGGCG')
set([])
static get_substituted_ref_bases_nw(vcf_pos, ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)[source]

Gets the reference positions that are modified through the alternative allele through substitution. This is accomplished by first running a Needleman-Wunsch alignment of the 2 alleles and then finding the substitutions

Note that this assumes that the first bases of the ref and alt alleles line up

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

match : number

Score for matching a base

mismatch : number

Score for mismatching a base

gapopen : number

Score for opening a gap

gapextend : number

Score for extending a gap

Returns:

Set of positions that are substituted in the alternative allele

Examples

>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGGCGCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGCGGCG')
set([])
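
Once the two alleles are aligned, finding substitutions amounts to walking the alignment column by column. The sketch below assumes gaps are written as `-`, as in the `get_nw_aligned_alleles` examples; `substituted_from_alignment` is a hypothetical helper, not part of the library.

```python
def substituted_from_alignment(vcf_pos, aligned_ref, aligned_alt, gap="-"):
    """Reference positions where both alleles have a base and the bases differ."""
    substituted = set()
    ref_pos = vcf_pos
    for ref_base, alt_base in zip(aligned_ref, aligned_alt):
        if ref_base == gap:
            continue  # insertion in alt: no reference base is consumed
        if alt_base != gap and alt_base != ref_base:
            substituted.add(ref_pos)
        ref_pos += 1  # deletions and matches still advance along the reference
    return substituted


no_subs = substituted_from_alignment(20, "TCG---CG", "TCGGCGCG")
```

Because the gap columns are skipped, an insertion shifts no reference positions and the result here is the empty set.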
parse_geno_depths()[source]

Iterates through the VCF file, getting the genotypes at each position

Returns:

A generator that generates tuples of chrom,pos,(ref,alt1,alt2,...),{sample->(base_depths)}

parse_select_geno_depths(genos, info_dict=False, use_chrom=None, start=None, end=None)[source]

Iterates through the VCF file, getting selected genotypes at each position. Note that this assumes samples contain depths

Parameters:

genos : list

The names of the genotypes you want

info_dict : boolean

Whether you want the info dict on the end of the return

use_chrom : str, optional

Optional chromosome on which to start the scan

start : int, optional

Optional place to start the scan. If chrom is also specified, it will be within the chromosome. Otherwise, it will be within the first chromosome to have at least the start point

end : int, optional

Optional place to end the scan. (Nested within chrom if specified)

Returns:

A generator that generates tuples of chrom,pos,(ref,alt1,alt2,...),{sample->(base_depths)},<info_dict if desired>

parse_select_geno_generic(genos, info_dict=False, use_chrom=None, start=None, end=None, filter_excludes=None, filter_requires=None)[source]

Iterates through the VCF file, getting selected genotypes at each position. This makes no assumptions about the format of the sample information for each genotype

Parameters:

genos : list

The names of the genotypes you want

info_dict : boolean

Whether you want the info dict on the end of the return

use_chrom : str, optional

Optional chromosome on which to start the scan

start : int, optional

Optional place to start the scan

end : int, optional

Optional place to end the scan

filter_excludes : set, optional

Filter tags that should exclude the locus from being returned

filter_requires : set, optional

Filter tags that should be required for a locus to be returned

Returns:

A generator that generates tuples of chrom,pos,(ref,alt1,alt2,...),{sample->{prefix->val}},<info_dict if desired>
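
The shape of the returned tuple can be illustrated by splitting a single VCF data line by hand. This sketch relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, then samples). `parse_vcf_record` and the sample names are hypothetical, and values are left as strings rather than converted the way the library might.

```python
def parse_vcf_record(line, sample_names, wanted):
    """Split one VCF data line into chrom, pos, alleles, and per-sample fields.

    sample_names lists the sample columns in header order; wanted is the
    set of sample names to keep.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
    alleles = (ref,) + tuple(alt.split(","))
    keys = fields[8].split(":")  # FORMAT column, e.g. GT:AD
    samples = {}
    for name, column in zip(sample_names, fields[9:]):
        if name in wanted:
            samples[name] = dict(zip(keys, column.split(":")))
    return chrom, pos, alleles, samples


# Made-up record with two samples, keeping only the second.
line = "1\t100\t.\tA\tC,T\t50\tPASS\tDP=9\tGT:AD\t0/1:5,4,0\t1/1:0,7,0\n"
record = parse_vcf_record(line, ["s1", "s2"], {"s2"})
```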

parse_site_infos(filter_excludes=None, filter_requires=None)[source]

Iterates through the VCF file, getting the info for each site

Parameters:

filter_excludes : set, optional

Filter tags that should exclude the locus from being returned

filter_requires : set, optional

Filter tags that should be required for a locus to be returned

Returns:

A generator that generates tuples of chrom,pos,(ref,alt1,alt2,...),{field->val}.

Fields without a corresponding value will have value “None”
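
The INFO handling and the two filter arguments can be sketched as follows. `parse_info` and `passes_filters` are hypothetical helpers; in particular, treating `filter_requires` as "all of these tags must be present" is an assumption, since the documentation does not say whether all or any are required.

```python
def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flags without a value map to None."""
    info = {}
    if info_field == ".":
        return info  # "." means no INFO entries
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else None
    return info


def passes_filters(filter_field, filter_excludes=None, filter_requires=None):
    """Decide whether a site should be returned, given its FILTER column."""
    tags = set() if filter_field == "." else set(filter_field.split(";"))
    if filter_excludes and tags & set(filter_excludes):
        return False  # at least one excluded tag is present
    if filter_requires and not set(filter_requires) <= tags:
        return False  # assumption: every required tag must be present
    return True


info = parse_info("DP=9;DB")
```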

Module contents