genomfart.parsers package¶

Submodules¶

genomfart.parsers.AGPmap module¶

class genomfart.parsers.AGPmap.AGPMap(mapFile, useAgpV2=False)[source]¶

Class used to parse and manipulate data from an AGPmap with cM positions.

The map can be in one of two formats, both tab-delimited.

The first has the chromosome in column 1, the marker in column 3, the cM position in column 4, and the AGP position in column 5.

The second has the marker in column 1, the chromosome in column 2, the cM position in column 3, and the AGP position in column 6.
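As a rough illustration of the two column layouts described above, the sketch below pulls the four fields out of one tab-delimited line. The function name `parse_map_line` and the `second_layout` flag are hypothetical; this page does not state which layout `useAgpV2=True` selects, nor the field types, so returning the chromosome as a string and the positions as numbers is an assumption.

```python
def parse_map_line(line, second_layout=False):
    """Split one tab-delimited map line into (chromosome, marker, cM, AGP position).

    Hypothetical helper, not part of genomfart. Column numbers follow the
    two layouts described in the class docstring (1-based in the text,
    0-based here).
    """
    fields = line.rstrip("\n").split("\t")
    if second_layout:
        # marker 1st, chromosome 2nd, cM 3rd, AGP position 6th
        marker, chrom, cm, agp = fields[0], fields[1], fields[2], fields[5]
    else:
        # chromosome 1st, marker 3rd, cM 4th, AGP position 5th
        chrom, marker, cm, agp = fields[0], fields[2], fields[3], fields[4]
    return chrom, marker, float(cm), int(agp)
```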

Methods

`getCmFromPosition`(chromosome, position)	Gets cM position from chromosome bp
`getFirstGeneticPosition`(chromosome)	Gets the first genetic position on a chromosome
`getFirstMarkerName`(chromosome)	Gets the name of the first marker
`getFlankingMarkerIndices`(chromosome, ...)	Gets the indices of the markers flanking a given genetic position
`getGeneticPos`(markerIndex)	Gets the genetic position of a marker
`getInterval`(chromosome, position)	Gets the markers, position of marker in bp flanking a position
`getLastGeneticPosition`(chromosome)	Gets the last genetic position on a chromosome
`getLastMarkerName`(chromosome)	Gets the name of the last marker
`getMarkerNumber`(marker_name)	Gets the number of the marker
`getPhysPos`(markerIndex)	Gets the physical position of a marker
`getPositionFromCm`(chromosome, cM)	Gets chromosome bp from cM position

__init__(mapFile, useAgpV2=False)[source]¶

Instantiates the AGPMap parser

Parameters:

mapFile : str

The filename for the map

useAgpV2 : boolean

Whether this is version 2 of the AGPmap format

getCmFromPosition(chromosome, position)[source]¶

Gets cM position from chromosome bp

Parameters:

chromosome : int

The chromosome

position : int

The position

Returns:

cM position
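This page does not say how the conversion is computed. A common approach for marker maps is linear interpolation between the two flanking markers, sketched below; whether `AGPMap` does exactly this is an assumption. The names `cm_from_position`, `phys`, and `gen` are hypothetical.

```python
from bisect import bisect_right

def cm_from_position(phys, gen, position):
    """Linearly interpolate a cM value for a bp position.

    phys: marker bp positions on one chromosome, sorted ascending
    gen:  matching cM positions, same length as phys
    Positions outside the map are clamped to the first or last marker
    (an assumption; extrapolation is another option).
    """
    if position <= phys[0]:
        return gen[0]
    if position >= phys[-1]:
        return gen[-1]
    right = bisect_right(phys, position)   # first marker strictly to the right
    left = right - 1                       # last marker at or before position
    frac = (position - phys[left]) / (phys[right] - phys[left])
    return gen[left] + frac * (gen[right] - gen[left])
```

The same flanking-marker lookup, run on the cM list instead of the bp list, would give the inverse conversion.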

getFirstGeneticPosition(chromosome)[source]¶

Gets the first genetic position on a chromosome

Parameters:

chromosome : int

The chromosome

Returns:

The first genetic position (in cM)

getFirstMarkerName(chromosome)[source]¶

Gets the name of the first marker

Parameters:

chromosome : int

The chromosome to get the first marker from

Returns:

The name of the first marker

getFlankingMarkerIndices(chromosome, geneticPosition)[source]¶

Gets the indices of the markers flanking a given genetic position

Parameters:

chromosome : int

The chromosome

geneticPosition : float

genetic position in cM

Returns:

left flank index, right flank index

getGeneticPos(markerIndex)[source]¶

Gets the genetic position of a marker

Parameters:

markerIndex : int

Index of the marker

Returns:

Genetic position of the marker

getInterval(chromosome, position)[source]¶

Gets the markers flanking a position on the chromosome, along with each marker's position in bp

Parameters:

chromosome : int

The chromosome

position : int

The position

Returns:

(left marker, left marker position, right marker, right marker position)

getLastGeneticPosition(chromosome)[source]¶

Gets the last genetic position on a chromosome

Parameters:

chromosome : int

The chromosome

Returns:

The last genetic position (in cM)

getLastMarkerName(chromosome)[source]¶

Gets the name of the last marker

Parameters:

chromosome : int

The chromosome to get the last marker from

Returns:

The name of the last marker

getMarkerNumber(marker_name)[source]¶

Gets the number of the marker

Parameters:

marker_name : str

The name of the marker

Returns:

Number of the marker

getPhysPos(markerIndex)[source]¶

Gets the physical position of a marker

Parameters:

markerIndex : int

Index of the marker

Returns:

Physical position of marker

getPositionFromCm(chromosome, cM)[source]¶

Gets chromosome bp from cM position

Parameters:

chromosome : int

The chromosome

cM : float

The position in cM

Returns:

The chromosome bp position

genomfart.parsers.SNPdata module¶

class genomfart.parsers.SNPdata.SNPdata(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)[source]¶

Methods

`findTotalSnpNumber`()	Finds the total number of SNPs in the file
`getAllele`()	Gets the allele configuration
`getGenotype`([skip_ref])	Gets the genotype of the founders at the current SNP
`getNumberOfSnps`()	Gets the number of SNPs in the file
`getPosition`()	Gets the current position on the chromosome
`next`()	Reads the next line of the SNPs file
`reset`()	Resets the reader to the first data line

__init__(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)[source]¶

Instantiates a reader of SNP data on founders

Parameters:

chromosome : int

The chromosome

snp_file : str

Name of file containing SNP data for the founders

ref_samp : str

The name of the reference sample (one of the columns, the one containing the allele state for 0)

findTotalSnpNumber()[source]¶

Finds the total number of SNPs in the file

getAllele()[source]¶

Gets the allele configuration

Returns:	The allele configuration

getGenotype(skip_ref=True)[source]¶

Gets the genotype of the founders at the current SNP

Parameters:

skip_ref : boolean

Whether to skip the reference sample

Returns:

Array of genotypes for each of the founders

getNumberOfSnps()[source]¶

Gets the number of SNPs in the file

Returns:	The number of SNPs in the file

getPosition()[source]¶

Gets the current position on the chromosome

Returns:	Position on the chromosome

next()[source]¶

Reads the next line of the SNPs file

reset()[source]¶

Resets the reader to the first data line

genomfart.parsers.gff module¶

class genomfart.parsers.gff.gff_parser(gff_file, exclude_types=None)[source]¶

Bases: genomfart.utils.genomeAnnotationGraph.genomeAnnotationGraph

Class used to parse and analyze GFF (version 3) files. The class represents any hierarchical structure as a directed graph and puts the individual pieces into a RangeBucketMap

All coordinates are 1-based

Examples

>>> from genomfart.parsers.gff import gff_parser
>>> from genomfart.data.data_constants import GFF_TEST_FILE
>>> parser = gff_parser(GFF_TEST_FILE)
>>> parser.get_overlapping_element_ids('Pt',100,4000)
set(['three_prime_UTR:Pt_3363_4490:-', 'exon:Pt_1674_3308:-',
'CDS:GRMZM5G836994_P01', 'exon:Pt_3363_4708:-',
'transcript:GRMZM5G811749_T01', 'transcript:GRMZM5G836994_T01',
'repeat_region:Pt_3550_3560:?', 'repeat_region:Pt_3683_3696:?',
'gene:GRMZM5G811749', 'gene:GRMZM5G836994', 'repeat_region:Pt_320_1262:+',
'repeat_region:Pt_3764_3775:?'])
>>> gene_info = parser.get_element_info('gene:GRMZM5G811749')
>>> gene_info
{'Ranges': [[3363 , 5604]], 'attributes': [{'logic_name': 'genebuilder',
'external_name': 'RPS16',
'description': '30S ribosomal protein S16%2C chloroplastic  [Source:UniProtKB/Swiss-Prot%3BAcc:P27723]', 'ID': 'gene:GRMZM5G811749',
'biotype': 'protein_coding'}], 'seqid': 'Pt', 'type': 'gene', 'strand': '-'}
>>> [x for x in parser.get_element_ids_of_type('Pt','gene',start=100,end=4000)]
['gene:GRMZM5G836994', 'gene:GRMZM5G811749']

Methods

`add_annotation`(element_id, seqid, start, ...)	Adds an annotation to the genome
`add_node_annotations`(element_id, **annots)	Adds annotations to all annotation dictionaries under a node.
`get_aa_indices`(seqid, pos[, cds_type])	Gets the indices (base-1) of the amino acid position in any
`get_cds_indices`(seqid, pos[, cds_type])	Gets the indices (base-1) of the nucleotide position in any
`get_closest_element_id`(seqid, rangeStart, ...)	Gets the element id(s) of whichever element is closest to a range
`get_closest_element_id_of_type`(seqid, ...[, ...])	Gets the element id(s) of whichever element is closest to a range
`get_codon_position`(seqid, pos[, cds_type])	Gets the indices (base-1) of the codon position (i.e.
`get_element_children_ids`(element_id)	Gets the ids of the children of an element
`get_element_ids_of_type`(seqid, element_type)	Gets element ids of some type along a coordinate system
`get_element_info`(element_id)	Gets information on a particular element
`get_element_parent_ids`(element_id)	Gets the ids of the parents of an element
`get_overlapping_element_ids`(seqid, start, end)	Gets the ids for any elements that overlap a given range
`get_overlapping_element_ids_of_type`(seqid, ...)	Gets the ids for any elements that overlap a given range and are
`overlaps_type`(seqid, start, end, element_type)	Returns True if the given range overlaps an element of the given type

__init__(gff_file, exclude_types=None)[source]¶

Instantiates the GFF parser

Parameters:

gff_file : str

The filename of a gff file

exclude_types : set

The names of types (e.g. ‘repeat’) that you don’t want to store

Raises:

IOError

If the file isn’t correctly formatted
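For orientation, a GFF3 data line has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with attributes written as semicolon-separated `key=value` pairs. The sketch below parses one such line and skips excluded types, loosely mirroring the `exclude_types` parameter. `parse_gff3_line` is a hypothetical helper and is not how `gff_parser` is implemented; it leaves percent-encoded values such as `%2C` undecoded, as in the example output above.

```python
def parse_gff3_line(line, exclude_types=None):
    """Parse one GFF3 data line into a dict, or return None if skipped."""
    if not line.strip() or line.startswith("#"):
        return None  # blank line, comment, or directive
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise IOError("expected 9 tab-separated columns, got %d" % len(cols))
    seqid, _source, ftype, start, end, _score, strand, _phase, attr_text = cols
    if exclude_types and ftype in exclude_types:
        return None
    attributes = {}
    for pair in attr_text.split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqid": seqid,
        "type": ftype,
        "Ranges": [[int(start), int(end)]],  # 1-based, inclusive
        "strand": strand,
        "attributes": [attributes],
    }
```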

genomfart.parsers.vcf module¶

class genomfart.parsers.vcf.VCF_parser(vcf_file)[source]¶

Parser for VCF files

Methods

`get_affected_ref_bases`(vcf_pos, ref_allele, ...)	Gets the reference positions that are modified through the alternative allele, either through substitution or deletion.
`get_nw_aligned_alleles`(ref_allele, alt_allele)	Aligns alleles using the Needleman-Wunsch global alignment algorithm
`get_substituted_ref_bases`(vcf_pos, ...)	Gets the reference positions that are modified through the alternative allele through substitution.
`get_substituted_ref_bases_nw`(vcf_pos, ...[, ...])	Gets the reference positions that are modified through the alternative allele through substitution.
`parse_geno_depths`()	Iterates through the VCF file, getting the genotypes at each position
`parse_select_geno_depths`(genos[, info_dict, ...])	Iterates through the VCF file, getting selected genotypes at each position.
`parse_select_geno_generic`(genos[, ...])	Iterates through the VCF file, getting selected genotypes at each position.
`parse_site_infos`([filter_excludes, ...])	Iterates through the VCF file, getting the info for each site

__init__(vcf_file)[source]¶

Instantiates a parser for the VCF file

Parameters:

vcf_file : str

The path to a VCF file (May or may not be gzipped)
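One way to accept both plain and gzipped input is to check for the two-byte gzip magic number before opening the file. The sketch below shows that approach using only the standard library; `open_maybe_gzipped` is a hypothetical helper, and the library may detect compression differently (for example by file extension).

```python
import gzip

def open_maybe_gzipped(path):
    """Open a text file for reading, transparently handling gzip."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":          # gzip magic number
        return gzip.open(path, "rt")  # text mode, decompressed on the fly
    return open(path, "r")
```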

static get_affected_ref_bases(vcf_pos, ref_allele, alt_allele)[source]¶

Gets the reference positions that are modified through the alternative allele, either through substitution or deletion. Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

Returns:

Set of positions that are either substituted or deleted in the alternative allele

Examples

>>> VCF_parser.get_affected_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_affected_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_affected_ref_bases(20,'TCG','T')
set([21, 22])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCG')
set([24, 23])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCGCGCG')
set([])
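The examples above are consistent with a simple left-aligned, base-by-base comparison of the two alleles. The sketch below is a simplified stand-in under that assumption (`affected_ref_bases` is a hypothetical name), not the library's implementation, which may place the indel differently for other inputs.

```python
def affected_ref_bases(vcf_pos, ref_allele, alt_allele):
    """Return reference positions substituted or deleted by the alt allele.

    Simplified sketch: alleles are compared left-aligned, base by base.
    Overlapping bases that differ count as substitutions; reference bases
    beyond the end of the alt allele count as deletions.
    """
    affected = set()
    for offset, ref_base in enumerate(ref_allele):
        if offset >= len(alt_allele):
            affected.add(vcf_pos + offset)      # deleted
        elif ref_base != alt_allele[offset]:
            affected.add(vcf_pos + offset)      # substituted
    return affected
```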

static get_nw_aligned_alleles(ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)[source]¶

Aligns alleles using the Needleman-Wunsch global alignment algorithm

Note that this will only return 1 of the alignments with the given score. This also assumes that the VCF always has the first bases of the allele aligned.

Parameters:

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

match : number

Score for matching a base

mismatch : number

Score for mismatching a base

gapopen : number

Score for opening a gap

gapextend : number

Score for extending a gap

Returns:	aligned_seq1, aligned_seq2, score

Examples

>>> VCF_parser.get_nw_aligned_alleles('C','T')
('C', 'T', -2)
>>> VCF_parser.get_nw_aligned_alleles('C','CTAG')
('C---', 'CTAG', -5)
>>> VCF_parser.get_nw_aligned_alleles('TCG','T')
('TCG', 'T--', -4)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCG')
('TCGCG', 'TC--G', -2.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGGCGCG')
('TCG---CG', 'TCGGCGCG', -1.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGCGGCG')
('TCGCG---', 'TCGCGGCG', -1.0)
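The scores in these examples are consistent with an affine gap penalty in which a gap of length n costs `gapopen + (n - 1) * gapextend` (for example, `'C'` against `'CTAG'` scores 1 - 4 - 1 - 1 = -5). The sketch below computes only the optimal score with a three-matrix (Gotoh-style) recurrence under that assumption. It omits the traceback that would recover the aligned strings, and `nw_affine_score` is a hypothetical name rather than the library's code.

```python
def nw_affine_score(ref_allele, alt_allele, match=1, mismatch=-2,
                    gapopen=-4, gapextend=-1):
    """Best global alignment score with affine gaps (no traceback).

    A gap of length n costs gapopen + (n - 1) * gapextend.
    """
    neg_inf = float("-inf")
    n, m = len(ref_allele), len(alt_allele)
    # Scores for aligning ref[:i] with alt[:j], split by the last column type
    aligned = [[neg_inf] * (m + 1) for _ in range(n + 1)]   # base vs base
    gap_alt = [[neg_inf] * (m + 1) for _ in range(n + 1)]   # ref base vs gap
    gap_ref = [[neg_inf] * (m + 1) for _ in range(n + 1)]   # gap vs alt base
    aligned[0][0] = 0
    for i in range(1, n + 1):
        gap_alt[i][0] = gapopen + (i - 1) * gapextend
    for j in range(1, m + 1):
        gap_ref[0][j] = gapopen + (j - 1) * gapextend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if ref_allele[i - 1] == alt_allele[j - 1] else mismatch
            aligned[i][j] = pair + max(aligned[i - 1][j - 1],
                                       gap_alt[i - 1][j - 1],
                                       gap_ref[i - 1][j - 1])
            # Opening a gap pays gapopen; continuing one pays gapextend
            gap_alt[i][j] = max(aligned[i - 1][j] + gapopen,
                                gap_ref[i - 1][j] + gapopen,
                                gap_alt[i - 1][j] + gapextend)
            gap_ref[i][j] = max(aligned[i][j - 1] + gapopen,
                                gap_alt[i][j - 1] + gapopen,
                                gap_ref[i][j - 1] + gapextend)
    return max(aligned[n][m], gap_alt[n][m], gap_ref[n][m])
```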

static get_substituted_ref_bases(vcf_pos, ref_allele, alt_allele)[source]¶

Gets the reference positions that are modified through the alternative allele through substitution.

Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

Returns:

Set of positions that are substituted in the alternative allele

Examples

>>> VCF_parser.get_substituted_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGGCGCG')
set([24, 23])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGCGGCG')
set([])

static get_substituted_ref_bases_nw(vcf_pos, ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)[source]¶

Gets the reference positions that are modified through the alternative allele through substitution. This is accomplished by first running a Needleman-Wunsch alignment of the 2 alleles and then finding the substitutions

Note that this assumes that the first bases of the ref and alt alleles line up

Parameters:

vcf_pos : int

The position given for this variant in the VCF file

ref_allele : str

The reference allele given by the VCF file

alt_allele : str

The alternative allele given by the VCF file

match : number

Score for matching a base

mismatch : number

Score for mismatching a base

gapopen : number

Score for opening a gap

gapextend : number

Score for extending a gap

Returns:

Set of positions that are substituted in the alternative allele

Examples

>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGGCGCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGCGGCG')
set([])
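The second step described above, finding substitutions once the alleles are aligned, can be sketched as a walk over alignment columns. The helper below is hypothetical and assumes gaps are written as `-`, as in the `get_nw_aligned_alleles` examples; it takes an already-aligned pair rather than running the alignment itself.

```python
def substituted_from_alignment(vcf_pos, aligned_ref, aligned_alt, gap="-"):
    """Collect reference positions where aligned bases differ (no gaps)."""
    substituted = set()
    ref_pos = vcf_pos
    for ref_base, alt_base in zip(aligned_ref, aligned_alt):
        if ref_base == gap:
            continue                      # insertion: consumes no ref base
        if alt_base != gap and alt_base != ref_base:
            substituted.add(ref_pos)      # mismatch column
        ref_pos += 1                      # every non-gap ref base advances
    return substituted
```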

parse_geno_depths()[source]¶

Iterates through the VCF file, getting the genotypes at each position

Returns:

A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> (base_depths)})

parse_select_geno_depths(genos, info_dict=False, use_chrom=None, start=None, end=None)[source]¶

Iterates through the VCF file, getting selected genotypes at each position. Note that this assumes samples contain depths

Parameters:

genos : list

The names of the genotypes you want

info_dict : boolean

Whether you want the info dict on the end of the return

use_chrom : str, optional

Optional chromosome on which to start the scan

start : int, optional

Optional place to start the scan. If use_chrom is also specified, it will be within that chromosome. Otherwise, it will be within the first chromosome to have at least the start point

end : int, optional

Optional place to end the scan. (Nested within chrom if specified)

Returns:

A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> base_depths}, <info_dict if desired>)

parse_select_geno_generic(genos, info_dict=False, use_chrom=None, start=None, end=None, filter_excludes=None, filter_requires=None)[source]¶

Iterates through the VCF file, getting selected genotypes at each position. This makes no assumptions about the format of the sample information for each genotype

Parameters:

genos : list

The names of the genotypes you want

info_dict : boolean

Whether you want the info dict on the end of the return

use_chrom : str, optional

Optional chromosome on which to start the scan

start : int, optional

Optional place to start the scan

end : int, optional

Optional place to end the scan

filter_excludes : set, optional

Filter tags that should exclude the locus from being returned

filter_requires : set, optional

Filter tags that should be required for a locus to be returned

Returns:

A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> {prefix -> val}}, <info_dict if desired>)

parse_site_infos(filter_excludes=None, filter_requires=None)[source]¶

Iterates through the VCF file, getting the info for each site

Parameters:

filter_excludes : set, optional

Filter tags that should exclude the locus from being returned

filter_requires : set, optional

Filter tags that should be required for a locus to be returned

Returns:

A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {field -> val}).

Fields without a corresponding value will have the value None
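In a VCF file the INFO column is a semicolon-separated list of `key=value` entries, and flag entries carry no value. The sketch below shows one way to build a field-to-value mapping in which valueless fields map to `None`, as described above. `parse_info_field` is a hypothetical helper; values are left as strings, and whether the library converts types is not stated here.

```python
def parse_info_field(info_text):
    """Turn a VCF INFO string into a dict; flags map to None."""
    info = {}
    if info_text in ("", "."):      # '.' marks a missing INFO column
        return info
    for entry in info_text.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        info[key] = value if sep else None
    return info
```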
