genomfart.parsers package
Submodules
genomfart.parsers.AGPmap module
-
class genomfart.parsers.AGPmap.AGPMap(mapFile, useAgpV2=False)
Class used to parse and manipulate data from an AGPmap with cM positions.
The map can be in one of two tab-delimited formats:
The first has the chromosome in column 1, the marker in column 3, the cM position in column 4, and the AGP position in column 5.
The second has the marker in column 1, the chromosome in column 2, the cM position in column 3, and the AGP position in column 6.
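The two column layouts can be made concrete with a small sketch. It is not the AGPMap implementation: `MapRecord`, `parse_map_line`, and the `second_layout` flag are hypothetical names, and the sketch assumes every line carries the columns named above and that there is no header row.

```python
from typing import NamedTuple


class MapRecord(NamedTuple):
    """One marker row. Hypothetical container, not a genomfart type."""
    chromosome: str
    marker: str
    cm: float
    agp_pos: int


def parse_map_line(line: str, second_layout: bool = False) -> MapRecord:
    """Split one tab-delimited map line using the layouts described above."""
    fields = line.rstrip("\n").split("\t")
    if second_layout:
        # Marker 1st, chromosome 2nd, cM 3rd, AGP position 6th.
        marker, chromosome, cm, agp_pos = fields[0], fields[1], fields[2], fields[5]
    else:
        # Chromosome 1st, marker 3rd, cM 4th, AGP position 5th.
        chromosome, marker, cm, agp_pos = fields[0], fields[2], fields[3], fields[4]
    return MapRecord(chromosome, marker, float(cm), int(agp_pos))
```

Whether `useAgpV2=True` corresponds to the second layout is not stated on this page, so the sketch keeps its own flag name.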
Methods
getCmFromPosition(chromosome, position): Gets cM position from chromosome bp
getFirstGeneticPosition(chromosome): Gets the first genetic position on a chromosome
getFirstMarkerName(chromosome): Gets the name of the first marker
getFlankingMarkerIndices(chromosome, ...): Gets the indices of the markers flanking a given genetic position
getGeneticPos(markerIndex): Gets the genetic position of a marker
getInterval(chromosome, position): Gets the markers, position of marker in bp flanking a position
getLastGeneticPosition(chromosome): Gets the last genetic position on a chromosome
getLastMarkerName(chromosome): Gets the name of the last marker
getMarkerNumber(marker_name): Gets the number of the marker
getPhysPos(markerIndex): Gets the physical position of a marker
getPositionFromCm(chromosome, cM): Gets chromosome bp from cM position
-
__init__(mapFile, useAgpV2=False)
Instantiates the AGPMap parser
Parameters: mapFile : str
The filename for the map
useAgpV2 : boolean
Whether this is version 2 of the AGPmap format
-
getCmFromPosition(chromosome, position)
Gets the cM position for a base-pair (bp) position on a chromosome
Parameters: chromosome : int
The chromosome
position : int
The position in bp
Returns: cM position
-
getFirstGeneticPosition(chromosome)
Gets the first genetic position on a chromosome
Parameters: chromosome : int
The chromosome
Returns: The first genetic position (in cM)
-
getFirstMarkerName(chromosome)
Gets the name of the first marker
Parameters: chromosome : int
The chromosome to get the first marker from
Returns: The name of the first marker
-
getFlankingMarkerIndices(chromosome, geneticPosition)
Gets the indices of the markers flanking a given genetic position
Parameters: chromosome : int
The chromosome
geneticPosition : float
genetic position in cM
Returns: left flank index, right flank index
-
getGeneticPos(markerIndex)
Gets the genetic position of a marker
Parameters: markerIndex : int
Index of the marker
Returns: Genetic position of the marker
-
getInterval(chromosome, position)
Gets the markers flanking a position on the chromosome, along with each marker's position in bp
Parameters: chromosome : int
The chromosome
position : int
The position in bp
Returns: (left marker, left marker position, right marker, right marker position)
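As an illustration of how a pair of flanking markers relates bp and cM positions, the sketch below interpolates linearly between a left and a right marker. This page does not say how AGPMap converts between the two coordinate systems, so linear interpolation is an assumption here, and `interpolate_cm` and its parameter names are hypothetical.

```python
def interpolate_cm(position, left_bp, left_cm, right_bp, right_cm):
    """Estimate a cM value for `position` from two flanking markers.

    Assumes a constant recombination rate between the two markers,
    which is a modelling choice and not something this page documents.
    """
    if right_bp == left_bp:
        # Both flanks sit at the same physical position; nothing to interpolate.
        return left_cm
    fraction = (position - left_bp) / (right_bp - left_bp)
    return left_cm + fraction * (right_cm - left_cm)
```

For example, a position halfway between markers at 1000 bp (10.0 cM) and 2000 bp (12.0 cM) would be placed at 11.0 cM under this assumption.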
-
getLastGeneticPosition(chromosome)
Gets the last genetic position on a chromosome
Parameters: chromosome : int
The chromosome
Returns: The last genetic position (in cM)
-
getLastMarkerName(chromosome)
Gets the name of the last marker
Parameters: chromosome : int
The chromosome to get the last marker from
Returns: The name of the last marker
-
getMarkerNumber(marker_name)
Gets the number of the marker
Parameters: marker_name : str
The name of the marker
Returns: Number of the marker
-
genomfart.parsers.SNPdata module
-
class genomfart.parsers.SNPdata.SNPdata(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)
Methods
findTotalSnpNumber(): Finds the total number of SNPs in the file
getAllele(): Gets the allele configuration
getGenotype([skip_ref]): Gets the genotype of the founders at the current SNP
getNumberOfSnps(): Gets the number of SNPs in the file
getPosition(): Gets the current position on the chromosome
next(): Reads the next line of the SNPs file
reset(): Resets the reader to the first data line
-
__init__(chromosome, snp_file, ref_samp, samp_start_col=11, totalSNPnumber=None)
Instantiates a reader of SNP data on founders
Parameters: chromosome : int
The chromosome
snp_file : str
Name of file containing SNP data for the founders
ref_samp : str
The name of the reference sample (the column whose allele state is coded as 0)
-
getGenotype(skip_ref=True)
Gets the genotype of the founders at the current SNP
Parameters: skip_ref : boolean
Whether to skip the reference sample
Returns: Array of genotypes for each of the founders
-
getNumberOfSnps()
Gets the number of SNPs in the file
Returns: The number of SNPs in the file
-
genomfart.parsers.gff module
-
class genomfart.parsers.gff.gff_parser(gff_file, exclude_types=None)
Bases: genomfart.utils.genomeAnnotationGraph.genomeAnnotationGraph
Class used to parse and analyze GFF (version 3) files. The class represents any hierarchical structure as a directed graph and puts the individual pieces into a RangeBucketMap
All coordinates are 1-based
Examples
>>> from genomfart.parsers.gff import gff_parser
>>> from genomfart.data.data_constants import GFF_TEST_FILE
>>> parser = gff_parser(GFF_TEST_FILE)
>>> parser.get_overlapping_element_ids('Pt',100,4000)
set(['three_prime_UTR:Pt_3363_4490:-', 'exon:Pt_1674_3308:-', 'CDS:GRMZM5G836994_P01', 'exon:Pt_3363_4708:-', 'transcript:GRMZM5G811749_T01', 'transcript:GRMZM5G836994_T01', 'repeat_region:Pt_3550_3560:?', 'repeat_region:Pt_3683_3696:?', 'gene:GRMZM5G811749', 'gene:GRMZM5G836994', 'repeat_region:Pt_320_1262:+', 'repeat_region:Pt_3764_3775:?'])
>>> gene_info = parser.get_element_info('gene:GRMZM5G811749')
>>> gene_info
{'Ranges': [[3363 , 5604]], 'attributes': [{'logic_name': 'genebuilder', 'external_name': 'RPS16', 'description': '30S ribosomal protein S16%2C chloroplastic [Source:UniProtKB/Swiss-Prot%3BAcc:P27723]', 'ID': 'gene:GRMZM5G811749', 'biotype': 'protein_coding'}], 'seqid': 'Pt', 'type': 'gene', 'strand': '-'}
>>> [x for x in parser.get_element_ids_of_type('Pt','gene',start=100,end=4000)]
['gene:GRMZM5G836994', 'gene:GRMZM5G811749']
Methods
add_annotation(element_id, seqid, start, ...): Adds an annotation to the genome
add_node_annotations(element_id, **annots): Adds annotations to all annotation dictionaries under a node.
get_aa_indices(seqid, pos[, cds_type]): Gets the indices (base-1) of the amino acid position in any ...
get_cds_indices(seqid, pos[, cds_type]): Gets the indices (base-1) of the nucleotide position in any ...
get_closest_element_id(seqid, rangeStart, ...): Gets the element id(s) of whichever element is closest to a range
get_closest_element_id_of_type(seqid, ...[, ...]): Gets the element id(s) of whichever element is closest to a range
get_codon_position(seqid, pos[, cds_type]): Gets the indices (base-1) of the codon position (i.e. ...
get_element_children_ids(element_id): Gets the ids of the children of an element
get_element_ids_of_type(seqid, element_type): Gets element ids of some type along a coordinate system
get_element_info(element_id): Gets information on a particular element
get_element_parent_ids(element_id): Gets the ids of the parents of an element
get_overlapping_element_ids(seqid, start, end): Gets the ids for any elements that overlap a given range
get_overlapping_element_ids_of_type(seqid, ...): Gets the ids for any elements that overlap a given range and are ...
overlaps_type(seqid, start, end, element_type): Returns True if the given range overlaps an element of the given type
genomfart.parsers.vcf module
-
class genomfart.parsers.vcf.VCF_parser(vcf_file)
Parser for VCF files
Methods
get_affected_ref_bases(vcf_pos, ref_allele, ...): Gets the reference positions that are modified through the alternative allele, either through substitution or deletion.
get_nw_aligned_alleles(ref_allele, alt_allele): Aligns alleles using the Needleman-Wunsch global alignment algorithm
get_substituted_ref_bases(vcf_pos, ...): Gets the reference positions that are modified through the alternative allele through substitution.
get_substituted_ref_bases_nw(vcf_pos, ...[, ...]): Gets the reference positions that are modified through the alternative allele through substitution.
parse_geno_depths(): Iterates through the VCF file, getting the genotypes at each position
parse_select_geno_depths(genos[, info_dict, ...]): Iterates through the VCF file, getting selected genotypes at each position.
parse_select_geno_generic(genos[, ...]): Iterates through the VCF file, getting selected genotypes at each position.
parse_site_infos([filter_excludes, ...]): Iterates through the VCF file, getting the info for each site
-
__init__(vcf_file)
Instantiates a parser for the VCF file
Parameters: vcf_file : str
The path to a VCF file (may or may not be gzipped)
-
static get_affected_ref_bases(vcf_pos, ref_allele, alt_allele)
Gets the reference positions that the alternative allele modifies, either by substitution or by deletion. Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.
Parameters: vcf_pos : int
The position given for this variant in the VCF file
ref_allele : str
The reference allele given by the VCF file
alt_allele : str
The alternative allele given by the VCF file
Returns: Set of positions that are either substituted or deleted in the alternative allele
Examples
>>> VCF_parser.get_affected_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_affected_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_affected_ref_bases(20,'TCG','T')
set([21, 22])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCG')
set([24, 23])
>>> VCF_parser.get_affected_ref_bases(20,'TCGCG','TCGCGCG')
set([])
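The documented examples can be reproduced with a position-by-position comparison under the stated assumptions (first bases aligned, at most one indel, which this sketch places at the end of the shorter allele). The function below is a hypothetical stand-in written for illustration. It is not the genomfart source and may differ from it on inputs not shown above.

```python
def affected_ref_bases(vcf_pos, ref_allele, alt_allele):
    """Return reference positions substituted or deleted by alt_allele.

    Assumption: the alleles are compared left to right from vcf_pos, and any
    reference bases beyond the end of the alt allele count as deleted.
    """
    shared = min(len(ref_allele), len(alt_allele))
    # Positions where both alleles have a base but the bases differ.
    substituted = {
        vcf_pos + i for i in range(shared) if ref_allele[i] != alt_allele[i]
    }
    # Reference bases with no counterpart in a shorter alt allele.
    deleted = {vcf_pos + i for i in range(shared, len(ref_allele))}
    return substituted | deleted
```

An insertion such as `'C'` to `'CTAG'` leaves every reference base intact, which is why the result is empty in that case.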
-
static get_nw_aligned_alleles(ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)
Aligns alleles using the Needleman-Wunsch global alignment algorithm
Note that this returns only one of the alignments with the given score. This also assumes that the VCF always has the first bases of the alleles aligned.
Parameters: ref_allele : str
The reference allele given by the VCF file
alt_allele : str
The alternative allele given by the VCF file
match : number
Score for matching a base
mismatch : number
Score for mismatching a base
gapopen : number
Score for opening a gap
gapextend : number
Score for extending a gap
Returns: aligned_seq1, aligned_seq2, score
Examples
>>> VCF_parser.get_nw_aligned_alleles('C','T')
('C', 'T', -2)
>>> VCF_parser.get_nw_aligned_alleles('C','CTAG')
('C---', 'CTAG', -5)
>>> VCF_parser.get_nw_aligned_alleles('TCG','T')
('TCG', 'T--', -4)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCG')
('TCGCG', 'TC--G', -2.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGGCGCG')
('TCG---CG', 'TCGGCGCG', -1.0)
>>> VCF_parser.get_nw_aligned_alleles('TCGCG','TCGCGGCG')
('TCGCG---', 'TCGCGGCG', -1.0)
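The scores in the examples above are consistent with an affine gap penalty, in which the first column of a gap costs `gapopen` and each further column of the same gap costs `gapextend`. The sketch below scores an already-aligned pair under that reading. `score_alignment` is a hypothetical helper, not part of genomfart, and it does not perform the alignment itself.

```python
def score_alignment(aligned_ref, aligned_alt,
                    match=1, mismatch=-2, gapopen=-4, gapextend=-1):
    """Score two gapped strings of equal length with affine gap penalties."""
    if len(aligned_ref) != len(aligned_alt):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    gap_side = None  # 'ref' or 'alt' while inside a gap, otherwise None
    for ref_base, alt_base in zip(aligned_ref, aligned_alt):
        if ref_base == "-" or alt_base == "-":
            side = "ref" if ref_base == "-" else "alt"
            # Continuing a gap on the same side extends it; otherwise open one.
            score += gapextend if gap_side == side else gapopen
            gap_side = side
        else:
            score += match if ref_base == alt_base else mismatch
            gap_side = None
    return score
```

For `('C---', 'CTAG')` this gives one match (1), one gap opening (-4), and two gap extensions (-1 each), for a total of -5, matching the documented output.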
-
static get_substituted_ref_bases(vcf_pos, ref_allele, alt_allele)
Gets the reference positions that the alternative allele modifies by substitution.
Note that this assumes that the first bases of the ref and alt alleles line up and that there is, at most, one indel in the alt_allele.
Parameters: vcf_pos : int
The position given for this variant in the VCF file
ref_allele : str
The reference allele given by the VCF file
alt_allele : str
The alternative allele given by the VCF file
Returns: Set of positions that are substituted in the alternative allele
Examples
>>> VCF_parser.get_substituted_ref_bases(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGGCGCG')
set([24, 23])
>>> VCF_parser.get_substituted_ref_bases(20,'TCGCG','TCGCGGCG')
set([])
-
static get_substituted_ref_bases_nw(vcf_pos, ref_allele, alt_allele, match=1, mismatch=-2, gapopen=-4, gapextend=-1)
Gets the reference positions that the alternative allele modifies by substitution. This is accomplished by first running a Needleman-Wunsch alignment of the 2 alleles and then finding the substitutions.
Note that this assumes that the first bases of the ref and alt alleles line up.
Parameters: vcf_pos : int
The position given for this variant in the VCF file
ref_allele : str
The reference allele given by the VCF file
alt_allele : str
The alternative allele given by the VCF file
match : number
Score for matching a base
mismatch : number
Score for mismatching a base
gapopen : number
Score for opening a gap
gapextend : number
Score for extending a gap
Returns: Set of positions that are substituted in the alternative allele
Examples
>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','T')
set([20])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'C','CTAG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCG','T')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGGCGCG')
set([])
>>> VCF_parser.get_substituted_ref_bases_nw(20,'TCGCG','TCGCGGCG')
set([])
-
parse_geno_depths()
Iterates through the VCF file, getting the genotypes at each position
Returns: A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> (base_depths)})
-
parse_select_geno_depths(genos, info_dict=False, use_chrom=None, start=None, end=None)
Iterates through the VCF file, getting selected genotypes at each position. Note that this assumes samples contain depths
Parameters: genos : list
The names of the genotypes you want
info_dict : boolean
Whether you want the info dict on the end of the return
use_chrom : str, optional
Optional chromosome on which to start the scan
start : int, optional
Optional place to start the scan. If use_chrom is also specified, it will be within that chromosome. Otherwise, it will be within the first chromosome to have at least the start point
end : int, optional
Optional place to end the scan. (Nested within chrom if specified)
Returns: A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> base_depths}, <info_dict if desired>)
-
parse_select_geno_generic(genos, info_dict=False, use_chrom=None, start=None, end=None, filter_excludes=None, filter_requires=None)
Iterates through the VCF file, getting selected genotypes at each position. This makes no assumptions about the format of the sample information for each genotype
Parameters: genos : list
The names of the genotypes you want
info_dict : boolean
Whether you want the info dict on the end of the return
use_chrom : str, optional
Optional chromosome on which to start the scan
start : int, optional
Optional place to start the scan
end : int, optional
Optional place to end the scan
filter_excludes : set, optional
Filter tags that should exclude the locus from being returned
filter_requires : set, optional
Filter tags that should be required for a locus to be returned
Returns: A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {sample -> {prefix -> val}}, <info_dict if desired>)
-
parse_site_infos(filter_excludes=None, filter_requires=None)
Iterates through the VCF file, getting the info for each site
Parameters: filter_excludes : set, optional
Filter tags that should exclude the locus from being returned
filter_requires : set, optional
Filter tags that should be required for a locus to be returned
Returns: A generator that yields tuples of (chrom, pos, (ref, alt1, alt2, ...), {field -> val}).
Fields without a corresponding value will have the value None
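One plausible reading of `filter_excludes` and `filter_requires` is sketched below against the FILTER column of a VCF record, which holds either `PASS`, a missing value (`.`), or a semicolon-separated list of filter codes. The page does not state whether a locus must carry all required tags or just one of them; this sketch assumes all. `passes_filters` is a hypothetical helper, not a genomfart function.

```python
def passes_filters(filter_field, filter_excludes=None, filter_requires=None):
    """Decide whether a locus should be returned, given its FILTER column.

    Assumptions: any excluded tag rejects the locus, and every required
    tag must be present for the locus to be kept.
    """
    # A missing FILTER value ('.') carries no tags at all.
    tags = set() if filter_field in (".", "") else set(filter_field.split(";"))
    if filter_excludes and tags & set(filter_excludes):
        return False
    if filter_requires and not set(filter_requires) <= tags:
        return False
    return True
```

With both arguments left at `None`, every locus passes, which matches the defaults shown in the signatures above.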
-