The Python package OffsetBasedGraph is a simple implementation of offset-based sequence graphs. The package lets you represent graphs as well as the Translation (difference) between them, making it possible to e.g. convert intervals from one graph to another.
Installation¶
The package with dependencies can be installed using pip:
pip3 install offsetbasedgraph
If you want to use the example data to run the examples using real data, you need to clone the repository:
git clone https://github.com/uio-cels/OffsetBasedGraph.git
cd OffsetBasedGraph
pip3 install -e
Test that installation works, e.g. by trying to import the package:
python3
>>> import offsetbasedgraph
Quickstart¶
The following are some simple examples to get you started. Alternatively, you can jump directly to the the documentation below.
Create a simple graph¶
Graphs can be created by passing a dict of blocks and a dict of edges to the init method. In the following example, we create a very simple graph consisting of two connected blocks, and save that graph to file:
from offsetbasedgraph import Block, Graph
blocks = {"chr1-part1": Block(20), "chr1-part2": Block(10)}
adjencies = {"chr1-part1": ["chr1-part2"]}
graph = Graph(blocks, adjencies)
graph.to_file("mygraph.graph")
The graph can later be read from file again:
graph = Graph.from_file("mygraph.graph")
Translating between graphs¶
Offset-based graphs are more interesting when one can represent intervals on them, and “translate” intervals between different graphs.
Assume we have a simple graph with an alternative locus and chrosome 1 as two blocks, not connected, and that we have a gene on chromosome 1 (represented as a simple Interval):
from offsetbasedgraph import Block, Graph, Interval, Translation
graph = Graph(
{
"chr1": Block(20),
"chr1_alt1": Block(10),
},
{}
)
gene = Interval(0, 8, ["chr1"], graph)
print(gene)
Now, assume we have a new graph where the first 5 base pairs of chr1 have been merged with the first 5 base pairs of chr1_alt1. This operation can easily be expressed by a Translation (see Translation class for details):
merge_translation = Translation(
{
"chr1_alt1": [Interval(0, 5, ["merged_part", "chr1_alt1-unmerged"])],
"chr1": [Interval(0, 15, ["merged_part", "chr1-unmerged"])]
},
{
"merged_part": [Interval(0, 5, ["chr1"]), Interval(0, 5, ["chr1_alt1"], graph)],
"chr1-unmerged": [Interval(5, 20, ["chr1"], graph)],
"chr1_alt1-unmerged": [Interval(5, 10, ["chr1_alt1"], graph)]
},
graph=graph
)
Now, it is easy to translate the gene interval to our merged graph:
translated_interval = merge_translation.translate(gene)
print(translated_interval)
Intervals and positions¶
Positions consist of a region path ID and an offset:
from offsetbasedgraph import Position
my_position = Position("chr1_some_alt", 100)
Singlepath intervals consist of a start position, end position and a list of region paths ids:
from offsetbasedgraph import Interval
start_position = Position("chr1", 100)
end_position = Position("chr1_some_alt", 100)
interval = Interval(start_position, end_position, ["chr1", "chr1_some_alt"])
The interval init method can also be initialized using only the start and end offsets instead of full start and end positions:
interval = Interval(100, 100, ["chr1", "chr1_some_alt"])
The package also supports some multipath intervals. See the documentation.
Example using GRCh38¶
This example assumes that you have cloned the repository (see Installation) and that you are positioned in the examples directory. The code for this example can be found in examples/grch38_example.py.
Included in the package are some utility functions that makes it easy to create GRCh38-like graphs. We here create a simple graph covering an alt locus on GRCh38:
from offsetbasedgraph.graphcreators import create_initial_grch38_graph, convert_to_numeric_graph
graph = create_initial_grch38_graph("chrom.sizes.example")
numeric_graph, name_translation = convert_to_numeric_graph(graph)
print(numeric_graph)
graph is now a simple graph with two blocks (an alt locus and chr1), and numeric_graph is the same graph converted so that all blocks have numeric IDs. name_translation is a translation from the original block names to the numeric format. This numeric format makes it easier to work with the graph, but we can always get back to the original graph later by using the translation object.
We now connect the alt locus to chr1 and remove flanking regions of the alt locus (regions identical to the main chromosome):
from offsetbasedgraph.graphcreators import connect_without_flanks
connected_graph, connected_trans = connect_without_flanks(numeric_graph, "alt.loci.example", name_translation)
print(connected_graph)
connected_trans now represents all the operations done when connecting the alt locus. Using this translation object, we can translate intervals from an original GRC3h8 graph (graph) to our connected_graph:
from offsetbasedgraph.gene import GeneList
gene_list = GeneList.from_file("genes.example")
translated_genes = gene_list.translate(connected_trans)
We have now successfully represented genes on a graph based on GRCh38.
Create full GRCh38 graph¶
The following shows a short snippet for building a graph from GRCh38, with merged flanks. It assumes you are positioned in the examples directory. The code for this example can be found in examples/full_grch38_graph_example.py.
NB: This code takes time to run, as remote sequence data needs to be downloaded:
from offsetbasedgraph.graphutils import *
graph = create_initial_grch38_graph("grch38.chrom.sizes")
numeric_graph, name_translation = convert_to_numeric_graph(graph)
new_numeric_graph, numeric_translation = connect_without_flanks(
numeric_graph, "grch38_alt_loci.txt", name_translation)
name_graph, new_name_translation = convert_to_text_graph(
new_numeric_graph, name_translation, numeric_translation)
final_translation = name_translation + numeric_translation + new_name_translation
final_translation.graph2 = name_graph
grch38_graph_with_flanks = name_graph
print(grch38_graph_with_flanks.summary())
# Translate example genes
gene_list = GeneList.from_file("genes.example")
print(gene_list)
translated_gene_list = gene_list.translate(final_translation)
print(translated_gene_list)
Documentation¶
Graph¶
-
class
offsetbasedgraph.
Graph
(blocks, adj_list, create_reverse_adj_list=True, rev_adj_list=None)¶ Bases:
object
Class for holding an Offset-Based Graph and performing simple operations on this graph.
Does this by storing a dict of Blocks and a dict of adjency lists (one list for each block that has adjencies).
>>> blocks = {"chr1-0": 100000, "chr1-1": 50000, "alt": 40000} >>> adj_list = {"chr1-0": ["chr1-1", "alt"]} >>> graph = Graph(blocks, adj_list)
-
assert_interval_in_graph
(interval)¶ Check that whole interval is contained in self
Parameters: interval – Interval
-
assert_position_in_graph
(position, exclusive=False)¶ Check that a position is in the graph
Parameters: - position – Position
- exclusive – Wheter to use exclusive end
-
static
block_origin
(name)¶ Checks if block is merged, alt or main based on the name
Parameters: name – block name Returns: merged, alt or main
-
connect_postitions
(position_a, position_b)¶ Connect position_a to position_b with an edge and split the region paths if necessary. Create translation object and new graph
Parameters: - position_a – Position
- position_b – Position
Returns: New graph and Translation object
Return type: (Graph, Translation)
-
create_subgraph_from_blocks
(blocks, alt_locus=None)¶ Creates a subgraph using existing edges and only the blocks send as argument
Parameters: - blocks – list of block ids
- alt_locus – If not None, alt loci blocks not from this alt locus will not be added to graph. This will typically be parallell alt loci that potentially can be added
Returns: Returns a new graph
-
create_subgraph_from_intervals
(intervals, padding=10000, alt_locus=None, base_trans=None)¶ Creates a subgraph containing all the intervals
Parameters: - intervals – list of intervals. All region paths in the intervals must create a connected subgraph.
- padding – number of baseapairs that should be in graph before first and after last intervals
Returns:
-
find_all_critical_blocks
()¶ Find critical blocks in the graph. I.e. blocks that are traversed by all paths
Returns: list of critical_blocks ids Return type: list(str)
-
find_critical_blocks
(start_block)¶ Find all critical_blocks starting from start block
Parameters: start_block – block id of start_block Returns: List of critical block ids Return type: list(str)
-
find_parallell_blocks
(blocks, filter_func)¶ Find all parallell_blocks to block that satisfies filter_func
Parameters: - block – region path id
- filter_func – predicate function
Returns: all parallell_blocks
Return type: list(str)
-
static
from_file
(file_name)¶ Load graph from pickle
Parameters: file_name – File name Return type: Graph
-
get_first_blocks
()¶ Parameters: graph – Graph Returns: Return a list of all blocks having no incoming edges Return type: list(Graph)
-
get_last_blocks
()¶ Parameters: graph – Graph Returns: Returns a list of all blocks having no incoming edges Return type: list(Graph)
-
has_identical_structure
(other)¶ Checks if this graph has identical structure (edges and blocks) to other graph. Size of region paths is ignores (and can be different).
Parameters: other – Graph to compare with Returns: True if identical, otherwise False
-
static
is_main_name
(name)¶ Check if name comes from a block on the main path
Parameters: name – block_id Return type: bool
-
static
level_dict
(blocks)¶ Return dict with block as key and level as value
Parameters: blocks – block_id Return type: dict
-
merge
(intervals)¶ Merge the given intervals in the graph, and return the resulting graph after merge and a translation object.
Parameters: intervals – list of intevals to merge Returns: translation object, resulting graph Return type: (Graph, Translation)
-
n_edges_in
(block)¶ Finds and returns the number of edges going in to a block
Parameters: block – Returns: Returns the number of edges
-
next_position
(pos)¶ Parameters: pos – Position Returns: The the next position in graph Return type: Position
-
prev_position
(pos)¶ Parameters: pos – Position Returns: The previous position in graph Return type: Position
-
remove
(block_id)¶ Remove a block including edges from the graph
Parameters: block_id – block id of the block
-
summary
()¶ Return summary text
Return type: str
-
to_file
(file_name)¶ Writes the graph to file so that it later can be recreated using the from_file method
Parameters: file_name – File name Returns:
-
Translation¶
-
class
offsetbasedgraph.
Translation
(translate_dict={}, reverse_dict={}, graph=None, block_lengths=None)¶ Holds information necessary to translate coordinates and intervals from one graph to another
>>> forward_dict = {"chr1": [Interval(0, 100, ["chr1-0", "chr1-1", "chr1-2"])], "alt": [Interval(0,100, ["chr1-1, "alt-0"])]} >>> reverse_dict = {"chr1-0": [Interval(0, 50, ["chr1"])], "chr1-1": [Interval(50, 110, ["chr1"]), Interval(0,50, ["alt"])], "chr1-2": [Interval(110, 210, ["chr1"])], "alt-0": [Interval(50, 100, ["alt"])]} >>> translation = Translation(forward_dict, reverse_dict, graph)
-
copy
()¶ Make new Translation object with copies of forward and reverse translation dicts
Return type: Translation
-
static
from_file
(file_name)¶ Reads Translation object from pickle
-
get_external_edges
(subgraph, edges, rev_edges)¶ Get new external edges, i.e edges between region paths that have been translated
Parameters: - subgraph – subgraph to be translated
- edges – Adj list to update
Return type: None (updates edges)
-
get_internal_edges
(subgraph, edges, reverse_edges)¶ Find new edges gotten from splitting region paths under current translation
Parameters: - subgraph – former subgraph
- edges – adjacency list to be updated
- reverse_edges – reverse adjacency list to be updated
-
get_old_edges
(subgraph)¶ Find all edges in subgraph that are still valid after this translation
Parameters: subgraph – former subgraph Returns: Adjacancy list of valid edges Return type: defaultdict(list)
-
classmethod
make_name_translation
(trans_dict, graph)¶ Creates a copied version of trans_dict where region path have name IDs
Parameters: - trans_dict –
- graph –
Returns: Returns a new translation object
-
set_graph2
(graph2)¶ Set graph2 (“other graph”) and update all Interval objects in forward dict so that they have this graph.
Parameters: graph2 – Graph
-
to_file
(file_name)¶ Writes translation object to pickle
Parameters: file_name – File name
-
translate
(obj)¶ Check type of obj and translate accordingly Only for use where there is a one-to-one translation
Parameters: obj – Interval/Position Returns: translated object Return type: Interval/Position
-
translate_interval
(interval, inverse=False)¶ Translate an interval between the two coordinate systems.
Parameters: - interval – Interval
- inverse – wheter to use the inverse translation
Returns: Returns an interval. If inverse is True, a list of intervals is returned.
-
translate_position
(position, inverse=False)¶ Translates a position
Parameters: - position – Position
- inverse – If True, translate back
Returns: Returns the translated Position
Return type:
-
translate_rp
(rp, inverse=False)¶ Translate region path, by dict lookup
Parameters: rp – region path id Returns: translated interval Return type: list(Interval)
-
translate_subgraph
(subgraph)¶ Translates a graph (forward). The graph has to be a subgraph of graph1 in the translation object.
Parameters: subgraph – Subgraph to translate. Returns: Returns the translated subgraph
-
Singlepath intervals¶
-
class
offsetbasedgraph.
Interval
(start_position, end_position, region_paths=None, graph=None)¶ Represents an interval on a graph
>>> Interval(Position("chr1-0", 10), Position("chr1-1", 100), ["chr1-0", "alt, "chr1-1"]), graph)
-
contains
(other, tolerance=0)¶ Check if transcription region and all exons of other are contained in self. Accecpt difference in start and end coordinates lower than tolerance
Parameters: - other – other gene
- tolerance – number of basepair tolerance
Returns: wheter self conatins other
Return type: bool
-
contains_rp
(region_path)¶ Check if interval uses region_path
Parameters: region_path – region path id Return type: bool
-
get_adj_list
()¶ Returns: Returns every adjency in the interval as a list of tuples (from_block_id, to_block_id)
-
get_position_from_offset
(offset, rp_lens=None)¶ Get position of with offset counted from the start of the interval
Parameters: offset – Returns: The position in the graph Return type: Position
-
notation
()¶ Returns: Returns a human readable notation of the interval
-
Multi-path intervals¶
-
class
offsetbasedgraph.
GeneralMultiPathInterval
(start_positions, end_positions, region_paths, graph)¶ Holds a general multipath interval by representing all possible start positions, end positions and all possible region paths within the interval.
A method can return (in some cases) a trivial set of single path intervals contained by this multi path interval
-
get_single_path_intervals
()¶ Generate all single path intervals.
Returns: generator
-
-
class
offsetbasedgraph.
CriticalPathsMultiPathInterval
(start_pos, end_pos, critical_intervals)¶
Position¶
Genes¶
-
class
offsetbasedgraph.gene.
Gene
(name, transcription_region, exons, coding_region, strand, cds_status)¶ -
approxEquals
(other, tolerance=0)¶ Check if other is approximately equal to self. Difference between start and end positions of transcription region as well as all exons must be less than tolerance
Parameters: - other – Gene
- tolerance – allowed differnce (int)
Return type: bool
-
classmethod
from_dict
(attr_dict)¶ Create Gene object from gene dict obtained from csv dict reader
Parameters: attr_dict – gene dict Returns: Gene object Return type: Gene
-
get_contained_pairs
(other, tolerance=0)¶ Get list of all pairs (my, his) of exons in other that are contained in an exon in self
Parameters: - other – Gene
- tolerance – tolerance for contains (int)
Returns: list of pairs of (container, containee) exons
Return type:
-
get_transcript_length
()¶ Return combined length of exons
Returns: transcript length Return type: int
-
length
()¶ Return length of transcription region :rtype: int
-
multiple_alt_loci
()¶ Returns true if gene stretches over multiple alt loci. NB: Works only when gene is represented on a graph where alt loci blocks have names with alt loci in them
-
Utility functions¶
Graphutils¶
A collection of utility functions used to work with Graphs and Translations using real data.
-
offsetbasedgraph.graphutils.
analyze_genes_on_merged_graph
(genes, translation)¶ Find similarities between gene annotations on alt-loci and main chromosomes when represented on a merged graph Translate the genes to the merged graph and perform analysis
Parameters: - genes – list of genes on original graph
- translation – translation from original to merged graph
-
offsetbasedgraph.graphutils.
analyze_multipath_genes_for_alt
(alt_id, alt_loci_genes, main_genes, graph, name_trans, ncbi_alignments_dir)¶ Find genes with the same fuzzy multipath interval representation
Parameters: - alt_id – Alt locus id
- alt_loci_genes – Genes on alt loci
- main_genes – Genes on main chromosome paralell to alt loci
- graph – Graph
- name_trans – Translation
- ncbi_alignments_dir – Directory to find alt-loci info
Returns: Number of genes on alt-loci with match on main chromsome
Return type: int
-
offsetbasedgraph.graphutils.
create_gene_dicts
(genes, alt_loci_fn, identical_names=False)¶ Takes a list of genes, and creates an alt loci dict and a gene name dict. The dicts index the genes on chromosome/alt_loci id
Parameters: genes – list of genes Returns: alt loci dict, name dict, paralell_dict Return type: defaultdict, defaultdict, defaultdict
-
offsetbasedgraph.graphutils.
create_subgraph_around_alt_locus
(graph, trans, alt_locus, padding=200000, alt_loci_fn='grch38_alt_loci.txt')¶ Takes a translation from an original grch38 graph (created by create_initial_grch38_graph) and uses a translation from that graph to current graph in order to filter out the subgraph around a particular alt locus.
Parameters: - graph – Graph to create subgraph from
- trans – Translation from original grch38 graph to current graph
- alt_locus – alt locus ID to create subgraph around
- padding – Padding. Number if basepairs to include around alt locus
Returns: Returns subgraph, a translation object and start position of the subgraph
-
offsetbasedgraph.graphutils.
fuzzy_gene_analysis
(genes, text_graph, ncbi_alignments_dir, alt_loci_filename)¶ Find number of genes on alt-loci that can be represented by the fuzzy multipath interval of a gene on a main chromsome when using a complex graph
Parameters: - genes – List of genes
- text_graph – Graph
- ncbi_alignments_dir – Directory alignments
- alt_loci_filename – File with alt loci info
-
offsetbasedgraph.graphutils.
get_alt_loci_positions
(alt_loci_file_name)¶ Create a dict of alt loci positions read from alt_loci_file_name
Parameters: alt_loci_file_name – file name Returns: dict{alt_loci_id: info (dict)} Return type: dict{str: dict}
-
offsetbasedgraph.graphutils.
get_gene_objects_as_intervals
(fn, graph=None)¶ Returns a dict. Keys are gene names and values are intervals representing the gene
Parameters: fn – File name of file containing genes (on the format of UCSC) Returns:
-
offsetbasedgraph.graphutils.
parse_genes_file
(genes_fn)¶ Parses a file containing genes (on the format of UCSC), and returns a list of dicts
Parameters: genes_fn – File name Returns: Returns a list of genes, one dict for each gene
-
offsetbasedgraph.graphutils.
translate_single_gene_to_aligned_graph
(gene, trans)¶ Translate the gene to a multipath gene on trans.graph2
Parameters: - gene – Gene
- trans – Translation
Return type: MultiPathGene
-
offsetbasedgraph.graphutils.
translate_to_fuzzy_interval
(gene, trans)¶ Translate the gene to a fuzzy interval on graph2 of trans
Parameters: - gene – Gene
- trans – Translation
Return type: FuzzyGene
Graphcreators¶
This is a collection of methods for creating and modifying graphs.
-
offsetbasedgraph.graphcreators.
connect_without_flanks
(graph, alt_loci_fn, name_translation, filter_alt_loci=[])¶ Connects the alternative loci in the given file to the grch38 graph, without flanks.
Parameters: - alt_loci_fn – Filename of file containing alternative loci. One alt locus on each line. Four columns: alt_locus_id chr chr_start chr_stop
- filter_alt_loci – If not empty, only these alt loci will be connected
Returns: Returns the new graph and translation from graph to new graph
Return type: (Graph, Translation)
-
offsetbasedgraph.graphcreators.
convert_to_numeric_graph
(graph)¶ Convert graph with str region paths id to graph with int region path ids
Parameters: graph – Graph Returns: Graph with int region path ids + Translation Return type: (Graph, Translation)
-
offsetbasedgraph.graphcreators.
convert_to_text_graph
(graph, name_translation, numeric_translation)¶ Convert graph with numeric region path ids to graph with string region path ids. The new ids are generated by merging the string region path ids that are translated to the numeric ids in name_translation
Parameters: - graph – Graph with numeric region path ids
- name_translation – Translation from string ids to numeric ids
- numeric_translation – Translation between numeric id graphs
Returns: Graph with string ids + Translation
Return type: (Graph, Translation)
-
offsetbasedgraph.graphcreators.
create_initial_grch38_graph
(chrom_sizes_fn)¶ Creates an initial grch38 graph with no connected blocks
Parameters: chrom_sizes_fn – Filename of chrom sizes file Returns: Returns a Graph with no connected blocks Return type: Graph
-
offsetbasedgraph.graphcreators.
grch38_graph_to_numeric
(original_grch38_graph)¶ Convert a GRCh38 graph with string ids (chromosome id) to isomorphic graph with numeric ids
Parameters: original_grch38_graph – Graph Returns: Graph with numeric region path ids + Translation Return type: (Graph, Translation)
-
offsetbasedgraph.graphcreators.
merge_alt_using_cigar
(original_numeric_grch38_graph, trans, alt_id, ncbi_alignments_dir)¶ Uses a cigar string from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_structure/ and merges the alt locus.
Parameters: - original_numeric_grch38_graph –
- trans – Translation object from original grch38 graph to numeric (returned by grch38_graph_to_numeric)
- alt_id – Alt id (e.g. chr2_KI270774v1_alt)
- ncbi_alignments_dir – Directory containing preprocessed ncbi alignment files
Returns: Returns the new graph and a translation object between the original grch38 graph and the new graph
Return type: (Graph, Translation)
-
offsetbasedgraph.graphcreators.
merge_flanks
(intervals, final_trans, new_graph, name_translation)¶ Merges the start and end flanks represented in intervals on the given graph
Parameters: - intervals – List of start and end flanks to merge [start, start, end, end] Intervals should be on first graph (i.e. name_translation.graph1)
- final_trans – Trans that will be updated
- new_graph – Current graph
- name_translation – Translation from human readable names to numeric IDs. Can be an empty translation
Returns: Returns the new graph and translation as a tuple
Return type: (Graph, Translation)