The Python package OffsetBasedGraph is a simple implementation of offset-based sequence graphs. The package lets you represent graphs as well as the Translation (difference) between them, making it possible to e.g. convert intervals from one graph to another.

Installation

The package with dependencies can be installed using pip:

pip3 install offsetbasedgraph

If you want to use the example data to run the examples using real data, you need to clone the repository:

git clone https://github.com/uio-cels/OffsetBasedGraph.git
cd OffsetBasedGraph
pip3 install -e

Test that installation works, e.g. by trying to import the package:

python3
>>> import offsetbasedgraph

Quickstart

The following are some simple examples to get you started. Alternatively, you can jump directly to the the documentation below.

Create a simple graph

Graphs can be created by passing a dict of blocks and a dict of edges to the init method. In the following example, we create a very simple graph consisting of two connected blocks, and save that graph to file:

from offsetbasedgraph import Block, Graph
blocks = {"chr1-part1": Block(20), "chr1-part2": Block(10)}
adjencies = {"chr1-part1": ["chr1-part2"]}
graph = Graph(blocks, adjencies)
graph.to_file("mygraph.graph")

The graph can later be read from file again:

graph = Graph.from_file("mygraph.graph")

Translating between graphs

Offset-based graphs are more interesting when one can represent intervals on them, and “translate” intervals between different graphs.

Assume we have a simple graph with an alternative locus and chrosome 1 as two blocks, not connected, and that we have a gene on chromosome 1 (represented as a simple Interval):

from offsetbasedgraph import Block, Graph, Interval, Translation
graph = Graph(
    {
        "chr1": Block(20),
        "chr1_alt1": Block(10),
    },
    {}
)

gene = Interval(0, 8, ["chr1"], graph)
print(gene)

Now, assume we have a new graph where the first 5 base pairs of chr1 have been merged with the first 5 base pairs of chr1_alt1. This operation can easily be expressed by a Translation (see Translation class for details):

merge_translation = Translation(
    {
        "chr1_alt1": [Interval(0, 5, ["merged_part", "chr1_alt1-unmerged"])],
        "chr1": [Interval(0, 15, ["merged_part", "chr1-unmerged"])]
    },
    {
        "merged_part": [Interval(0, 5, ["chr1"]), Interval(0, 5, ["chr1_alt1"], graph)],
        "chr1-unmerged": [Interval(5, 20, ["chr1"], graph)],
        "chr1_alt1-unmerged": [Interval(5, 10, ["chr1_alt1"], graph)]
    },
    graph=graph
)

Now, it is easy to translate the gene interval to our merged graph:

translated_interval = merge_translation.translate(gene)
print(translated_interval)

Intervals and positions

Positions consist of a region path ID and an offset:

from offsetbasedgraph  import Position
my_position = Position("chr1_some_alt", 100)

Singlepath intervals consist of a start position, end position and a list of region paths ids:

from offsetbasedgraph import Interval
start_position = Position("chr1", 100)
end_position = Position("chr1_some_alt", 100)
interval = Interval(start_position, end_position, ["chr1", "chr1_some_alt"])

The interval init method can also be initialized using only the start and end offsets instead of full start and end positions:

interval = Interval(100, 100, ["chr1", "chr1_some_alt"])

The package also supports some multipath intervals. See the documentation.

Example using GRCh38

This example assumes that you have cloned the repository (see Installation) and that you are positioned in the examples directory. The code for this example can be found in examples/grch38_example.py.

Included in the package are some utility functions that makes it easy to create GRCh38-like graphs. We here create a simple graph covering an alt locus on GRCh38:

from offsetbasedgraph.graphcreators import create_initial_grch38_graph, convert_to_numeric_graph
graph = create_initial_grch38_graph("chrom.sizes.example")
numeric_graph, name_translation = convert_to_numeric_graph(graph)
print(numeric_graph)

graph is now a simple graph with two blocks (an alt locus and chr1), and numeric_graph is the same graph converted so that all blocks have numeric IDs. name_translation is a translation from the original block names to the numeric format. This numeric format makes it easier to work with the graph, but we can always get back to the original graph later by using the translation object.

We now connect the alt locus to chr1 and remove flanking regions of the alt locus (regions identical to the main chromosome):

from offsetbasedgraph.graphcreators import connect_without_flanks
connected_graph, connected_trans = connect_without_flanks(numeric_graph, "alt.loci.example", name_translation)
print(connected_graph)

connected_trans now represents all the operations done when connecting the alt locus. Using this translation object, we can translate intervals from an original GRC3h8 graph (graph) to our connected_graph:

from offsetbasedgraph.gene import GeneList
gene_list = GeneList.from_file("genes.example")
translated_genes = gene_list.translate(connected_trans)

We have now successfully represented genes on a graph based on GRCh38.

Create full GRCh38 graph

The following shows a short snippet for building a graph from GRCh38, with merged flanks. It assumes you are positioned in the examples directory. The code for this example can be found in examples/full_grch38_graph_example.py.

NB: This code takes time to run, as remote sequence data needs to be downloaded:

from offsetbasedgraph.graphutils import *
graph = create_initial_grch38_graph("grch38.chrom.sizes")
numeric_graph, name_translation = convert_to_numeric_graph(graph)
new_numeric_graph, numeric_translation = connect_without_flanks(
    numeric_graph, "grch38_alt_loci.txt", name_translation)
name_graph, new_name_translation = convert_to_text_graph(
    new_numeric_graph, name_translation, numeric_translation)
final_translation = name_translation + numeric_translation + new_name_translation
final_translation.graph2 = name_graph

grch38_graph_with_flanks = name_graph

print(grch38_graph_with_flanks.summary())

# Translate example genes
gene_list = GeneList.from_file("genes.example")
print(gene_list)

translated_gene_list = gene_list.translate(final_translation)
print(translated_gene_list)

Documentation

Graph

class offsetbasedgraph.Graph(blocks, adj_list, create_reverse_adj_list=True, rev_adj_list=None)

Bases: object

Class for holding an Offset-Based Graph and performing simple operations on this graph.

Does this by storing a dict of Blocks and a dict of adjency lists (one list for each block that has adjencies).

>>> blocks = {"chr1-0": 100000, "chr1-1": 50000, "alt": 40000}
>>> adj_list = {"chr1-0": ["chr1-1", "alt"]}
>>> graph = Graph(blocks, adj_list)
assert_interval_in_graph(interval)

Check that whole interval is contained in self

Parameters:interval – Interval
assert_position_in_graph(position, exclusive=False)

Check that a position is in the graph

Parameters:
  • position – Position
  • exclusive – Wheter to use exclusive end
static block_origin(name)

Checks if block is merged, alt or main based on the name

Parameters:name – block name
Returns:merged, alt or main
connect_postitions(position_a, position_b)

Connect position_a to position_b with an edge and split the region paths if necessary. Create translation object and new graph

Parameters:
  • position_a – Position
  • position_b – Position
Returns:

New graph and Translation object

Return type:

(Graph, Translation)

copy()

Make a copy of the graph

Returns:copy of the graph
Return type:Graph
create_subgraph_from_blocks(blocks, alt_locus=None)

Creates a subgraph using existing edges and only the blocks send as argument

Parameters:
  • blocks – list of block ids
  • alt_locus – If not None, alt loci blocks not from this alt locus will not be added to graph. This will typically be parallell alt loci that potentially can be added
Returns:

Returns a new graph

create_subgraph_from_intervals(intervals, padding=10000, alt_locus=None, base_trans=None)

Creates a subgraph containing all the intervals

Parameters:
  • intervals – list of intervals. All region paths in the intervals must create a connected subgraph.
  • padding – number of baseapairs that should be in graph before first and after last intervals
Returns:

find_all_critical_blocks()

Find critical blocks in the graph. I.e. blocks that are traversed by all paths

Returns:list of critical_blocks ids
Return type:list(str)
find_critical_blocks(start_block)

Find all critical_blocks starting from start block

Parameters:start_block – block id of start_block
Returns:List of critical block ids
Return type:list(str)
find_parallell_blocks(blocks, filter_func)

Find all parallell_blocks to block that satisfies filter_func

Parameters:
  • block – region path id
  • filter_func – predicate function
Returns:

all parallell_blocks

Return type:

list(str)

static from_file(file_name)

Load graph from pickle

Parameters:file_name – File name
Return type:Graph
get_first_blocks()
Parameters:graph – Graph
Returns:Return a list of all blocks having no incoming edges
Return type:list(Graph)
get_last_blocks()
Parameters:graph – Graph
Returns:Returns a list of all blocks having no incoming edges
Return type:list(Graph)
has_identical_structure(other)

Checks if this graph has identical structure (edges and blocks) to other graph. Size of region paths is ignores (and can be different).

Parameters:other – Graph to compare with
Returns:True if identical, otherwise False
static is_main_name(name)

Check if name comes from a block on the main path

Parameters:name – block_id
Return type:bool
static level_dict(blocks)

Return dict with block as key and level as value

Parameters:blocks – block_id
Return type:dict
merge(intervals)

Merge the given intervals in the graph, and return the resulting graph after merge and a translation object.

Parameters:intervals – list of intevals to merge
Returns:translation object, resulting graph
Return type:(Graph, Translation)
n_edges_in(block)

Finds and returns the number of edges going in to a block

Parameters:block
Returns:Returns the number of edges
next_position(pos)
Parameters:pos – Position
Returns:The the next position in graph
Return type:Position
prev_position(pos)
Parameters:pos – Position
Returns:The previous position in graph
Return type:Position
remove(block_id)

Remove a block including edges from the graph

Parameters:block_id – block id of the block
summary()

Return summary text

Return type:str
to_file(file_name)

Writes the graph to file so that it later can be recreated using the from_file method

Parameters:file_name – File name
Returns:

Translation

class offsetbasedgraph.Translation(translate_dict={}, reverse_dict={}, graph=None, block_lengths=None)

Holds information necessary to translate coordinates and intervals from one graph to another

>>> forward_dict = {"chr1": [Interval(0, 100, ["chr1-0", "chr1-1", "chr1-2"])],
"alt": [Interval(0,100, ["chr1-1, "alt-0"])]}
>>> reverse_dict = {"chr1-0": [Interval(0, 50, ["chr1"])],
"chr1-1": [Interval(50, 110, ["chr1"]), Interval(0,50, ["alt"])],
"chr1-2": [Interval(110, 210, ["chr1"])],
"alt-0": [Interval(50, 100, ["alt"])]}
>>> translation = Translation(forward_dict, reverse_dict, graph)
copy()

Make new Translation object with copies of forward and reverse translation dicts

Return type:Translation
static from_file(file_name)

Reads Translation object from pickle

get_external_edges(subgraph, edges, rev_edges)

Get new external edges, i.e edges between region paths that have been translated

Parameters:
  • subgraph – subgraph to be translated
  • edges – Adj list to update
Return type:

None (updates edges)

get_internal_edges(subgraph, edges, reverse_edges)

Find new edges gotten from splitting region paths under current translation

Parameters:
  • subgraph – former subgraph
  • edges – adjacency list to be updated
  • reverse_edges – reverse adjacency list to be updated
get_old_edges(subgraph)

Find all edges in subgraph that are still valid after this translation

Parameters:subgraph – former subgraph
Returns:Adjacancy list of valid edges
Return type:defaultdict(list)
classmethod make_name_translation(trans_dict, graph)

Creates a copied version of trans_dict where region path have name IDs

Parameters:
  • trans_dict
  • graph
Returns:

Returns a new translation object

set_graph2(graph2)

Set graph2 (“other graph”) and update all Interval objects in forward dict so that they have this graph.

Parameters:graph2 – Graph
to_file(file_name)

Writes translation object to pickle

Parameters:file_name – File name
translate(obj)

Check type of obj and translate accordingly Only for use where there is a one-to-one translation

Parameters:obj – Interval/Position
Returns:translated object
Return type:Interval/Position
translate_interval(interval, inverse=False)

Translate an interval between the two coordinate systems.

Parameters:
  • interval – Interval
  • inverse – wheter to use the inverse translation
Returns:

Returns an interval. If inverse is True, a list of intervals is returned.

translate_position(position, inverse=False)

Translates a position

Parameters:
  • position – Position
  • inverse – If True, translate back
Returns:

Returns the translated Position

Return type:

Position

translate_rp(rp, inverse=False)

Translate region path, by dict lookup

Parameters:rp – region path id
Returns:translated interval
Return type:list(Interval)
translate_subgraph(subgraph)

Translates a graph (forward). The graph has to be a subgraph of graph1 in the translation object.

Parameters:subgraph – Subgraph to translate.
Returns:Returns the translated subgraph

Singlepath intervals

class offsetbasedgraph.Interval(start_position, end_position, region_paths=None, graph=None)

Represents an interval on a graph

>>> Interval(Position("chr1-0", 10), Position("chr1-1", 100),
["chr1-0", "alt, "chr1-1"]), graph)
contains(other, tolerance=0)

Check if transcription region and all exons of other are contained in self. Accecpt difference in start and end coordinates lower than tolerance

Parameters:
  • other – other gene
  • tolerance – number of basepair tolerance
Returns:

wheter self conatins other

Return type:

bool

contains_rp(region_path)

Check if interval uses region_path

Parameters:region_path – region path id
Return type:bool
get_adj_list()
Returns:Returns every adjency in the interval as a list of tuples (from_block_id, to_block_id)
get_position_from_offset(offset, rp_lens=None)

Get position of with offset counted from the start of the interval

Parameters:offset
Returns:The position in the graph
Return type:Position
notation()
Returns:Returns a human readable notation of the interval

Multi-path intervals

class offsetbasedgraph.GeneralMultiPathInterval(start_positions, end_positions, region_paths, graph)

Holds a general multipath interval by representing all possible start positions, end positions and all possible region paths within the interval.

A method can return (in some cases) a trivial set of single path intervals contained by this multi path interval

get_single_path_intervals()

Generate all single path intervals.

Returns:generator
class offsetbasedgraph.CriticalPathsMultiPathInterval(start_pos, end_pos, critical_intervals)
class offsetbasedgraph.FuzzyMultipathInterval(start_offset, end_offset, region_paths, threshold, graph=None)
length()
Returns:The length of the central path of interval
Return type:int

Position

class offsetbasedgraph.Position(region_path_id, offset)

Represents a position in the graph

>>> Position("chr1-1", 100)
notation()
Returns:Returns a numan readable string notation

Genes

class offsetbasedgraph.gene.Gene(name, transcription_region, exons, coding_region, strand, cds_status)
approxEquals(other, tolerance=0)

Check if other is approximately equal to self. Difference between start and end positions of transcription region as well as all exons must be less than tolerance

Parameters:
  • other – Gene
  • tolerance – allowed differnce (int)
Return type:

bool

classmethod from_dict(attr_dict)

Create Gene object from gene dict obtained from csv dict reader

Parameters:attr_dict – gene dict
Returns:Gene object
Return type:Gene
get_contained_pairs(other, tolerance=0)

Get list of all pairs (my, his) of exons in other that are contained in an exon in self

Parameters:
  • other – Gene
  • tolerance – tolerance for contains (int)
Returns:

list of pairs of (container, containee) exons

Return type:

list( (Interval, Interval)..)

get_transcript_length()

Return combined length of exons

Returns:transcript length
Return type:int
length()

Return length of transcription region :rtype: int

multiple_alt_loci()

Returns true if gene stretches over multiple alt loci. NB: Works only when gene is represented on a graph where alt loci blocks have names with alt loci in them

translate(T)

Translate transcription_region and exons and return new gene

Parameters:T – Translation object
Returns:Translated gene
Return type:Gene

Utility functions

Graphutils

A collection of utility functions used to work with Graphs and Translations using real data.

offsetbasedgraph.graphutils.analyze_genes_on_merged_graph(genes, translation)

Find similarities between gene annotations on alt-loci and main chromosomes when represented on a merged graph Translate the genes to the merged graph and perform analysis

Parameters:
  • genes – list of genes on original graph
  • translation – translation from original to merged graph
offsetbasedgraph.graphutils.analyze_multipath_genes_for_alt(alt_id, alt_loci_genes, main_genes, graph, name_trans, ncbi_alignments_dir)

Find genes with the same fuzzy multipath interval representation

Parameters:
  • alt_id – Alt locus id
  • alt_loci_genes – Genes on alt loci
  • main_genes – Genes on main chromosome paralell to alt loci
  • graph – Graph
  • name_trans – Translation
  • ncbi_alignments_dir – Directory to find alt-loci info
Returns:

Number of genes on alt-loci with match on main chromsome

Return type:

int

offsetbasedgraph.graphutils.create_gene_dicts(genes, alt_loci_fn, identical_names=False)

Takes a list of genes, and creates an alt loci dict and a gene name dict. The dicts index the genes on chromosome/alt_loci id

Parameters:genes – list of genes
Returns:alt loci dict, name dict, paralell_dict
Return type:defaultdict, defaultdict, defaultdict
offsetbasedgraph.graphutils.create_subgraph_around_alt_locus(graph, trans, alt_locus, padding=200000, alt_loci_fn='grch38_alt_loci.txt')

Takes a translation from an original grch38 graph (created by create_initial_grch38_graph) and uses a translation from that graph to current graph in order to filter out the subgraph around a particular alt locus.

Parameters:
  • graph – Graph to create subgraph from
  • trans – Translation from original grch38 graph to current graph
  • alt_locus – alt locus ID to create subgraph around
  • padding – Padding. Number if basepairs to include around alt locus
Returns:

Returns subgraph, a translation object and start position of the subgraph

offsetbasedgraph.graphutils.fuzzy_gene_analysis(genes, text_graph, ncbi_alignments_dir, alt_loci_filename)

Find number of genes on alt-loci that can be represented by the fuzzy multipath interval of a gene on a main chromsome when using a complex graph

Parameters:
  • genes – List of genes
  • text_graph – Graph
  • ncbi_alignments_dir – Directory alignments
  • alt_loci_filename – File with alt loci info
offsetbasedgraph.graphutils.get_alt_loci_positions(alt_loci_file_name)

Create a dict of alt loci positions read from alt_loci_file_name

Parameters:alt_loci_file_name – file name
Returns:dict{alt_loci_id: info (dict)}
Return type:dict{str: dict}
offsetbasedgraph.graphutils.get_gene_objects_as_intervals(fn, graph=None)

Returns a dict. Keys are gene names and values are intervals representing the gene

Parameters:fn – File name of file containing genes (on the format of UCSC)
Returns:
offsetbasedgraph.graphutils.parse_genes_file(genes_fn)

Parses a file containing genes (on the format of UCSC), and returns a list of dicts

Parameters:genes_fn – File name
Returns:Returns a list of genes, one dict for each gene
offsetbasedgraph.graphutils.translate_single_gene_to_aligned_graph(gene, trans)

Translate the gene to a multipath gene on trans.graph2

Parameters:
  • gene – Gene
  • trans – Translation
Return type:

MultiPathGene

offsetbasedgraph.graphutils.translate_to_fuzzy_interval(gene, trans)

Translate the gene to a fuzzy interval on graph2 of trans

Parameters:
  • gene – Gene
  • trans – Translation
Return type:

FuzzyGene

Graphcreators

This is a collection of methods for creating and modifying graphs.

offsetbasedgraph.graphcreators.connect_without_flanks(graph, alt_loci_fn, name_translation, filter_alt_loci=[])

Connects the alternative loci in the given file to the grch38 graph, without flanks.

Parameters:
  • alt_loci_fn – Filename of file containing alternative loci. One alt locus on each line. Four columns: alt_locus_id chr chr_start chr_stop
  • filter_alt_loci – If not empty, only these alt loci will be connected
Returns:

Returns the new graph and translation from graph to new graph

Return type:

(Graph, Translation)

offsetbasedgraph.graphcreators.convert_to_numeric_graph(graph)

Convert graph with str region paths id to graph with int region path ids

Parameters:graph – Graph
Returns:Graph with int region path ids + Translation
Return type:(Graph, Translation)
offsetbasedgraph.graphcreators.convert_to_text_graph(graph, name_translation, numeric_translation)

Convert graph with numeric region path ids to graph with string region path ids. The new ids are generated by merging the string region path ids that are translated to the numeric ids in name_translation

Parameters:
  • graph – Graph with numeric region path ids
  • name_translation – Translation from string ids to numeric ids
  • numeric_translation – Translation between numeric id graphs
Returns:

Graph with string ids + Translation

Return type:

(Graph, Translation)

offsetbasedgraph.graphcreators.create_initial_grch38_graph(chrom_sizes_fn)

Creates an initial grch38 graph with no connected blocks

Parameters:chrom_sizes_fn – Filename of chrom sizes file
Returns:Returns a Graph with no connected blocks
Return type:Graph
offsetbasedgraph.graphcreators.grch38_graph_to_numeric(original_grch38_graph)

Convert a GRCh38 graph with string ids (chromosome id) to isomorphic graph with numeric ids

Parameters:original_grch38_graph – Graph
Returns:Graph with numeric region path ids + Translation
Return type:(Graph, Translation)
offsetbasedgraph.graphcreators.merge_alt_using_cigar(original_numeric_grch38_graph, trans, alt_id, ncbi_alignments_dir)

Uses a cigar string from ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GCA_000001405.15_GRCh38_assembly_structure/ and merges the alt locus.

Parameters:
  • original_numeric_grch38_graph
  • trans – Translation object from original grch38 graph to numeric (returned by grch38_graph_to_numeric)
  • alt_id – Alt id (e.g. chr2_KI270774v1_alt)
  • ncbi_alignments_dir – Directory containing preprocessed ncbi alignment files
Returns:

Returns the new graph and a translation object between the original grch38 graph and the new graph

Return type:

(Graph, Translation)

offsetbasedgraph.graphcreators.merge_flanks(intervals, final_trans, new_graph, name_translation)

Merges the start and end flanks represented in intervals on the given graph

Parameters:
  • intervals – List of start and end flanks to merge [start, start, end, end] Intervals should be on first graph (i.e. name_translation.graph1)
  • final_trans – Trans that will be updated
  • new_graph – Current graph
  • name_translation – Translation from human readable names to numeric IDs. Can be an empty translation
Returns:

Returns the new graph and translation as a tuple

Return type:

(Graph, Translation)