Creating a graph of contigs sharing readsΒΆ

Whenever you import relationships between reads and contigs (such as an ACE file using the mdb-import-ACE-alignments tool; see EMPTY) you end up with cases where two contigs may share a read. I.e., the same read was used to create two (or more) contigs.

This situation can arise when the assembly algorithm couldn’t decide in which contig to assign a read. MetagenomeDB can easily track those events and allow you to visualize them with additional tools, such as Cytoscape.

Here is the code, with comments, that will generate such visualization:

#!/usr/bin/env python

### Parameters for this script; you can modify them at will:
# collection of contigs to extract a network from
contigs_collection = "my collection"

# filename for the Cytoscape-formatted network
network_fn = "network.sif"

# filename for the edges annotations (fraction of shared reads, as the Jaccard similarity)
edges_annotations_fn = "edges_annotations.txt"

# filename for the vertices annotations (length of the contig)
vertices_annotations_fn = "vertices_annotations.txt"
###

# first step, we import the MetagenomeDB API.
import MetagenomeDB as mdb

# we also import some other libraries
import sys

# then we check if the contigs' collection exists
cc = mdb.Collection.find_one({"name": contigs_collection})

if (cc == None):
        print >>sys.stderr, "ERROR: no collection found with name '%s'" % contigs_collection
        sys.exit(1)

# we retrieve a list of all contigs in this collection; in our
# example we request all sequences in the collection with a value
# 'contig' for the property 'class'. However, this is optional if
# you do know your collection only contains contigs.
contigs = list(cc.list_sequences({"class": "contig"}))

print "number of contigs to consider: %s" % len(contigs)

# we create a dictionary to store edges
edges = {}

def order (a, b):
        if (a > b):
                return b, a
        else:
                return a, b

# we now iterate through all non-redundant pairs of contigs:
n = len(contigs) * (len(contigs) - 1) / 2
c = 0

print "number of pairs of contigs to consider: %s" % n

for i, contig_a in enumerate(contigs):

        # for each contig in contig_a we request the list of the
        # reads that has been used for its assembly. Those are
        # sequences that are related to the contig and have (for
        # example) a value 'read' for the property 'class'
        reads_a = contig_a.list_related_sequences(mdb.Direction.INGOING, {"class": "read"})

        for contig_b in contigs[i+1:]:

                # similarily, we request the reads for contig_b
                reads_b = contig_b.list_related_sequences(mdb.Direction.INGOING, {"class": "read"})

                # for speed purpose in the computation below we will
                # transform this later list of reads into a dictionary
                reads_b = {}.fromkeys(reads_b)

                # we then check if those two lists overlap; we also
                # calculate the Jaccard similarity, as the cardinal
                # of the intersection of the two lists divided by
                # the cardinal of their union.
                union, intersection = {}, {}

                for item in reads_a:
                        if (item in reads_b):
                                intersection[item] = True

                        union[item] = True

                for item in reads_b:
                        union[item] = True

                jaccard_similarity = 100.0 * len(intersection) / len(union)

                # if we do have some overlap, then we store this edge
                if (jaccard_similarity > 0):
                        edges[order(contig_a, contig_b)] = jaccard_similarity

                # we display some information about our progress
                c += 1
                p = 100.0 * c / n
                if (p % 5 == 0):
                        print "%d%% done" % p

# we now save those edges and their annotations (Jaccard similarity)
network_fh = open(network_fn, 'w')
edges_annotations_fh = open(edges_annotations_fn, 'w')
print >>edges_annotations_fh, "shared_reads"

nodes = {}
for (contig_a, contig_b) in sorted(edges.keys()):
        print >>network_fh, "%s shared_reads    %s" % (contig_a["name"], contig_b["name"])
        print >>edges_annotations_fh, "%s (shared_reads) %s = %s" % (contig_a["name"], contig_b["name"], edges[(contig_a, contig_b)])

        nodes[contig_a] = True
        nodes[contig_b] = True

# finally, we store the nodes annotations (contigs' length)
vertices_annotations_fh = open(vertices_annotations_fn, 'w')
print >>vertices_annotations_fh, "contig_length"

for contig in sorted(nodes.keys()):
        print >>vertices_annotations_fh, "%s = %s" % (contig["name"], contig["length"])

This code is also available as a download.

Once this script is executed the resulting network.sif file can be loaded into Cytoscape, and the graph decorated using the annotations stored in edges_annotations.txt and vertices_annotations.txt.

Previous topic

Creating a graph of collections

Next topic

Querying BLAST results