discoursegraphs Package

discoursegraphs.__init__.main()[source]

dg Module

The dg module specifies a DiscourseDocumentGraph, the fundamental data structure used in this package. It is a slightly modified networkx.MultiDiGraph that requires every node and edge to have a layers attribute, which maps to the set of layers (str) it belongs to.
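
The layers invariant described above can be illustrated without the library. The following is a minimal sketch that uses a plain dictionary as node storage; check_layers and add_node are hypothetical stand-ins written for this illustration, not the package's implementation.

```python
def check_layers(layers):
    """Raise TypeError unless `layers` is a set of str (hypothetical helper)."""
    if not isinstance(layers, set):
        raise TypeError("'layers' must be a set of str")
    if not all(isinstance(layer, str) for layer in layers):
        raise TypeError("every layer must be a str")


def add_node(nodes, node_id, layers, **attr):
    """Add or update a node in a plain dict of node attributes."""
    check_layers(layers)
    data = nodes.setdefault(node_id, {'layers': set()})
    data['layers'] |= layers  # layers accumulate, they are never replaced
    data.update(attr)         # all other attributes are overwritten


nodes = {}
add_node(nodes, 1, {'node'})
add_node(nodes, 1, {'number'}, size=10)
print(sorted(nodes[1]['layers']), nodes[1]['size'])
```

Passing a bare string instead of a set (for example 'token') raises a TypeError in this sketch, which mirrors the idea that layers must always be given as a set of str.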

class discoursegraphs.dg.DiscourseDocumentGraph[source]

Bases: networkx.classes.multidigraph.MultiDiGraph

Base class for representing annotated documents as directed graphs with multiple edges.

TODO list:

  • allow layers to be a single str or set of str
  • allow adding a layer by including it in **attr
  • add a consistency check that would allow adding a node that already exists in the graph, but only if the new node has different attributes (layers can be the same, though)
  • outsource layer assertions to method?
add_edge(u, v, layers, key=None, attr_dict=None, **attr)[source]

Add an edge between u and v.

An edge can only be added if the nodes u and v already exist. This decision was taken to ensure that all nodes are associated with at least one (meaningful) layer.

Edge attributes can be specified with keywords or by providing a dictionary with key/value pairs. In contrast to other edge attributes, layers can only be added, not overwritten or deleted.

Parameters:
  • u,v (nodes) – Nodes can be, for example, strings or numbers. Nodes must be hashable (and not None) Python objects.
  • layers (set of str) – the set of layers the edge belongs to, e.g. {‘tiger:token’, ‘anaphoricity:annotation’}
  • key (hashable identifier, optional (default=lowest unused integer)) – Used to distinguish multiedges between a pair of nodes.
  • attr_dict (dictionary, optional (default= no attributes)) – Dictionary of edge attributes. Key/value pairs will update existing data associated with the edge.
  • attr (keyword arguments, optional) – Edge data (or labels or objects) can be assigned using keyword arguments.

See also

add_edges_from()
add a collection of edges

Notes

To replace/update edge data, use the optional key argument to identify a unique edge. Otherwise a new edge will be created.

NetworkX algorithms designed for weighted graphs cannot use multigraphs directly because it is not clear how to handle multiedge weights. Convert to Graph using edge attribute ‘weight’ to enable weighted graph algorithms.

Examples

>>> from discoursegraphs import  DiscourseDocumentGraph
>>> d = DiscourseDocumentGraph()
>>> d.add_nodes_from([(1, {'layers':{'token'}, 'word':'hello'}),
...                   (2, {'layers':{'token'}, 'word':'world'})])
>>> d.edges(data=True)
[]
>>> d.add_edge(1, 2, layers={'generic'})
>>> d.add_edge(1, 2, layers={'tokens'}, weight=0.7)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
 (1, 2, {'layers': {'tokens'}, 'weight': 0.7})]
>>> d.edge[1][2]
{0: {'layers': {'generic'}}, 1: {'layers': {'tokens'}, 'weight': 0.7}}
>>> d.add_edge(1, 2, layers={'tokens'}, key=1, weight=1.0)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
 (1, 2, {'layers': {'tokens'}, 'weight': 1.0})]
>>> d.add_edge(1, 2, layers={'foo'}, key=1, weight=1.0)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
 (1, 2, {'layers': {'foo', 'tokens'}, 'weight': 1.0})]
add_edges_from(ebunch, attr_dict=None, **attr)[source]

Add all the edges in ebunch.

Parameters:
  • ebunch (container of edges) –

    Each edge given in the container will be added to the graph. The edges can be:

    • 3-tuples (u,v,d) for an edge attribute dict d, or
    • 4-tuples (u,v,k,d) for an edge identified by key k

    Each edge must have a layers attribute (set of str).

  • attr_dict (dictionary, optional (default= no attributes)) – Dictionary of edge attributes. Key/value pairs will update existing data associated with each edge.
  • attr (keyword arguments, optional) – Edge data (or labels or objects) can be assigned using keyword arguments.

See also

add_edge()
add a single edge

Notes

Adding the same edge twice has no effect but any edge data will be updated when each duplicate edge is added.

An edge can only be added if the source and target nodes are already present in the graph. This decision was taken to ensure that all edges are associated with at least one (meaningful) layer.

Edge attributes specified in edges as a tuple (in ebunch) take precedence over attributes specified otherwise (in attr_dict or attr). Layers can only be added (via a ‘layers’ edge attribute), but not overwritten.
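
The precedence rule above can be sketched with plain dictionaries. This is only an illustration of the described behaviour under the assumption that layers are combined by set union; merge_edge_attrs is a hypothetical helper, not part of the package.

```python
def merge_edge_attrs(existing, general, specific):
    """Return the updated attribute dict for a single edge.

    existing: attributes already stored on the edge
    general:  attributes given for all edges (attr_dict / **attr)
    specific: attributes from the edge's own tuple in ebunch
    """
    merged = dict(existing)
    merged.update(general)
    merged.update(specific)  # tuple attributes win over general ones
    # layers are unioned, never replaced
    merged['layers'] = (set(existing.get('layers', ()))
                        | set(general.get('layers', ()))
                        | set(specific.get('layers', ())))
    return merged


existing = {'layers': {'int'}, 'weight': 23}
updated = merge_edge_attrs(existing,
                           general={'weight': 1, 'color': 'red'},
                           specific={'layers': {'number'}, 'weight': 66})
print(updated['weight'], sorted(updated['layers']))
```

Here the weight from the edge tuple (66) overrides the general weight (1), while the new layer is added to the existing one.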

Examples

>>> d = DiscourseDocumentGraph()
>>> d.add_node(1, {'int'})
>>> d.add_node(2, {'int'})
>>> d.add_edges_from([(1, 2, {'layers': {'int'}, 'weight': 23})])
>>> d.add_edges_from([(1, 2, {'layers': {'int'}, 'weight': 42})])
>>> d.edges(data=True) # multiple edges between the same nodes
[(1, 2, {'layers': {'int'}, 'weight': 23}),
 (1, 2, {'layers': {'int'}, 'weight': 42})]

Associate data to edges

We update the existing edge (key=0) and overwrite its ‘weight’ value. Note that we can’t overwrite the ‘layers’ value, though. Instead, the new layers are added to the set of existing layers:

>>> d.add_edges_from([(1, 2, 0, {'layers':{'number'}, 'weight':66})])
>>> d.edges(data=True)
[(1, 2, {'layers': {'int', 'number'}, 'weight': 66}),
 (1, 2, {'layers': {'int'}, 'weight': 42})]
add_node(n, layers, attr_dict=None, **attr)[source]

Add a single node n and update node attributes.

Parameters:
  • n (node) – A node can be any hashable Python object except None.
  • layers (set of str) – the set of layers the node belongs to, e.g. {‘tiger:token’, ‘anaphoricity:annotation’}
  • attr_dict (dictionary, optional (default= no attributes)) – Dictionary of node attributes. Key/value pairs will update existing data associated with the node.
  • attr (keyword arguments, optional) – Set or change attributes using key=value.

See also

add_nodes_from()

Examples

>>> from discoursegraphs import DiscourseDocumentGraph
>>> d = DiscourseDocumentGraph()
>>> d.add_node(1, {'node'})

Adding the same node with a different layer:

>>> d.add_node(1, {'number'})
>>> d.nodes(data=True)
[(1, {'layers': {'node', 'number'}})]

Use keywords to set/change node attributes:

>>> d.add_node(1, {'node'}, size=10)
>>> d.add_node(3, layers={'num'}, weight=0.4, UTM=('13S',382))
>>> d.nodes(data=True)
[(1, {'layers': {'node', 'number'}, 'size': 10}),
 (3, {'UTM': ('13S', 382), 'layers': {'num'}, 'weight': 0.4})]

Notes

A hashable object is one that can be used as a key in a Python dictionary. This includes strings, numbers, tuples of strings and numbers, etc.

On many platforms hashable items also include mutables such as NetworkX Graphs, though one should be careful that the hash doesn’t change on mutables.

add_nodes_from(nodes, **attr)[source]

Add multiple nodes.

Parameters:
  • nodes (iterable container of (node, attribute dict) tuples.) – Node attributes are updated using the attribute dict.
  • attr (keyword arguments, optional (default= no attributes)) – Update attributes for all nodes in nodes. Node attributes specified in nodes as a tuple take precedence over attributes specified generally.

See also

add_node()

Examples

>>> d.add_nodes_from([(1, {'layers':{'token'}, 'word':'hello'}),
...                   (2, {'layers':{'token'}, 'word':'world'})])
>>> d.nodes(data=True)
[(1, {'layers': {'token'}, 'word': 'hello'}),
 (2, {'layers': {'token'}, 'word': 'world'})]

Use keywords to update specific node attributes for every node.

>>> d.add_nodes_from(d.nodes(data=True), weight=1.0)
>>> d.nodes(data=True)
[(1, {'layers': {'token'}, 'weight': 1.0, 'word': 'hello'}),
 (2, {'layers': {'token'}, 'weight': 1.0, 'word': 'world'})]

Use (node, attrdict) tuples to update attributes for specific nodes.

>>> d.add_nodes_from([(1, {'layers': {'tiger'}})], size=10)
>>> d.nodes(data=True)
[(1, {'layers': {'tiger', 'token'}, 'size': 10, 'weight': 1.0, 'word': 'hello'}),
 (2, {'layers': {'token'}, 'weight': 1.0, 'word': 'world'})]

merging Module

The merging module combines several document graphs into one. So far, it is able to merge rhetorical structure theory (RS3), syntax (TigerXML) and anaphora (ad-hoc format) annotations of the same document.

discoursegraphs.merging.add_anaphoricity_to_tiger(tiger_docgraph, anaphora_graph)[source]

adds an AnaphoraDocumentGraph to a TigerDocumentGraph, thereby adding information about the anaphoricity of words (e.g. ‘das’, ‘es’) to the respective (Tiger) tokens.

Parameters:
  • tiger_docgraph (TigerDocumentGraph) – multidigraph representing a syntax annotated (TigerXML) document
  • anaphora_graph (AnaphoraDocumentGraph) – multidigraph representing an anaphoricity-annotated document (ad-hoc format used in Christian Dittrich’s diploma thesis)
discoursegraphs.merging.add_rst_to_tiger(tiger_docgraph, rst_graph)[source]

adds an RSTGraph to a TigerDocumentGraph, thereby adding edges from each RST segment to the (Tiger) tokens they represent.

Parameters:
  • tiger_docgraph (TigerDocumentGraph) – multidigraph representing a syntax annotated (TigerXML) document
  • rst_graph (RSTGraph) – multidigraph representing a RST annotated (RS3) document
discoursegraphs.merging.map_anaphoricity_tokens_to_tiger(tiger_docgraph, anaphora_graph)[source]

creates a map from anaphoricity token node IDs to tiger token node IDs.

Parameters:
  • tiger_docgraph (TigerDocumentGraph) – multidigraph representing a syntax annotated (TigerXML) document
  • anaphora_graph (AnaphoraDocumentGraph) – multidigraph representing an anaphoricity-annotated document (ad-hoc format used in Christian Dittrich’s diploma thesis)
Returns:

anaphora2tiger – map from anaphoricity token node IDs (int) to tiger token node IDs (str, e.g. ‘s23_5’)

Return type:

dict
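
One way such a map could be built is by pairing the tokens of both graphs by position. The sketch below assumes that both annotations tokenise the text identically and in the same order; map_tokens_by_position and its input format are hypothetical and are not taken from the package.

```python
def map_tokens_by_position(anaphora_tokens, tiger_tokens):
    """Map anaphoricity token node IDs to Tiger token node IDs.

    Both arguments are lists of (node_id, word) pairs in text order
    (an assumed input format for this illustration).
    """
    if len(anaphora_tokens) != len(tiger_tokens):
        raise ValueError("token counts differ")
    mapping = {}
    for (ana_id, ana_word), (tiger_id, tiger_word) in zip(anaphora_tokens,
                                                          tiger_tokens):
        if ana_word != tiger_word:
            raise ValueError("token mismatch: %r vs. %r"
                             % (ana_word, tiger_word))
        mapping[ana_id] = tiger_id
    return mapping


anaphora_tokens = [(0, 'Das'), (1, 'ist'), (2, 'gut')]
tiger_tokens = [('s1_1', 'Das'), ('s1_2', 'ist'), ('s1_3', 'gut')]
anaphora2tiger = map_tokens_by_position(anaphora_tokens, tiger_tokens)
print(anaphora2tiger)
```

Failing loudly on a mismatch is a design choice of the sketch: a silent misalignment would attach anaphoricity information to the wrong tokens.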

discoursegraphs.merging.merging_cli()[source]

simple commandline interface of the merging module.

This function is called when you use the discoursegraphs application directly on the command line.

relabel Module

This is a slightly modified version of networkx.relabel. The only difference between the two versions is that this one supports the layers attribute, which each node and edge in a DiscourseDocumentGraph must have.

discoursegraphs.relabel.convert_node_labels_to_integers(G, first_label=0, ordering='default', label_attribute=None)[source]

Return a copy of the graph G with the nodes relabeled with integers.

Parameters:
  • G (graph) – A NetworkX graph
  • first_label (int, optional (default=0)) – An integer specifying the offset in numbering nodes. The n new integer labels are numbered first_label, ..., n-1+first_label.
  • ordering (string) –

    • “default”: inherit node ordering from G.nodes()
    • “sorted”: inherit node ordering from sorted(G.nodes())
    • “increasing degree”: nodes are sorted by increasing degree
    • “decreasing degree”: nodes are sorted by decreasing degree

  • label_attribute (string, optional (default=None)) – Name of node attribute to store old label. If None no attribute is created.

Notes

Node and edge attribute data are copied to the new (relabeled) graph.
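
The numbering scheme of first_label and ordering can be sketched on a plain list of node labels. The helper integer_labels below is hypothetical and only covers the “default” and “sorted” orderings; the degree-based orderings need graph information and are left out.

```python
def integer_labels(nodes, first_label=0, ordering='default'):
    """Return a dict that maps each old label to its new integer label."""
    if ordering == 'sorted':
        nodes = sorted(nodes)
    elif ordering != 'default':
        raise ValueError("unsupported ordering: %r" % ordering)
    # the n nodes are numbered first_label, ..., n-1+first_label
    return {old: new for new, old in enumerate(nodes, start=first_label)}


labels = integer_labels(['b', 'a', 'c'], first_label=1, ordering='sorted')
print(labels)
```

A mapping of this shape is what relabel_nodes() accepts as its mapping argument.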

See also

relabel_nodes()

discoursegraphs.relabel.relabel_nodes(G, mapping, copy=True)[source]

Relabel the nodes of the graph G.

Parameters:
  • G (graph) – A NetworkX graph
  • mapping (dictionary) – A dictionary with the old labels as keys and new labels as values. A partial mapping is allowed.
  • copy (bool (optional, default=True)) – If True return a copy, or if False relabel the nodes in place.

Examples

>>> G=nx.path_graph(3)  # nodes 0-1-2
>>> mapping={0:'a',1:'b',2:'c'}
>>> H=nx.relabel_nodes(G,mapping)
>>> print(sorted(H.nodes()))
['a', 'b', 'c']
>>> G=nx.path_graph(26) # nodes 0..25
>>> mapping=dict(zip(G.nodes(),"abcdefghijklmnopqrstuvwxyz"))
>>> H=nx.relabel_nodes(G,mapping) # nodes a..z
>>> mapping=dict(zip(G.nodes(),range(1,27)))
>>> G1=nx.relabel_nodes(G,mapping) # nodes 1..26

Partial in-place mapping:

>>> G=nx.path_graph(3)  # nodes 0-1-2
>>> mapping={0:'a',1:'b'} # 0->'a' and 1->'b'
>>> G=nx.relabel_nodes(G,mapping, copy=False)

>>> print(G.nodes())
[2, 'b', 'a']

Mapping as function:

>>> G=nx.path_graph(3)
>>> def mapping(x):
...    return x**2
>>> H=nx.relabel_nodes(G,mapping)
>>> print(H.nodes())
[0, 1, 4]

Notes

Only the nodes specified in the mapping will be relabeled.

The keyword setting copy=False modifies the graph in place. This is not always possible if the mapping is circular. In that case use copy=True.
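
Why a circular mapping is easier to handle with a copy can be seen on a plain adjacency dict. The sketch below is not the module's algorithm; relabel_copy is a hypothetical helper that builds a new structure, so a swap such as a→b, b→a never overwrites a node that still has to be renamed.

```python
def relabel_copy(adjacency, mapping):
    """Return a relabeled copy of an adjacency dict {node: set of successors}."""
    def rename(node):
        # nodes that are missing from the mapping keep their label
        return mapping.get(node, node)

    return {rename(node): {rename(succ) for succ in successors}
            for node, successors in adjacency.items()}


graph = {'a': {'b'}, 'b': {'c'}, 'c': set()}
swapped = relabel_copy(graph, {'a': 'b', 'b': 'a'})  # circular mapping
print(swapped)
```

Renaming in place would have to rename 'a' to 'b' while the original 'b' still exists, which is the collision the note above warns about.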

util Module

This module contains a number of helper functions.

discoursegraphs.util.ensure_unicode(str_or_unicode)[source]

Tests whether the input is str or unicode. If it is str, it is decoded from UTF-8 to unicode.
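
The function targets Python 2, where str holds bytes. A rough Python 3 analogue, with bytes in the role of str and str in the role of unicode, might look like the sketch below. The name ensure_text and the TypeError for other input types are assumptions made for this illustration.

```python
def ensure_text(value, encoding='utf-8'):
    """Return `value` as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    # assumption: anything else is treated as an error
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


print(ensure_text(b'M\xc3\xbcller') == ensure_text('M\u00fcller'))
```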

discoursegraphs.util.natural_sort_key(s)[source]

returns a key that can be used in sort functions.

Example:

>>> items = ['A99', 'a1', 'a2', 'a10', 'a24', 'a12', 'a100']

The normal sort function will ignore the natural order of the integers in the string:

>>> print sorted(items)
['A99', 'a1', 'a10', 'a100', 'a12', 'a2', 'a24']

When we use this function as a key to the sort function, the natural order of the integer is considered.

>>> print sorted(items, key=natural_sort_key)
['A99', 'a1', 'a2', 'a10', 'a12', 'a24', 'a100']
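
A key function with this behaviour can be written in a few lines. The following is a minimal sketch, assuming that digit runs are split off with a regular expression and compared as integers; natural_key is a hypothetical name and not necessarily how the package implements natural_sort_key.

```python
import re

DIGITS = re.compile('([0-9]+)')


def natural_key(s):
    """Split a string into text and integer chunks for natural sorting."""
    # re.split with a capturing group alternates text and digit chunks,
    # e.g. 'a10' -> ['a', '10', ''] -> ['a', 10, '']
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in DIGITS.split(str(s))]


items = ['A99', 'a1', 'a2', 'a10', 'a24', 'a12', 'a100']
print(sorted(items, key=natural_key))
# prints ['A99', 'a1', 'a2', 'a10', 'a12', 'a24', 'a100']
```

Because text chunks always sit at even positions and integer chunks at odd positions, two keys never compare a str with an int.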

readwrite Package

The readwrite package contains importers, exporters and other output functionality. Basically, it allows you to convert annotated linguistic documents into a graph-based representation for further processing.

anaphoricity Module

The anaphoricity module parses Christian Dittrich’s anaphoricity annotation ad-hoc format into a document graph.

class discoursegraphs.readwrite.anaphoricity.AnaphoraDocumentGraph(anaphora_filepath, name=None)[source]

Bases: discoursegraphs.dg.DiscourseDocumentGraph

represents a text in which abstract anaphora were annotated as a graph.

tokens list of int

a list of node IDs (int) which represent the tokens in the order they occur in the text

root str

name of the document root node ID (default: ‘anaphoricity:root_node’)

neo4j Module

The neo4j module converts a DiscourseDocumentGraph into a Geoff string and/or exports it to a running Neo4j graph database.

discoursegraphs.readwrite.neo4j.convert_to_geoff(discoursegraph)[source]
Parameters: discoursegraph (DiscourseDocumentGraph) – the discourse document graph to be converted into GEOFF format
Returns: geoff – a geoff string representation of the discourse graph.
Return type: string
discoursegraphs.readwrite.neo4j.make_json_encodable(discoursegraph)[source]

typecasts all layers sets to lists to make the graph convertible into geoff format.
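
The typecast is needed because Python's json module cannot serialise sets. The sketch below shows the idea on a single attribute dict; layers_to_lists is a hypothetical helper, and sorting the layers is an extra step added here to make the output deterministic.

```python
import json


def layers_to_lists(attrs):
    """Return a copy of an attribute dict in which sets become sorted lists."""
    return {key: sorted(value) if isinstance(value, set) else value
            for key, value in attrs.items()}


node_attrs = {'layers': {'tiger:token', 'tiger'}, 'word': 'hello'}
encoded = json.dumps(layers_to_lists(node_attrs), sort_keys=True)
print(encoded)
# prints {"layers": ["tiger", "tiger:token"], "word": "hello"}
```

Calling json.dumps(node_attrs) directly raises a TypeError, since the set under 'layers' has no JSON representation.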

discoursegraphs.readwrite.neo4j.upload_to_neo4j(discoursegraph)[source]
Parameters: discoursegraph (DiscourseDocumentGraph) – the discourse document graph to be uploaded to the local neo4j instance.
Returns: neonx_results – list of results from the write_to_neo function of neonx.
Return type: list of dict

rst Module

This module converts an RS3 XML file (used by RSTTool to annotate rhetorical structure) into a networkx-based directed graph (DiscourseDocumentGraph).

class discoursegraphs.readwrite.rst.RSTGraph(rs3_filepath)[source]

Bases: discoursegraphs.dg.DiscourseDocumentGraph

A directed graph with multiple edges (based on a networkx MultiDiGraph) that represents the rhetorical structure of a document.

__str__()[source]

string representation of an RSTGraph (contains filename, allowed relations and tokenization status).

discoursegraphs.readwrite.rst.extract_relationtypes(rs3_xml_tree)[source]

extracts the allowed RST relation names and relation types from an RS3 XML file.

Parameters: rs3_xml_tree (lxml.etree._ElementTree) – lxml ElementTree representation of an RS3 XML file
Returns: relations – a dictionary with RST relation names as keys (str) and relation types (either ‘rst’ or ‘multinuc’) as values (str).
Return type: dict of (str, str)
discoursegraphs.readwrite.rst.rst_tokenlist(rst_graph)[source]

extracts all tokens from an RSTGraph.

Parameters: rst_graph (RSTGraph) – a directed graph representing an RST tree
Returns: all_rst_tokens – a list of (unicode, str) tuples, where the first element is the token and the second is the ID of the segment node it belongs to.
Return type: list of (unicode, str) tuples
discoursegraphs.readwrite.rst.sanitize_string(string_or_unicode)[source]

remove leading/trailing whitespace and always return unicode.

tiger Module

The tiger module converts a TigerXML file into a networkx-based document graph.

class discoursegraphs.readwrite.tiger.TigerDocumentGraph(tiger_filepath, name=None)[source]

Bases: discoursegraphs.dg.DiscourseDocumentGraph

A directed graph with multiple edges (based on networkx.MultiDiGraph) that represents all the sentences contained in a TigerXML file. A TigerDocumentGraph contains a document root node (whose ID is stored in self.root), which has an outgoing edge to the sentence root nodes of each sentence.

corpus_id str

ID of the TigerXML document specified in the ‘id’ attribute of the <corpus> element

root str

the ID of the root node of the document graph

sentences list of str

sorted list of all sentence root node IDs (of sentences contained in this document graph)

The attribute dict of each sentence root node contains a key tokens, which maps to a sorted list of token node IDs (str). To print all tokens of a Tiger document, just do:

tdg = TigerDocumentGraph('/path/to/tiger.file')
for sentence_root_node in tdg.sentences:
    for token_node_id in tdg.node[sentence_root_node]['tokens']:
        print tdg.node[token_node_id]['tiger:word']
class discoursegraphs.readwrite.tiger.TigerSentenceGraph(sentence)[source]

Bases: discoursegraphs.dg.DiscourseDocumentGraph

A directed graph (based on a networkx.MultiDiGraph) that represents one syntax annotated sentence extracted from a TigerXML file.

root str

node ID of the root node of the sentence

tokens list of str

a sorted list of terminal node IDs (i.e. token nodes)

discoursegraphs.readwrite.tiger.add_prefix(dict_like, prefix)[source]

takes a dict (or dict-like object, e.g. etree._Attrib) and adds the given prefix to each key. Always returns a dict (via a typecast).

Parameters:
  • dict_like (dict (or similar)) – a dictionary or a container that implements .items()
  • prefix (str) – the prefix string to be prepended to each key in the input dict
Returns:

prefixed_dict – A dict, in which each key begins with the given prefix.

Return type:

dict
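
The described behaviour fits in a single dict comprehension. The following sketch is an illustration under that assumption; add_prefix_sketch is a hypothetical name, and the actual function may differ in detail.

```python
def add_prefix_sketch(dict_like, prefix):
    """Return a plain dict whose keys are the input keys with `prefix` prepended."""
    # works for any object that implements .items()
    return {prefix + key: value for key, value in dict_like.items()}


token_attrs = {'word': 'Haus', 'pos': 'NN'}
prefixed = add_prefix_sketch(token_attrs, 'tiger:')
print(prefixed)
```

Prefixed keys of this kind, such as 'tiger:word', keep attributes from different annotation layers apart once several graphs are merged.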

discoursegraphs.readwrite.tiger.get_unconnected_nodes(sentence_graph)[source]

Takes a TigerSentenceGraph and returns a list of node IDs of unconnected nodes.

A node is unconnected if it has no incoming or outgoing edges. A node is NOT considered unconnected if the graph consists of only that node.

Parameters: sentence_graph (TigerSentenceGraph) – a directed graph representing one syntax annotated sentence from a TigerXML file
Returns: unconnected_node_ids – a list of node IDs of unconnected nodes
Return type: list of str
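
The definition of an unconnected node, including the single-node exception, can be sketched on a plain node list and edge list. The helper unconnected_nodes is hypothetical and works on simple Python containers rather than on a TigerSentenceGraph.

```python
def unconnected_nodes(nodes, edges):
    """Return the nodes that have neither incoming nor outgoing edges.

    nodes: iterable of node IDs
    edges: iterable of (source, target) pairs
    """
    nodes = list(nodes)
    if len(nodes) == 1:
        # a graph that consists of a single node has no unconnected nodes
        return []
    connected = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    return [node for node in nodes if node not in connected]


print(unconnected_nodes(['s1_1', 's1_2', 's1_3'], [('s1_1', 's1_2')]))
# prints ['s1_3']
```
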
discoursegraphs.readwrite.tiger.tiger_tokenlist(tigerdoc_graph)[source]

extracts all tokens from a TigerDocumentGraph.

Parameters: tigerdoc_graph (TigerDocumentGraph) – a directed graph representing a TigerXML file and all the annotated sentences found in it.
Returns: all_tiger_tokens – a list of (unicode, str, str) tuples, where the first element is the token, the second is the root node ID of the corresponding sentence and the third is the token node ID.
Return type: list of (unicode, str, str) tuples