The dg module specifies a DiscourseDocumentGraph, the fundamental data structure used in this package. It is a slightly modified networkx.MultiDiGraph that requires every node and edge to have a layers attribute, which maps to the set of layers (str) the node or edge belongs to.
Bases: networkx.classes.multidigraph.MultiDiGraph
Base class for representing annotated documents as directed graphs with multiple edges.
Add an edge between u and v.
An edge can only be added if the nodes u and v already exist. This decision was taken to ensure that all nodes are associated with at least one (meaningful) layer.
Edge attributes can be specified with keywords or by providing a dictionary with key/value pairs. In contrast to other edge attributes, layers can only be added, not overwritten or deleted.
Notes
To replace/update edge data, use the optional key argument to identify a unique edge. Otherwise a new edge will be created.
NetworkX algorithms designed for weighted graphs cannot use multigraphs directly because it is not clear how to handle multiedge weights. Convert to Graph using edge attribute ‘weight’ to enable weighted graph algorithms.
Examples
>>> from discoursegraphs import DiscourseDocumentGraph
>>> d = DiscourseDocumentGraph()
>>> d.add_nodes_from([(1, {'layers':{'token'}, 'word':'hello'}), (2, {'layers':{'token'}, 'word':'world'})])
>>> d.edges(data=True)
[]
>>> d.add_edge(1, 2, layers={'generic'})
>>> d.add_edge(1, 2, layers={'tokens'}, weight=0.7)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
(1, 2, {'layers': {'tokens'}, 'weight': 0.7})]
>>> d.edge[1][2]
{0: {'layers': {'generic'}}, 1: {'layers': {'tokens'}, 'weight': 0.7}}
>>> d.add_edge(1, 2, layers={'tokens'}, key=1, weight=1.0)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
(1, 2, {'layers': {'tokens'}, 'weight': 1.0})]
>>> d.add_edge(1, 2, layers={'foo'}, key=1, weight=1.0)
>>> d.edges(data=True)
[(1, 2, {'layers': {'generic'}}),
(1, 2, {'layers': {'foo', 'tokens'}, 'weight': 1.0})]
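The update behaviour shown in the last two examples (ordinary attributes are overwritten, layers are only ever added) can be sketched in plain Python. This is a minimal illustration, not the package's implementation; the helper name merge_edge_attrs is hypothetical.

```python
def merge_edge_attrs(existing, new):
    """Hypothetical helper: merge `new` edge attributes into `existing`.

    Ordinary attributes are overwritten; 'layers' is unioned, never replaced.
    """
    merged = dict(existing)
    for key, value in new.items():
        if key == 'layers':
            # layers can only grow: take the union of old and new layers
            merged['layers'] = set(existing.get('layers', set())) | set(value)
        else:
            merged[key] = value
    return merged


edge = {'layers': {'tokens'}, 'weight': 0.7}
edge = merge_edge_attrs(edge, {'layers': {'foo'}, 'weight': 1.0})
print(sorted(edge['layers']), edge['weight'])  # ['foo', 'tokens'] 1.0
```

Taking the union rather than assigning is what keeps an edge from ever losing a layer it was given earlier.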
Add all the edges in ebunch.
Notes
Adding the same edge twice has no effect but any edge data will be updated when each duplicate edge is added.
An edge can only be added if the source and target nodes are already present in the graph. This decision was taken to ensure that all edges are associated with at least one (meaningful) layer.
Edge attributes specified in edges as a tuple (in ebunch) take precedence over attributes specified otherwise (in attr_dict or attr). Layers can only be added (via a ‘layers’ edge attribute), but not overwritten.
Examples
>>> d = DiscourseDocumentGraph()
>>> d.add_node(1, {'int'})
>>> d.add_node(2, {'int'})
>>> d.add_edges_from([(1, 2, {'layers': {'int'}, 'weight': 23})])
>>> d.add_edges_from([(1, 2, {'layers': {'int'}, 'weight': 42})])
>>> d.edges(data=True) # multiple edges between the same nodes
[(1, 2, {'layers': {'int'}, 'weight': 23}),
(1, 2, {'layers': {'int'}, 'weight': 42})]
Associate data to edges
We update the existing edge (key=0) and overwrite its ‘weight’ value. Note that we can’t overwrite the ‘layers’ value, though. Instead, the new layers are added to the set of existing layers:
>>> d.add_edges_from([(1, 2, 0, {'layers':{'number'}, 'weight':66})])
>>> d.edges(data=True)
[(1, 2, {'layers': {'int', 'number'}, 'weight': 66}),
(1, 2, {'layers': {'int'}, 'weight': 42})]
Add a single node n and update node attributes.
Examples
>>> from discoursegraphs import DiscourseDocumentGraph
>>> d = DiscourseDocumentGraph()
>>> d.add_node(1, {'node'})
Adding the same node with a different layer:
>>> d.add_node(1, {'number'})
>>> d.nodes(data=True)
[(1, {'layers': {'node', 'number'}})]
Use keywords to set/change node attributes:
>>> d.add_node(1, {'node'}, size=10)
>>> d.add_node(3, layers={'num'}, weight=0.4, UTM=('13S',382))
>>> d.nodes(data=True)
[(1, {'layers': {'node', 'number'}, 'size': 10}),
(3, {'UTM': ('13S', 382), 'layers': {'num'}, 'weight': 0.4})]
Notes
A hashable object is one that can be used as a key in a Python dictionary. This includes strings, numbers, tuples of strings and numbers, etc.
On many platforms hashable items also include mutables such as NetworkX Graphs, though one should be careful that the hash doesn’t change on mutables.
Add multiple nodes.
Examples
>>> d.add_nodes_from([(1, {'layers':{'token'}, 'word':'hello'}), (2, {'layers':{'token'}, 'word':'world'})])
>>> d.nodes(data=True)
[(1, {'layers': {'token'}, 'word': 'hello'}),
(2, {'layers': {'token'}, 'word': 'world'})]
Use keywords to update specific node attributes for every node.
>>> d.add_nodes_from(d.nodes(data=True), weight=1.0)
>>> d.nodes(data=True)
[(1, {'layers': {'token'}, 'weight': 1.0, 'word': 'hello'}),
(2, {'layers': {'token'}, 'weight': 1.0, 'word': 'world'})]
Use (node, attrdict) tuples to update attributes for specific nodes.
>>> d.add_nodes_from([(1, {'layers': {'tiger'}})], size=10)
>>> d.nodes(data=True)
[(1, {'layers': {'tiger', 'token'}, 'size': 10, 'weight': 1.0, 'word': 'hello'}),
(2, {'layers': {'token'}, 'weight': 1.0, 'word': 'world'})]
The merging module combines several document graphs into one. So far, it is able to merge rhetorical structure theory (RS3), syntax (TigerXML) and anaphora (ad-hoc format) annotations of the same document.
adds an AnaphoraDocumentGraph to a TigerDocumentGraph, thereby adding information about the anaphoricity of words (e.g. ‘das’, ‘es’) to the respective (Tiger) tokens.
adds an RSTGraph to a TigerDocumentGraph, thereby adding edges from each RST segment to the (Tiger) tokens they represent.
creates a map from anaphoricity token node IDs to tiger token node IDs.
Parameters: |
|
---|---|
Returns: | anaphora2tiger – map from anaphoricity token node IDs (int) to tiger token node IDs (str, e.g. ‘s23_5’) |
Return type: | dict |
This is a slightly modified version of networkx.relabel. The only difference between the two versions is that this one supports the layers attribute (which each node and edge in a DiscourseDocumentGraph must have).
Return a copy of the graph G with the nodes relabeled with integers.
Notes
Node and edge attribute data are copied to the new (relabeled) graph.
Relabel the nodes of the graph G.
Examples
>>> G=nx.path_graph(3) # nodes 0-1-2
>>> mapping={0:'a',1:'b',2:'c'}
>>> H=nx.relabel_nodes(G,mapping)
>>> print(sorted(H.nodes()))
['a', 'b', 'c']
>>> G=nx.path_graph(26) # nodes 0..25
>>> mapping=dict(zip(G.nodes(),"abcdefghijklmnopqrstuvwxyz"))
>>> H=nx.relabel_nodes(G,mapping) # nodes a..z
>>> mapping=dict(zip(G.nodes(),range(1,27)))
>>> G1=nx.relabel_nodes(G,mapping) # nodes 1..26
Partial in-place mapping:
>>> G=nx.path_graph(3) # nodes 0-1-2
>>> mapping={0:'a',1:'b'} # 0->'a' and 1->'b'
>>> G=nx.relabel_nodes(G,mapping, copy=False)
>>> print(G.nodes())
[2, 'b', 'a']
Mapping as function:
>>> G=nx.path_graph(3)
>>> def mapping(x):
... return x**2
>>> H=nx.relabel_nodes(G,mapping)
>>> print(H.nodes())
[0, 1, 4]
Notes
Only the nodes specified in the mapping will be relabeled.
The keyword setting copy=False modifies the graph in place. This is not always possible if the mapping is circular. In that case use copy=True.
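The notes on partial and circular mappings can be illustrated without networkx. The sketch below relabels a plain list of (u, v) edge tuples and always builds a new list, so it mimics copy=True; relabel_edges is a hypothetical helper, not part of networkx or this package.

```python
def relabel_edges(edges, mapping):
    """Hypothetical helper: relabel the endpoints of (u, v) edge tuples.

    `mapping` may be a dict (partial mappings are allowed) or a callable.
    A new list is always returned, i.e. this mimics copy=True.
    """
    if callable(mapping):
        rename = mapping
    else:
        def rename(node):
            # nodes missing from the mapping keep their old label
            return mapping.get(node, node)
    return [(rename(u), rename(v)) for u, v in edges]


path = [(0, 1), (1, 2)]                        # nodes 0-1-2
print(relabel_edges(path, {0: 'a', 1: 'b'}))   # [('a', 'b'), ('b', 2)]
print(relabel_edges(path, lambda x: x ** 2))   # [(0, 1), (1, 4)]
print(relabel_edges(path, {0: 1, 1: 0}))       # [(1, 0), (0, 2)]
```

The last call uses a circular mapping (0 and 1 swap labels). Building a copy handles it cleanly because every lookup uses the original labels, whereas renaming in place could collide with a label that has not been moved yet.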
This module contains a number of helper functions.
Tests whether the input is str or unicode. If it is str, it is decoded from UTF-8 to unicode.
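The package targets Python 2, where str holds bytes and unicode holds text. A rough Python 3 analogue of the check described above might look like the following sketch. The name ensure_text is hypothetical, and raising TypeError for other input types is an assumption of this sketch; the source does not say what happens in that case.

```python
def ensure_text(value, encoding='utf-8'):
    """Hypothetical Python 3 analogue: return text, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, str):
        return value
    # assumption: reject anything that is neither bytes nor text
    raise TypeError('expected bytes or str, got %s' % type(value).__name__)


# UTF-8 encoded bytes are decoded, text passes through unchanged
print(ensure_text(b'gr\xc3\xbc\xc3\x9fen') == 'gr\u00fc\u00dfen')  # True
print(ensure_text('hello') == 'hello')                            # True
```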
returns a key that can be used in sort functions.
Example:
>>> items = ['A99', 'a1', 'a2', 'a10', 'a24', 'a12', 'a100']
The normal sort function will ignore the natural order of the integers in the string:
>>> print sorted(items)
['A99', 'a1', 'a10', 'a100', 'a12', 'a2', 'a24']
When we use this function as a key to the sort function, the natural order of the integer is considered.
>>> print sorted(items, key=natural_sort_key)
['A99', 'a1', 'a2', 'a10', 'a12', 'a24', 'a100']
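One common way to build such a key is to split each string into digit and non-digit runs and compare the digit runs as integers. The sketch below shows that technique; it is not necessarily how natural_sort_key is implemented in this package, and the name natural_key is hypothetical.

```python
import re


def natural_key(text):
    """Hypothetical stand-in: split `text` into digit and non-digit runs."""
    # re.split with a capturing group keeps the digit runs in the result,
    # so 'a10' becomes ['a', '10', ''] and then ['a', 10, '']
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', text)]


items = ['A99', 'a1', 'a2', 'a10', 'a24', 'a12', 'a100']
print(sorted(items, key=natural_key))
# ['A99', 'a1', 'a2', 'a10', 'a12', 'a24', 'a100']
```

Because the split always alternates non-digit and digit runs, starting with a (possibly empty) non-digit run, two keys never compare a string against an integer at the same position.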
The readwrite package contains importers, exporters and other output functionality. Basically, it allows you to convert annotated linguistic documents into a graph-based representation for further processing.
The anaphoricity module parses Christian Dittrich’s anaphoricity annotation ad-hoc format into a document graph.
Bases: discoursegraphs.dg.DiscourseDocumentGraph
represents, as a graph, a text in which abstract anaphora were annotated.
a list of node IDs (int) which represent the tokens in the order they occur in the text
name of the document root node ID (default: ‘anaphoricity:root_node’)
The neo4j module converts a DiscourseDocumentGraph into a Geoff string and/or exports it to a running Neo4j graph database.
Parameters: | discoursegraph (DiscourseDocumentGraph) – the discourse document graph to be converted into GEOFF format |
---|---|
Returns: | geoff – a geoff string representation of the discourse graph. |
Return type: | string |
typecasts all layers sets to lists to make the graph convertible into geoff format.
Parameters: | discoursegraph (DiscourseDocumentGraph) – the discourse document graph to be uploaded to the local neo4j instance |
---|---|
Returns: | neonx_results – list of results from the write_to_neo function of neonx. |
Return type: | list of dict |
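The typecasting of layers sets to lists mentioned for this module is needed because Python sets have no direct JSON-like representation. A minimal sketch on a single attribute dict follows. The helper name layers_to_lists and the example layer names are hypothetical, and sorting the resulting lists is a choice made here for reproducible output, not something the source states.

```python
def layers_to_lists(attrs):
    """Hypothetical helper: copy an attribute dict, turning sets into sorted lists."""
    return {key: sorted(value) if isinstance(value, (set, frozenset)) else value
            for key, value in attrs.items()}


node_attrs = {'layers': {'tiger', 'tiger:token'}, 'tiger:word': 'das'}
print(layers_to_lists(node_attrs))
# {'layers': ['tiger', 'tiger:token'], 'tiger:word': 'das'}
```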
This module converts an RS3 XML file (used by RSTTool to annotate rhetorical structure) into a networkx-based directed graph (DiscourseDocumentGraph).
Bases: discoursegraphs.dg.DiscourseDocumentGraph
A directed graph with multiple edges (based on a networkx MultiDiGraph) that represents the rhetorical structure of a document.
extracts the allowed RST relation names and relation types from an RS3 XML file.
Parameters: | rs3_xml_tree (lxml.etree._ElementTree) – lxml ElementTree representation of an RS3 XML file |
---|---|
Returns: | relations – a dictionary with RST relation names as keys (str) and relation types (either ‘rst’ or ‘multinuc’) as values (str). |
Return type: | dict of (str, str) |
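The relation extraction described above can be sketched as follows. The example assumes that an RS3 file declares its relations as rel elements carrying name and type attributes, and it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra dependencies. The sample document and the helper name relations_from_rs3 are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Minimal RS3-like snippet, made up for illustration.
RS3_SAMPLE = """<rst>
  <header>
    <relations>
      <rel name="antithesis" type="rst"/>
      <rel name="contrast" type="multinuc"/>
    </relations>
  </header>
  <body/>
</rst>"""


def relations_from_rs3(root):
    """Hypothetical helper: map relation names (str) to relation types (str)."""
    return {rel.attrib['name']: rel.attrib['type']
            for rel in root.iter('rel')}


relations = relations_from_rs3(ET.fromstring(RS3_SAMPLE))
print(relations)  # {'antithesis': 'rst', 'contrast': 'multinuc'}
```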
extracts all tokens from an RSTGraph.
Parameters: | rst_graph (RSTGraph) – a directed graph representing an RST tree |
---|---|
Returns: | all_rst_tokens – a list of (unicode, str) tuples, where the first element is the token and the second is the ID of the segment node it belongs to. |
Return type: | list of (unicode, str) tuples |
The tiger module converts a TigerXML file into a networkx-based document graph.
Bases: discoursegraphs.dg.DiscourseDocumentGraph
A directed graph with multiple edges (based on networkx.MultiDiGraph) that represents all the sentences contained in a TigerXML file. A TigerDocumentGraph contains a document root node (whose ID is stored in self.root), which has an outgoing edge to the sentence root nodes of each sentence.
ID of the TigerXML document specified in the ‘id’ attribute of the <corpus> element
the ID of the root node of the document graph
sorted list of all sentence root node IDs (of sentences contained in this document graph)
The attribute dict of each sentence root node contains a key tokens, which maps to a sorted list of token node IDs (str). To print all tokens of a Tiger document, just do:
tdg = TigerDocumentGraph('/path/to/tiger.file')
for sentence_root_node in tdg.sentences:
    for token_node_id in tdg.node[sentence_root_node]['tokens']:
        print tdg.node[token_node_id]['tiger:word']
Bases: discoursegraphs.dg.DiscourseDocumentGraph
A directed graph (based on a networkx.MultiDiGraph) that represents one syntax annotated sentence extracted from a TigerXML file.
node ID of the root node of the sentence
a sorted list of terminal node IDs (i.e. token nodes)
takes a dict (or dict-like object, e.g. etree._Attrib) and adds the given prefix to each key. Always returns a dict (via a typecast).
Parameters: |
|
---|---|
Returns: | prefixed_dict – A dict, in which each key begins with the given prefix. |
Return type: | dict |
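Prefixing the keys of a dict-like object takes a single dict comprehension. The sketch below illustrates the idea; the function name add_prefix_sketch, the order of its parameters and the sample attributes are assumptions, not taken from the package.

```python
def add_prefix_sketch(dict_like, prefix):
    """Hypothetical helper: return a plain dict whose keys all start with `prefix`."""
    # dict(...) performs the typecast, so dict-like inputs are accepted too
    return {prefix + key: value for key, value in dict(dict_like).items()}


attribs = {'id': 's1_1', 'word': 'Das', 'pos': 'ART'}
print(add_prefix_sketch(attribs, 'tiger:'))
# {'tiger:id': 's1_1', 'tiger:word': 'Das', 'tiger:pos': 'ART'}
```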
Takes a TigerSentenceGraph and returns a list of node IDs of unconnected nodes.
A node is unconnected if it has no incoming or outgoing edges. A node is NOT considered unconnected if it is the only node in the graph.
Parameters: | sentence_graph (TigerSentenceGraph) – a directed graph representing one syntax annotated sentence from a TigerXML file |
---|---|
Returns: | unconnected_node_ids – a list of node IDs of unconnected nodes |
Return type: | list of str |
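The definition of an unconnected node, including the single-node exception, can be sketched on plain lists of node IDs and (source, target) edges. This is an illustration of the rule only; the helper unconnected_nodes and the node IDs are hypothetical, and the real function operates on a TigerSentenceGraph.

```python
def unconnected_nodes(nodes, edges):
    """Hypothetical helper: IDs of nodes with no incoming or outgoing edges.

    A graph that consists of a single node has no unconnected nodes.
    """
    nodes = list(nodes)
    if len(nodes) == 1:
        return []
    connected = set()
    for source, target in edges:
        connected.add(source)
        connected.add(target)
    return [node for node in nodes if node not in connected]


print(unconnected_nodes(['s1_500', 's1_1', 's1_2', 's1_3'],
                        [('s1_500', 's1_1'), ('s1_500', 's1_2')]))  # ['s1_3']
print(unconnected_nodes(['s1_1'], []))                              # []
```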
extracts all tokens from a TigerDocumentGraph.
Parameters: | tigerdoc_graph (TigerDocumentGraph) – a directed graph representing a TigerXML file and all the annotated sentences found in it. |
---|---|
Returns: | all_tiger_tokens – a list of (unicode, str, str) tuples, where the first element is the token, the second is the sentence root node ID of the corresponding sentence and the third is the token node ID. |
Return type: | list of (unicode, str, str) tuples |