dsegmenter.mateseg

Package providing discourse segmenter for Mate dependency graphs.

dsegmenter.mateseg.__all__

List[str] – list of sub-modules exported by this package

dsegmenter.mateseg.__author__

str – package’s author

dsegmenter.mateseg.__email__

str – email of package’s author

dsegmenter.mateseg.__name__

str – package’s name

dsegmenter.mateseg.__version__

str – package version

class dsegmenter.mateseg.DependencyGraph(tree_str=None, cell_extractor=None, zero_based=False, cell_separator=None, top_relation_label=u'ROOT')[source]
address_span(start_address)[source]

returns the addresses of nodes (im)mediately depending on the given starting address in a dependency graph, except for the root node
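
The span computation described above can be illustrated with a minimal sketch over a plain dictionary. This is not the package's implementation: the head-map layout, the use of address 0 for the artificial root, and the name `address_span_sketch` are assumptions made for this example. The description does not say whether the start address itself is part of the result; the sketch includes it.

```python
def address_span_sketch(heads, start_address):
    """Collect addresses that depend directly or indirectly on start_address.

    `heads` maps each token address to the address of its head. Address 0
    stands for the artificial root and is never returned (assumption).
    """
    # Invert the head map into a head -> dependents map.
    children = {}
    for address, head in heads.items():
        children.setdefault(head, []).append(address)

    span = []
    stack = [start_address]
    while stack:
        address = stack.pop()
        if address != 0:
            span.append(address)
        stack.extend(children.get(address, []))
    return sorted(span)


# "Peter sleeps because he is tired" with one head per token address.
heads = {1: 2, 2: 0, 3: 6, 4: 6, 5: 6, 6: 2}
print(address_span_sketch(heads, 6))
print(address_span_sketch(heads, 0))
```

Sorting the collected addresses gives the linear token order, which is what a word-level variant such as token_span would need before looking up the word forms.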

annotate(iterable, field_name)[source]

annotate the nodes (excluding the artificial root) with an additional non-standard field; the values are provided by an iterable in linear order corresponding to the node order

deannotate(field_name)[source]

remove annotations of an additional non-standard field
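
To make the annotate/deannotate pair concrete, here is a minimal sketch on a dictionary of node dictionaries. The node layout, the field name `segment_label`, and both function names are hypothetical; the sketch only mirrors the behaviour described above (values assigned in linear order, root skipped).

```python
def annotate_sketch(nodes, values, field_name):
    """Attach one value per token, skipping the artificial root at address 0."""
    token_addresses = sorted(a for a in nodes if a != 0)
    for address, value in zip(token_addresses, values):
        nodes[address][field_name] = value


def deannotate_sketch(nodes, field_name):
    """Remove the extra field from every node that carries it."""
    for node in nodes.values():
        node.pop(field_name, None)


nodes = {
    0: {"word": None},       # artificial root (assumed to sit at address 0)
    1: {"word": "Peter"},
    2: {"word": "sleeps"},
}
annotate_sketch(nodes, ["B", "I"], "segment_label")
print(nodes[1]["segment_label"], nodes[2]["segment_label"])
deannotate_sketch(nodes, "segment_label")
print("segment_label" in nodes[1])
```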

get_dependencies_simple(address)[source]

returns a sorted list of the addresses of all dependencies of the node at the specified address

is_valid_parse_tree()[source]

check structural integrity of the parse; for the moment just check for a unique root
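
A unique-root check of this kind might look like the sketch below. It assumes that "unique root" means exactly one token is attached directly to the artificial root at address 0; the head-map representation and the name `has_unique_root` are illustrative, not taken from the package.

```python
def has_unique_root(heads):
    """Return True if exactly one token has the artificial root (0) as its head."""
    return sum(1 for head in heads.values() if head == 0) == 1


print(has_unique_root({1: 2, 2: 0, 3: 2}))   # one token attached to the root
print(has_unique_root({1: 0, 2: 0, 3: 2}))   # two tokens attached to the root
```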

length()[source]

returns the length in tokens, i.e. the number of nodes excluding the artificial root

subgraphs(exclude_root=False)[source]

yields all nodes in linear order

token_span(start_address=0)[source]

returns the words (im)mediately depending on the given address in a dependency graph in correct linear order, except for the root node

words()[source]

yields all words except the implicit root node in linear order

class dsegmenter.mateseg.MateSegmenter(featgen=<function gen_features_for_segment>, model=u'/home/sidorenko/Projects/DiscourseSegmenter/dsegmenter/mateseg/data/mate.model')[source]

Class for performing discourse segmentation on Mate dependency trees.

DEFAULT_CLASSIFIER = LinearSVC(C=1.0, class_weight=u'balanced', dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class=u'ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)
DEFAULT_MODEL = u'/home/sidorenko/Projects/DiscourseSegmenter/dsegmenter/mateseg/data/mate.model'
DEFAULT_PIPELINE = Pipeline(steps=[(u'vectorizer', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)), (u'var_filter', VarianceThreshold(threshold=0.0)), (u'classifier', LinearSVC(C=1.0, class_weight=u'balanced', dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class=u'ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))])
extract_features_from_text(dep_forest, seg_forest=None)[source]

Extract features from dependency trees.

Parameters:
  • dep_forest (list or None) – list of sentence trees to be parsed
  • seg_forest (list or None) – list of discourse segments
Returns:

list of features and list of labels

Return type:

2-tuple[list, list]

segment(a_trees)[source]

Create discourse segments based on the Mate trees.

Parameters: a_trees (list) – list of sentence trees to be parsed
Returns: constructed segment trees
Return type: list

segment_text(dep_forest)[source]

Segment all sentences of a text.

Parameters: dep_forest (list[dsegmenter.mateseg.dependency_graph]) – list of sentence trees to be parsed
Returns: constructed segment trees
Return type: list

test(trees, segments)[source]

Estimate performance of segmenter model.

Parameters:
  • trees (list) – dependency trees
  • segments (list) – corresponding gold segments for trees
Returns:

macro and micro-averaged F-scores

Return type:

2-tuple
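
For readers unfamiliar with the two averages, the following sketch computes macro- and micro-averaged F1 from two label sequences using the standard definitions. How the package itself derives the scores (label set, averaging details) is not stated here, so the function `f_scores` and the labels `SEG` and `NONE` are assumptions for illustration only.

```python
from collections import Counter


def f_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for two equally long label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2.0 * true_pos / denom if denom else 0.0

    labels = set(gold) | set(predicted)
    if not labels:
        return 0.0, 0.0
    # Macro: average the per-label F1 values.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["SEG", "NONE", "NONE", "SEG"]
pred = ["SEG", "NONE", "SEG", "SEG"]
macro, micro = f_scores(gold, pred)
print(round(macro, 4), round(micro, 4))
```

In this single-label setting the micro-averaged F1 coincides with plain accuracy, while the macro average weights rare and frequent labels equally.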

train(trees, segments, path=None)[source]

Train segmenter model.

Parameters:
  • trees (list) – dependency trees
  • segments (list) – discourse segments
  • path (str) – path to file in which the trained model should be stored
Return type:

None

dsegmenter.mateseg.read_trees(a_lines)[source]

Read file and yield DependencyGraphs.

Parameters: a_lines (list[str]) – iterable over decoded lines of the input file
Yields: nltk.parse.dependencygraph.DependencyGraph
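
The first step of such a reader is typically to split the input into per-sentence blocks. The sketch below assumes a CoNLL-style layout in which sentences are separated by blank lines; the column layout in the sample data is simplified and hypothetical, and the helper `iter_sentence_blocks` is not part of the package. Building the actual graph objects from each block is left out.

```python
def iter_sentence_blocks(lines):
    """Yield lists of non-empty lines, one list per sentence."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current sentence.
            yield block
            block = []
    if block:
        # The last sentence may lack a trailing blank line.
        yield block


lines = [
    "1\tPeter\tNE\t2\tSB\n",
    "2\tsleeps\tVVFIN\t0\tROOT\n",
    "\n",
    "1\tGood\tADJD\t0\tROOT\n",
]
blocks = list(iter_sentence_blocks(lines))
print(len(blocks))
```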
dsegmenter.mateseg.read_tok_trees(a_lines)[source]

Read file and return a mapping from tokens to trees and a list of trees.

Parameters: a_lines (list[str]) – decoded lines of the input file
Returns: list of dictionaries mapping tokens to trees and a list of trees
Return type: 2-tuple

dsegmenter.mateseg.trees2segs(a_toks2trees, a_toks2segs)[source]

Align trees with corresponding segments.

Parameters:
  • a_toks2trees (dict) – mapping from tokens to trees
  • a_toks2segs (dict) – mapping from tokens to segments
Returns:

mapping from trees to segments

Return type:

dict
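
The alignment idea can be sketched as a join over the shared token keys. The key format (tuples of token strings), the string stand-ins for trees and segments, and the function `align_by_tokens` are all assumptions for this example; the package may handle partial overlaps or different key types in ways not shown here.

```python
def align_by_tokens(toks2trees, toks2segs):
    """Pair each tree with the segment stored under the same token key."""
    trees2segs = {}
    for tokens, tree in toks2trees.items():
        # Trees whose tokens have no matching segment are skipped.
        if tokens in toks2segs:
            trees2segs[tree] = toks2segs[tokens]
    return trees2segs


toks2trees = {("Peter", "sleeps"): "tree-1", ("Good",): "tree-2"}
toks2segs = {("Peter", "sleeps"): "segment-A"}
print(align_by_tokens(toks2trees, toks2segs))
```

Because the result is keyed by tree, the tree objects must be hashable; plain strings stand in for them in this sketch.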