dsegmenter.mateseg

Package providing discourse segmenter for Mate dependency graphs.

dsegmenter.mateseg.__all__: List[str] – list of sub-modules exported by this package

dsegmenter.mateseg.__author__: str – package’s author

dsegmenter.mateseg.__email__: str – email of package’s author

dsegmenter.mateseg.__name__: str – package’s name

dsegmenter.mateseg.__version__: str – package version

class dsegmenter.mateseg.DependencyGraph(tree_str=None, cell_extractor=None, zero_based=False, cell_separator=None, top_relation_label=u'ROOT')

address_span(start_address): returns the addresses of all nodes that depend, directly or indirectly, on the given starting address in a dependency graph, except for the root node

annotate(iterable, field_name): annotates the nodes (excluding the artificial root) with an additional non-standard field; the values are taken from an iterable whose linear order corresponds to the node order

deannotate(field_name): removes the annotations of an additional non-standard field

get_dependencies_simple(address): returns a sorted list of the addresses of all dependencies of the node at the specified address

is_valid_parse_tree(): checks the structural integrity of the parse; for the moment, it only checks for a unique root

length(): returns the length in tokens, i.e. the number of nodes excluding the artificial root

subgraphs(exclude_root=False): yields all nodes in linear order

token_span(start_address=0): returns the words that depend, directly or indirectly, on the given address in a dependency graph, in correct linear order, except for the root node

words(): yields all words except the implicit root node in linear order
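
To illustrate what address_span and token_span describe, here is a minimal, self-contained sketch that does not use dsegmenter itself. The helper span_addresses and the heads and words dictionaries are hypothetical stand-ins for a DependencyGraph. The descriptions above do not say whether the starting node is part of the span; this sketch assumes it is.

```python
def span_addresses(heads, start):
    """Collect `start` plus every node that depends on it, directly or
    indirectly. Address 0 (the artificial root) is never returned.
    Assumes `heads` describes a tree (no cycles)."""
    span = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node != 0:
            span.add(node)
        # Push all direct dependents of the current node.
        stack.extend(dep for dep, head in heads.items() if head == node)
    return sorted(span)


# Hypothetical toy sentence: "Peter sleeps because he is tired"
words = {1: "Peter", 2: "sleeps", 3: "because", 4: "he", 5: "is", 6: "tired"}
# heads[address] = address of the governing node; 0 is the artificial root.
heads = {1: 2, 2: 0, 3: 5, 4: 5, 5: 2, 6: 5}

clause = span_addresses(heads, 5)
print(clause)                      # addresses of the subordinate clause
print([words[a] for a in clause])  # the same span as words, in linear order
```

Starting from address 0 in this sketch returns every token address, which mirrors the start_address=0 default of token_span.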

class dsegmenter.mateseg.MateSegmenter(featgen=<function gen_features_for_segment>, model=u'/home/sidorenko/Projects/DiscourseSegmenter/dsegmenter/mateseg/data/mate.model')

Class for performing discourse segmentation on Mate dependency trees.

DEFAULT_CLASSIFIER = LinearSVC(C=1.0, class_weight=u'balanced', dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class=u'ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0)

DEFAULT_MODEL = u'/home/sidorenko/Projects/DiscourseSegmenter/dsegmenter/mateseg/data/mate.model'

DEFAULT_PIPELINE = Pipeline(steps=[(u'vectorizer', DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sort=True, sparse=True)), (u'var_filter', VarianceThreshold(threshold=0.0)), (u'classifier', LinearSVC(C=1.0, class_weight=u'balanced', dual=True, fit_intercept=True, intercept_scaling=1, loss='squared_hinge', max_iter=1000, multi_class=u'ovr', penalty='l2', random_state=None, tol=0.0001, verbose=0))])
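
The pipeline shown in the repr above can be rebuilt by hand from its three steps. The following is a minimal sketch for a current scikit-learn on Python 3. It sets only C and class_weight explicitly and omits the remaining parameters shown in the repr. The feature dictionaries and the SEG/NONE labels are invented for illustration and are not the features produced by gen_features_for_segment.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Same three steps and step names as in the DEFAULT_PIPELINE repr.
pipeline = Pipeline([
    ("vectorizer", DictVectorizer()),                  # dicts -> sparse matrix
    ("var_filter", VarianceThreshold(threshold=0.0)),  # drop constant features
    ("classifier", LinearSVC(C=1.0, class_weight="balanced")),
])

# Hypothetical training data: one feature dict and one label per token.
features = [
    {"pos": "VVFIN", "deprel": "ROOT"},
    {"pos": "NN", "deprel": "SB"},
    {"pos": "VVFIN", "deprel": "MO"},
    {"pos": "ART", "deprel": "NK"},
]
labels = ["SEG", "NONE", "SEG", "NONE"]

pipeline.fit(features, labels)
predictions = pipeline.predict([{"pos": "VVFIN", "deprel": "ROOT"}])
print(predictions)
```

DictVectorizer one-hot encodes string-valued features, so categorical values such as part-of-speech tags can be passed in directly.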

extract_features_from_text(dep_forest, seg_forest=None)

Extract features from dependency trees.

Parameters:	dep_forest (list) – list of sentence trees to be parsed; seg_forest (list or None) – list of discourse segments
Returns:	list of features and list of labels
Return type:	2-tuple[list, list]

segment(a_trees)

Create discourse segments based on the Mate trees.

Parameters:	a_trees (list) – list of sentence trees to be parsed
Returns:	constructed segment trees
Return type:	list

segment_text(dep_forest)

Segment all sentences of a text.

Parameters:	dep_forest (list[dsegmenter.mateseg.dependency_graph]) – list of sentence trees to be parsed
Returns:	constructed segment trees
Return type:	list

test(trees, segments)

Estimate performance of segmenter model.

Parameters:	trees (list) – BitPar trees; segments (list) – corresponding gold segments for trees
Returns:	macro and micro-averaged F-scores
Return type:	2-tuple
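
As a reminder of how macro- and micro-averaged F-scores differ, here is a minimal pure-Python sketch. It is not the computation used inside test(), which this page does not document. The function name f_scores and the label sequences are hypothetical.

```python
from collections import Counter


def f_scores(gold, predicted):
    """Return (macro_f1, micro_f1) for two equally long label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was missed

    def f1(true_pos, false_pos, false_neg):
        denom = 2 * true_pos + false_pos + false_neg
        return 2 * true_pos / denom if denom else 0.0

    labels = set(gold) | set(predicted)
    if not labels:
        return 0.0, 0.0
    # Macro: average the per-label F1 scores, so each label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


gold = ["SEG", "NONE", "NONE", "NONE"]
predicted = ["SEG", "SEG", "NONE", "NONE"]
macro_f1, micro_f1 = f_scores(gold, predicted)
print(round(macro_f1, 4), round(micro_f1, 4))
```

Macro averaging gives the rarer label the same weight as the frequent one, which is why the two scores diverge on imbalanced data.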

train(trees, segments, path=None)

Train segmenter model.

Parameters:	trees (list) – BitPar trees; segments (list) – discourse segments; path (str) – path to file in which the trained model should be stored
Returns:	nothing
Return type:	None

dsegmenter.mateseg.read_trees(a_lines)

Read file and yield DependencyGraphs.

Parameters:	a_lines (list[str]) – iterable over decoded lines of the input file
Yields:	nltk.parse.dependencygraph.DependencyGraph
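
The generator pattern behind a reader of this kind can be sketched without the library. The example below assumes a CoNLL-style input with one token per line and a blank line between sentences; this page does not specify the input format, so treat that as an assumption. The helper iter_sentence_blocks is hypothetical and yields raw line blocks where read_trees yields DependencyGraph objects.

```python
def iter_sentence_blocks(lines):
    """Yield one list of token lines per sentence from an iterable of
    decoded lines; blank lines act as sentence separators."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            block.append(line)
        elif block:
            # A blank line closes the current sentence.
            yield block
            block = []
    if block:
        # The input may end without a trailing blank line.
        yield block


# Hypothetical input; each token line holds tab-separated columns.
sample = [
    "1\tPeter\tNE\t2\tSB\n",
    "2\tschläft\tVVFIN\t0\tROOT\n",
    "\n",
    "1\tJa\tPTKANT\t0\tROOT\n",
]
blocks = list(iter_sentence_blocks(sample))
print(len(blocks))
```

Because the function is a generator, it can consume a file object lazily instead of loading the whole input into memory.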

dsegmenter.mateseg.read_tok_trees(a_lines)

Read file and return a mapping from tokens to trees and a list of trees.

Parameters:	a_lines (list[str]) – decoded lines of the input file
Returns:	list of dictionaries mapping tokens to trees and a list of trees
Return type:	2-tuple

dsegmenter.mateseg.trees2segs(a_toks2trees, a_toks2segs)

Align trees with corresponding segments.

Parameters:	a_toks2trees (dict) – mapping from tokens to trees; a_toks2segs (dict) – mapping from tokens to segments
Returns:	mapping from trees to segments
Return type:	dict
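
One plausible way to align two token-keyed mappings is a majority vote: each tree is paired with the segment that shares the most tokens with it. This page does not describe the actual alignment strategy of trees2segs, so the sketch below is an assumption. The function name, the (offset, word) token keys, and the string stand-ins for trees and segments are all hypothetical.

```python
from collections import Counter, defaultdict


def align_trees_with_segments(toks2trees, toks2segs):
    """Map each tree to the segment that covers most of its tokens."""
    votes = defaultdict(Counter)
    for tok, tree in toks2trees.items():
        seg = toks2segs.get(tok)
        if seg is not None:
            votes[tree][seg] += 1
    # Pick the most frequent segment for every tree that received votes.
    return {tree: counts.most_common(1)[0][0] for tree, counts in votes.items()}


# Tokens are keyed by (offset, word); trees and segments are plain strings
# here so that they are hashable.
toks2trees = {
    (0, "Peter"): "tree_1", (1, "sleeps"): "tree_1",
    (2, "because"): "tree_2", (3, "he"): "tree_2", (4, "is"): "tree_2",
}
toks2segs = {
    (0, "Peter"): "seg_1", (1, "sleeps"): "seg_1", (2, "because"): "seg_1",
    (3, "he"): "seg_2", (4, "is"): "seg_2",
}
aligned = align_trees_with_segments(toks2trees, toks2segs)
print(aligned)
```

Trees whose tokens appear in no segment are left out of the result in this sketch.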