Segmenters¶

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. This module provides a collection of Segmenter classes that produce hierarchical clusters of tokens (Segment) that can be understood by segment_matcher.

Segmenter: an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequence of Segment and MatchableSegment.
ParagraphsSentencesAndWhitespace: implements a segment() function that clusters tokens into paragraph and sentence MatchableSegment, with whitespace Segment in between.

class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)¶

Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended for use with Western languages, where sentences and paragraphs are meaningful segments of text content.

Tree structure:

* whitespace : Segment
* paragraph : MatchableSegment
    * sentence : MatchableSegment
    * whitespace : Segment

Example:

>>> from deltas import ParagraphsSentencesAndWhitespace, text_split
>>> from deltas.segmenters import print_tree
>>>
>>> a = text_split.tokenize("This comes first.  This comes second.")
>>>
>>> segmenter = ParagraphsSentencesAndWhitespace()
>>> segments = segmenter.segment(a)
>>>
>>> print_tree(segments)
Segment: 'This comes first.  This comes second.'
        MatchableSegment: 'This comes first.  This comes second.'
                MatchableSegment: 'This comes first.'
                Segment: '  '
                MatchableSegment: 'This comes second.'

Parameters:

whitespace : set ( str ): A set of token types that represent whitespace.
paragraph_end : set ( str ): A set of token types that represent the end of a paragraph.
sentence_end : set ( str ): A set of token types that represent the end of a sentence.
min_sentence : int: The minimum number of non-whitespace tokens that a sentence must contain before a sentence_end token is allowed to end it.
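
The interaction between sentence_end and min_sentence can be illustrated with a small standalone sketch. It does not use deltas and is not the library's implementation: the Tok tuple, the token type names ("word", "period", "whitespace") and the split_sentences() helper are all hypothetical, and, unlike the real segmenter, whitespace is simply left inside the sentence that follows it.

```python
from collections import namedtuple

# Hypothetical stand-in for a token: a type label plus its text.
Tok = namedtuple("Tok", ["type", "text"])


def split_sentences(tokens,
                    whitespace=frozenset({"whitespace"}),
                    sentence_end=frozenset({"period"}),
                    min_sentence=2):
    """Group tokens into sentences (lists of Tok).

    A sentence_end token closes the current sentence only if the sentence
    already holds at least `min_sentence` non-whitespace tokens. Otherwise
    the token is kept as ordinary content and the sentence continues.
    """
    sentences, current, content = [], [], 0
    for tok in tokens:
        current.append(tok)
        if tok.type in sentence_end and content >= min_sentence:
            # Enough content has accumulated: honor the sentence end.
            sentences.append(current)
            current, content = [], 0
        elif tok.type not in whitespace:
            # Count every non-whitespace token, including rejected end tokens.
            content += 1
    if current:
        # Trailing tokens without a closing sentence_end form a last sentence.
        sentences.append(current)
    return sentences
```

With the tokens of "Dr. Smith left." and min_sentence=2, the period after "Dr" is not accepted as a sentence end, so the text stays one sentence; with min_sentence=1 the same input splits in two.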

segment(tokens)¶

Segments a sequence of tokens into a sequence of segments.

Parameters:	tokens : list ( `Token` )

class deltas.Segmenter¶

Constructs a token segmentation strategy.

classmethod from_config(config, name, section_key='segmenters')¶: Constructs a segmenter from a configuration doc.

segment(tokens)¶: Segments a sequence of Token into an iterable of Segment.
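
The abstract-base-class pattern described here can be sketched without importing deltas. Everything below is a stand-in written for illustration: SketchSegment, SketchSegmenter and LineSegmenter are hypothetical names, tokens are plain strings, and the real deltas.Segmenter may differ in how it is declared.

```python
from abc import ABC, abstractmethod


class SketchSegment(list):
    """Hypothetical stand-in for a segment: a list of tokens."""


class SketchSegmenter(ABC):
    """Stand-in for a segmentation strategy with one required method."""

    @abstractmethod
    def segment(self, tokens):
        """Return an iterable of segments that together cover `tokens`."""


class LineSegmenter(SketchSegmenter):
    """Toy strategy: close a segment after every newline token."""

    def segment(self, tokens):
        current = SketchSegment()
        for token in tokens:
            current.append(token)
            if token == "\n":
                yield current
                current = SketchSegment()
        if current:
            # Tokens after the last newline form a final segment.
            yield current
```

Calling list(LineSegmenter().segment(["a", "\n", "b"])) groups the tokens into two segments, while instantiating SketchSegmenter directly raises TypeError because segment() is not implemented.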

Segments¶

Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:

deltas.Segment: A segment of text with a start and end index that refers to the original sequence of tokens.
deltas.MatchableSegment: A segment of text that can be matched with another segment no matter where it appears in a document. Generally, segments of this type represent a substantial collection of tokens.

Segment Types¶

class deltas.Segment(*args, **kwargs)¶

end¶: The index of the last deltas.Token in the segment.

tokens()¶: generator : the tokens in this segment
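
Because segments are hierarchical, a tokens() generator has to walk nested segments. The following is a minimal sketch of that idea, not the deltas implementation: TreeSegment is a hypothetical class, tokens are plain strings, and end is computed as the index of the last token, following the wording of the end attribute above.

```python
class TreeSegment:
    """Hypothetical segment whose children are tokens (str) or segments."""

    def __init__(self, start, children):
        self.start = start              # index of the first token
        self.children = list(children)

    def tokens(self):
        """Yield every token in this segment, depth first, in order."""
        for child in self.children:
            if isinstance(child, TreeSegment):
                yield from child.tokens()
            else:
                yield child

    @property
    def end(self):
        # Assumption: `end` is the index of the last token (inclusive).
        return self.start + sum(1 for _ in self.tokens()) - 1


# A paragraph holding two sentences separated by a whitespace segment.
first = TreeSegment(0, ["A", "."])
gap = TreeSegment(2, [" "])
second = TreeSegment(3, ["B", "."])
paragraph = TreeSegment(0, [first, gap, second])
```

Iterating paragraph.tokens() yields the five tokens in document order, regardless of how deeply they are nested.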

class deltas.MatchableSegment(*args, **kwargs)¶: Constructs a segment that can be matched. Segments of this type generally contain important content that might have been copied between different versions of text.
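
One way to match a segment "no matter where it appears" is to key it by its content rather than its position. The sketch below illustrates that idea under stated assumptions: it is not how segment_matcher is implemented, content_key() and find_moved_segments() are hypothetical helpers, and each segment is represented as a plain (start_index, token_texts) tuple.

```python
import hashlib


def content_key(token_texts):
    """Position-independent key: a digest of the segment's text only."""
    return hashlib.sha1("".join(token_texts).encode("utf-8")).hexdigest()


def find_moved_segments(old_segments, new_segments):
    """Pair new segments with old segments that have identical content.

    Each segment is a (start_index, token_texts) tuple. Returns a list of
    (old_start, new_start) pairs, one per matched new segment.
    """
    old_index = {}
    for start, texts in old_segments:
        old_index.setdefault(content_key(texts), []).append(start)

    matches = []
    for start, texts in new_segments:
        candidates = old_index.get(content_key(texts))
        if candidates:
            # Consume each old segment at most once.
            matches.append((candidates.pop(0), start))
    return matches
```

If two sentences swap places between versions, both are still paired with their earlier occurrences, because the key ignores the start index.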
