Segmenters

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. This module provides a collection of Segmenter implementations that can be used to produce hierarchical clusters of tokens (Segment) that can be understood by segment_matcher.

Segmenter
is an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequence of Segment and MatchableSegment
ParagraphsSentencesAndWhitespace
implements a segment() function that clusters tokens into segments of paragraph and sentence MatchableSegment with whitespace Segment in between.
class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)

Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended to be used in western languages where sentences and paragraphs are meaningful segments of text content.

Example (the output below shows the resulting tree structure):
>>> from deltas import ParagraphsSentencesAndWhitespace, text_split
>>> from deltas.segmenters import print_tree
>>>
>>> a = text_split.tokenize("This comes first.  This comes second.")
>>>
>>> segmenter = ParagraphsSentencesAndWhitespace()
>>> segments = segmenter.segment(a)
>>>
>>> print_tree(segments)
Segment: 'This comes first.  This comes second.'
        MatchableSegment: 'This comes first.  This comes second.'
                MatchableSegment: 'This comes first.'
                Segment: '  '
                MatchableSegment: 'This comes second.'
Parameters:
whitespace : set ( str )

A set of token types that represent whitespace.

paragraph_end : set ( str )

A set of token types that represent the end of a paragraph.

sentence_end : set ( str )

A set of token types that represent the end of a sentence.

min_sentence : int

The minimum number of non-whitespace tokens that a sentence must contain before a sentence_end token will be considered.
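
For example, the segmenter might be customized as in the sketch below. The token type names "whitespace", "break" and "period" are illustrative assumptions, not documented defaults; use the type labels produced by the tokenizer you pair with the segmenter.

>>> from deltas import ParagraphsSentencesAndWhitespace
>>>
>>> segmenter = ParagraphsSentencesAndWhitespace(
...     whitespace={"whitespace"},    # token types treated as whitespace
...     paragraph_end={"break"},      # token types that close a paragraph
...     sentence_end={"period"},      # token types that close a sentence
...     min_sentence=3                # at least 3 non-whitespace tokens per sentence
... )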

segment(tokens)

Segments a sequence of tokens into a sequence of segments.

Parameters:
tokens : list ( Token )
class deltas.Segmenter

Constructs a token segmentation strategy.

classmethod from_config(config, name, section_key='segmenters')

Constructs a segmenter from a configuration doc.
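
A hedged sketch of loading a segmenter from a configuration doc follows. The nested layout and the "class" key are assumptions about the configuration schema; only section_key='segmenters' comes from the signature above.

>>> from deltas import Segmenter
>>>
>>> config = {
...     "segmenters": {
...         "western_psw": {"class": "deltas.ParagraphsSentencesAndWhitespace"}
...     }
... }
>>> segmenter = Segmenter.from_config(config, "western_psw")  # schema assumed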

segment(tokens)

Segments a sequence of Token into an iterable of Segment
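
Since Segmenter is abstract, a subclass only needs to provide segment(). The sketch below treats the whole token sequence as a single matchable segment; it assumes that MatchableSegment accepts an iterable of tokens at construction, which this page does not guarantee.

>>> from deltas import Segmenter, MatchableSegment
>>>
>>> class WholeTextSegmenter(Segmenter):
...     """Sketch: wraps the entire token sequence in one MatchableSegment."""
...     def segment(self, tokens):
...         # Assumes MatchableSegment(tokens) builds a segment from tokens.
...         return [MatchableSegment(list(tokens))]
...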

Segments

Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:

deltas.Segment
A segment of text with a start and end index that refers to the original sequence of tokens.
deltas.MatchableSegment
A segment of text that can be matched with another segment no matter where it appears in a document. Generally, segments of this type represent a substantial collection of tokens.

Segment Types

class deltas.Segment(*args, **kwargs)
end

The index of the last deltas.Token in the segment.

tokens()

generator : the tokens in this segment
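
Continuing the example from earlier on this page, a segment's text and final token index can be recovered as sketched below. This assumes the value returned by segment() is itself a Segment (as the printed tree suggests) and that str(token) yields a token's text.

>>> last_index = segments.end                            # index of the last token
>>> text = "".join(str(t) for t in segments.tokens())    # reassembled segment text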

class deltas.MatchableSegment(*args, **kwargs)

Constructs a segment that can be matched. Segments of this type generally contain important content that might have been copied between different versions of a text.
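
The sketch below illustrates what position-independent matching could look like. It assumes the tree shape shown in the earlier example (a paragraph segment containing sentence and whitespace segments, addressable by list indexing) and that MatchableSegment comparison is based on segment content rather than position; verify both assumptions against the deltas source before relying on them.

>>> from deltas import ParagraphsSentencesAndWhitespace, text_split
>>>
>>> segmenter = ParagraphsSentencesAndWhitespace()
>>> a = segmenter.segment(text_split.tokenize("Keep this sentence.  Drop that one."))
>>> b = segmenter.segment(text_split.tokenize("New intro here.  Keep this sentence."))
>>>
>>> a_first = a[0][0]              # 'Keep this sentence.' at the start of a
>>> b_last = b[0][2]               # 'Keep this sentence.' at the end of b
>>> matched = (a_first == b_last)  # content-based comparison (assumed)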