Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. This module provides a collection of Segmenter that can be used to produce hierarchical clusters of tokens (Segmenter) that can be understood by segment_matcher.

is an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequences of Segment and MatchableSegment
implements a segment() function that clusters tokens into segments of paragraph and sentence MatchableSegment with whitespace Segment inbetween.
class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)

Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended to be used in western languages where sentences and paragraphs are meaningful segments of text content.

Tree structure:

>>> from deltas import ParagraphsSentencesAndWhitespace, text_split
>>> from deltas.segmenters import print_tree
>>> a = text_split.tokenize("This comes first.  This comes second.")
>>> segmenter = ParagraphsSentencesAndWhitespace()
>>> segments = segmenter.segment(a)
>>> print_tree(segments)
Segment: 'This comes first.  This comes second.'
        MatchableSegment: 'This comes first.  This comes second.'
                MatchableSegment: 'This comes first.'
                Segment: '  '
                MatchableSegment: 'This comes second.'
whitespace : set ( str )

A set of token types that represent whitespace.

paragraph_end : set ( str )

A set of token types that represent the end of a pragraph.

sentence_end : set ( str)

A set of tokens types that represent the end of a sentence.

min_sentence : int

The minimum non-whitespace tokens that a sentence must contain before a sentence_end will be entertained.


Segments a sequence of tokens into a sequence of segments.

Parameters:tokens : list ( Token )
class deltas.Segmenter

Constructs a token segmentation strategy.

classmethod from_config(config, name, section_key='segmenters')

Constructs a segmenter from a configuration doc.


Segments a sequence of Token into a iterable of Segment


Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:

A segment of text with a start and end index that refers to the original sequence of tokens.
A segment of text that can be matched with another segment no matter where it appears in a document. Generally segmnents of this type represent a substantial collection of tokens.

Segment Types

class deltas.Segment(*args, **kwargs)

The index of the last deltas.Token in the segment.


generator : the tokens in this segment

class deltas.MatchableSegment(*args, **kwargs)

Constructs a segment that can be matched. Segments of this type general contain important content that might have been copied between different versions of text.