# Segmenters

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. This module provides a collection of Segmenter classes that produce hierarchical clusters of tokens (Segment) that can be understood by segment_matcher.

Segmenter
is an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequence of Segment and MatchableSegment objects.
ParagraphsSentencesAndWhitespace
implements a segment() function that clusters tokens into paragraph and sentence MatchableSegment objects with whitespace Segment objects in between.
class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)

Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended to be used in western languages where sentences and paragraphs are meaningful segments of text content.

Tree structure:

Example:

    >>> from deltas import ParagraphsSentencesAndWhitespace, text_split
    >>> from deltas.segmenters import print_tree
    >>>
    >>> a = text_split.tokenize("This comes first. This comes second.")
    >>>
    >>> segmenter = ParagraphsSentencesAndWhitespace()
    >>> segments = segmenter.segment(a)
    >>>
    >>> print_tree(segments)
    Segment: 'This comes first. This comes second.'
        MatchableSegment: 'This comes first. This comes second.'
            MatchableSegment: 'This comes first.'
            Segment: ' '
            MatchableSegment: 'This comes second.'

Parameters:

whitespace : set ( str )
A set of token types that represent whitespace.
paragraph_end : set ( str )
A set of token types that represent the end of a paragraph.
sentence_end : set ( str )
A set of token types that represent the end of a sentence.
min_sentence : int
The minimum number of non-whitespace tokens that a sentence must contain before a sentence_end will be entertained.
segment(tokens)

Segments a sequence of tokens into a sequence of segments.

Parameters: tokens : list ( Token )
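
The effect of sentence_end and min_sentence can be illustrated with a small, self-contained sketch. It does not import deltas: tokens are hypothetical (type, text) tuples, the token type names "word", "whitespace" and "period" are assumptions, and the grouping logic is a simplification for illustration rather than the library's implementation.

```python
def split_sentences(tokens, whitespace=frozenset({"whitespace"}),
                    sentence_end=frozenset({"period"}), min_sentence=2):
    """Group (type, text) tokens into ("sentence", [...]) and
    ("whitespace", [...]) runs. Hypothetical helper, not part of deltas."""
    segments = []
    current = []   # tokens of the sentence being built
    content = 0    # non-whitespace tokens seen so far in `current`
    for tok in tokens:
        tok_type, _ = tok
        if not current and tok_type in whitespace:
            # Whitespace between sentences forms its own segment.
            if segments and segments[-1][0] == "whitespace":
                segments[-1][1].append(tok)
            else:
                segments.append(("whitespace", [tok]))
            continue
        current.append(tok)
        if tok_type in sentence_end and content >= min_sentence:
            # Enough content came before this token: close the sentence.
            segments.append(("sentence", current))
            current, content = [], 0
        elif tok_type not in whitespace:
            content += 1
    if current:
        # Trailing tokens without a sentence end still form a sentence.
        segments.append(("sentence", current))
    return segments


tokens = [("word", "This"), ("whitespace", " "), ("word", "comes"),
          ("whitespace", " "), ("word", "first"), ("period", "."),
          ("whitespace", " "),
          ("word", "This"), ("whitespace", " "), ("word", "comes"),
          ("whitespace", " "), ("word", "second"), ("period", ".")]

result = [(kind, "".join(text for _, text in toks))
          for kind, toks in split_sentences(tokens)]
```

In this sketch, a period that follows fewer than min_sentence non-whitespace tokens (for example, after a one-letter abbreviation) does not close the sentence.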
class deltas.Segmenter

Constructs a token segmentation strategy.

classmethod from_config(config, name, section_key='segmenters')

Constructs a segmenter from a configuration doc.

segment(tokens)

Segments a sequence of Token objects into an iterable of Segment objects.
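
The abstract-base-class pattern described here can be sketched without deltas installed. SketchSegmenter and LineSegmenter below are hypothetical stand-ins, and plain strings stand in for tokens; the sketch only shows the shape of a strategy that must provide segment(), not the library's own class hierarchy.

```python
from abc import ABC, abstractmethod


class SketchSegmenter(ABC):
    """Hypothetical stand-in for an abstract segmenter (not deltas.Segmenter)."""

    @abstractmethod
    def segment(self, tokens):
        """Return an iterable of segments built from `tokens`."""


class LineSegmenter(SketchSegmenter):
    """Hypothetical strategy: one segment (a list of tokens) per line."""

    def segment(self, tokens):
        line = []
        for token in tokens:
            line.append(token)
            if token == "\n":
                # A newline token closes the current line segment.
                yield line
                line = []
        if line:
            # Emit any trailing tokens that were not newline-terminated.
            yield line


lines = list(LineSegmenter().segment(["a", " ", "b", "\n", "c"]))
```

Because segment() is declared abstract, instantiating SketchSegmenter directly raises TypeError; only subclasses that implement segment() can be constructed.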

## Segments

Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:

deltas.Segment
A segment of text with a start and end index that refers to the original sequence of tokens.
deltas.MatchableSegment
A segment of text that can be matched with another segment no matter where it appears in a document. Generally, segments of this type represent a substantial collection of tokens.
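
One way to picture the difference between the two types is a content digest that ignores position. The sketch below is an assumption-laden illustration: the class names, the half-open start/stop attributes, and the use of SHA-1 over the joined token text are all hypothetical and are not taken from the deltas source.

```python
import hashlib


class SketchSegment:
    """Hypothetical segment: the span tokens[start:stop] of a token list."""

    def __init__(self, tokens, start, stop):
        self.tokens, self.start, self.stop = tokens, start, stop

    def text(self):
        return "".join(self.tokens[self.start:self.stop])


class SketchMatchableSegment(SketchSegment):
    """Hypothetical matchable segment: equality depends on content only."""

    def digest(self):
        return hashlib.sha1(self.text().encode("utf-8")).hexdigest()

    def __eq__(self, other):
        return (isinstance(other, SketchMatchableSegment)
                and self.digest() == other.digest())

    def __hash__(self):
        return hash(self.digest())


old = ["Intro", ".", " ", "Moved", " ", "text", "."]
new = ["Moved", " ", "text", ".", " ", "Intro", "."]

a = SketchMatchableSegment(old, 3, 7)   # "Moved text." near the end
b = SketchMatchableSegment(new, 0, 4)   # "Moved text." at the start
same_content = (a == b)                 # positions differ, content matches
```

Hashing by content also lets such segments be used as dictionary keys, which is a convenient way to look up where a block of text went between two versions.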

### Segment Types

class deltas.Segment(*args, **kwargs)
end

The index of the last deltas.Token in the segment.

tokens()

generator : the tokens in this segment
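
For a hierarchical segment, a tokens() generator can be pictured as a depth-first walk over nested children. SketchNode below is a hypothetical container with plain strings as tokens; it is a minimal sketch of the idea and makes no claim about how deltas.Segment stores its contents.

```python
class SketchNode:
    """Hypothetical nested segment whose children are nodes or tokens."""

    def __init__(self, children):
        self.children = list(children)

    def tokens(self):
        # Walk the tree depth-first, yielding leaf tokens in document order.
        for child in self.children:
            if isinstance(child, SketchNode):
                yield from child.tokens()
            else:
                yield child


first = SketchNode(["This", " ", "comes", " ", "first", "."])
gap = SketchNode([" "])
second = SketchNode(["This", " ", "comes", " ", "second", "."])
paragraph = SketchNode([first, gap, second])
document = SketchNode([paragraph])

flat = "".join(document.tokens())
```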

class deltas.MatchableSegment(*args, **kwargs)

Constructs a segment that can be matched. Segments of this type generally contain important content that might have been copied between different versions of text.