Segmenters¶
Text segmentation is the process of dividing written text into meaningful units,
such as words, sentences, or topics. This module provides a collection of
Segmenter classes that can be used to produce hierarchical
clusters of tokens (Segment) that can be
understood by segment_matcher.
Segmenter - an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequence of Segment and MatchableSegment.
ParagraphsSentencesAndWhitespace - implements a segment() function that clusters tokens into paragraph and sentence MatchableSegment with whitespace Segment in between.
class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)¶
Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended to be used with Western languages, where sentences and paragraphs are meaningful segments of text content.
Tree structure:
- whitespace : Segment
- paragraph : MatchableSegment
  * sentence : MatchableSegment
  * whitespace : Segment
Example:
    >>> from deltas import ParagraphsSentencesAndWhitespace, text_split
    >>> from deltas.segmenters import print_tree
    >>>
    >>> a = text_split.tokenize("This comes first. This comes second.")
    >>>
    >>> segmenter = ParagraphsSentencesAndWhitespace()
    >>> segments = segmenter.segment(a)
    >>>
    >>> print_tree(segments)
    Segment: 'This comes first. This comes second.'
        MatchableSegment: 'This comes first. This comes second.'
            MatchableSegment: 'This comes first.'
            Segment: ' '
            MatchableSegment: 'This comes second.'
Parameters:
- whitespace : set ( str )
  A set of token types that represent whitespace.
- paragraph_end : set ( str )
  A set of token types that represent the end of a paragraph.
- sentence_end : set ( str )
  A set of token types that represent the end of a sentence.
- min_sentence : int
  The minimum number of non-whitespace tokens that a sentence must contain before a sentence_end will be considered.
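To make the roles of the four parameters concrete, here is a minimal, self-contained sketch of a paragraph/sentence/whitespace clustering pass. It does not import deltas and is not the library's implementation: the helpers `toy_tokenize`, `toy_segment`, and `text_of`, the token type names ("whitespace", "break", "period", "epoint", "qmark", "word", "etc"), and the default of 3 for `min_sentence` are all assumptions made for illustration.

```python
import re

# Hypothetical token type names, chosen for this sketch only.
WHITESPACE = {"whitespace", "break"}
PARAGRAPH_END = {"break"}
SENTENCE_END = {"period", "epoint", "qmark"}


def toy_tokenize(text):
    """Split text into (type, text) pairs with a deliberately simple regex."""
    tokens = []
    for piece in re.findall(r"\n{2,}|\s+|\w+|[^\w\s]", text):
        if re.fullmatch(r"\n{2,}", piece):
            kind = "break"
        elif piece.isspace():
            kind = "whitespace"
        elif piece == ".":
            kind = "period"
        elif piece == "!":
            kind = "epoint"
        elif piece == "?":
            kind = "qmark"
        else:
            kind = "word" if re.fullmatch(r"\w+", piece) else "etc"
        tokens.append((kind, piece))
    return tokens


def text_of(tokens):
    """Join the text of a list of (type, text) tokens."""
    return "".join(text for _, text in tokens)


def toy_segment(tokens, whitespace=WHITESPACE, paragraph_end=PARAGRAPH_END,
                sentence_end=SENTENCE_END, min_sentence=3):
    """Return a list of ("whitespace", tokens) and ("paragraph", children)
    entries; children are ("sentence", tokens) or ("whitespace", tokens)."""
    tree, i, n = [], 0, len(tokens)
    while i < n:
        kind = tokens[i][0]
        if kind in whitespace or kind in paragraph_end:
            # Top-level gap between paragraphs.
            run = []
            while i < n and (tokens[i][0] in whitespace
                             or tokens[i][0] in paragraph_end):
                run.append(tokens[i])
                i += 1
            tree.append(("whitespace", run))
            continue
        children = []
        while i < n and tokens[i][0] not in paragraph_end:
            if tokens[i][0] in whitespace:
                # Whitespace between sentences inside a paragraph.
                run = []
                while (i < n and tokens[i][0] in whitespace
                       and tokens[i][0] not in paragraph_end):
                    run.append(tokens[i])
                    i += 1
                children.append(("whitespace", run))
                continue
            sentence, content = [], 0
            while i < n and tokens[i][0] not in paragraph_end:
                tok = tokens[i]
                sentence.append(tok)
                i += 1
                # A sentence_end only counts once enough non-whitespace
                # tokens have been seen before it.
                if tok[0] in sentence_end and content >= min_sentence:
                    break
                if tok[0] not in whitespace:
                    content += 1
            # Move trailing whitespace out of the sentence.
            trailing = []
            while sentence[-1][0] in whitespace:
                trailing.insert(0, sentence.pop())
            children.append(("sentence", sentence))
            if trailing:
                children.append(("whitespace", trailing))
        tree.append(("paragraph", children))
    return tree


if __name__ == "__main__":
    sample = "This comes first. This comes second.\n\nNew paragraph here."
    for label, body in toy_segment(toy_tokenize(sample)):
        if label == "paragraph":
            print("paragraph")
            for child_label, child_tokens in body:
                print("   ", child_label, repr(text_of(child_tokens)))
        else:
            print(label, repr(text_of(body)))
```

Raising `min_sentence` in this sketch makes short fragments merge into the following sentence, which is the behavior the parameter description suggests; the real segmenter may count tokens differently.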
class deltas.Segmenter¶
Constructs a token segmentation strategy.
- classmethod from_config(config, name, section_key='segmenters')¶ Constructs a segmenter from a configuration doc.
Segments¶
Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:
deltas.Segment - A segment of text with a start and end index that refers to the original sequence of tokens.
deltas.MatchableSegment - A segment of text that can be matched with another segment no matter where it appears in a document. Generally, segments of this type represent a substantial collection of tokens.
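The distinction between the two abstract types can be sketched with a pair of stand-in classes. This is an illustration under stated assumptions, not the deltas implementation: `ToySegment` and `ToyMatchableSegment` are hypothetical names, `end` is treated as the inclusive index of the last token, and matching by a hash of the token content is just one plausible way to make a segment comparable regardless of its position.

```python
import hashlib


class ToySegment:
    """Hypothetical stand-in: a span over a token list with start/end indexes."""

    def __init__(self, all_tokens, start, end):
        self.all_tokens = all_tokens
        self.start = start  # index of the first token in the segment
        self.end = end      # index of the last token (inclusive, by assumption)

    def tokens(self):
        """Generate the tokens covered by this segment."""
        for i in range(self.start, self.end + 1):
            yield self.all_tokens[i]


class ToyMatchableSegment(ToySegment):
    """Hypothetical stand-in: equality depends on content, not position."""

    def _digest(self):
        h = hashlib.sha1()
        for tok in self.tokens():
            h.update(tok.encode("utf-8"))
            h.update(b"\x00")  # separator so token boundaries affect the hash
        return h.digest()

    def __eq__(self, other):
        return (isinstance(other, ToyMatchableSegment)
                and self._digest() == other._digest())

    def __hash__(self):
        return hash(self._digest())


# The same sentence appears at different positions in two token sequences.
old = ["Intro", ".", " ", "Body", "."]
new = ["Body", ".", " ", "Intro", "."]
a = ToyMatchableSegment(old, 3, 4)
b = ToyMatchableSegment(new, 0, 1)
print(a == b, a.start, b.start)
```

Because the comparison ignores `start` and `end`, a sentence or paragraph that was moved between two versions of a text still compares equal in this sketch, while plain segments such as whitespace are only located by their indexes.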
Segment Types¶
class deltas.Segment(*args, **kwargs)¶
- end¶ The index of the last deltas.Token in the segment.
- tokens()¶ generator : the tokens in this segment
class deltas.MatchableSegment(*args, **kwargs)¶
Constructs a segment that can be matched. Segments of this type generally contain important content that might have been copied between different versions of text.