Segmenters
Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. This module provides a collection of Segmenter classes that produce hierarchical clusters of tokens (segments) that can be understood by segment_matcher.
Segmenter
- is an abstract base class that requires the implementation of a segment() function that clusters tokens into a sequence of Segment and MatchableSegment.
ParagraphsSentencesAndWhitespace
- implements a segment() function that clusters tokens into paragraph and sentence MatchableSegment with whitespace Segment in between.
class deltas.ParagraphsSentencesAndWhitespace(*, whitespace=None, paragraph_end=None, sentence_end=None, min_sentence=None)
Constructs a paragraphs, sentences and whitespace segmenter. This segmenter is intended to be used in western languages where sentences and paragraphs are meaningful segments of text content.
Tree structure:
- whitespace : Segment
- paragraph : MatchableSegment
    * sentence : MatchableSegment
    * whitespace : Segment
Example:
    >>> from deltas import ParagraphsSentencesAndWhitespace, text_split
    >>> from deltas.segmenters import print_tree
    >>>
    >>> a = text_split.tokenize("This comes first. This comes second.")
    >>>
    >>> segmenter = ParagraphsSentencesAndWhitespace()
    >>> segments = segmenter.segment(a)
    >>>
    >>> print_tree(segments)
    Segment: 'This comes first. This comes second.'
        MatchableSegment: 'This comes first. This comes second.'
            MatchableSegment: 'This comes first.'
            Segment: ' '
            MatchableSegment: 'This comes second.'
Parameters:
- whitespace : set(str)
  A set of token types that represent whitespace.
- paragraph_end : set(str)
  A set of token types that represent the end of a paragraph.
- sentence_end : set(str)
  A set of token types that represent the end of a sentence.
- min_sentence : int
  The minimum number of non-whitespace tokens that a sentence must contain before a sentence_end will be considered.
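The interaction between sentence_end and min_sentence can be illustrated with a minimal, self-contained sketch. It does not use deltas: toy_tokenize and toy_sentences are hypothetical helpers, the sketch matches on token strings rather than token types, and it assumes the sentence-ending token itself counts toward min_sentence. The library's actual behavior may differ.

```python
import re

# Hypothetical stand-ins for illustration only; these are not deltas APIs.
TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word, punctuation and whitespace tokens."""
    return TOKEN_RE.findall(text)


def toy_sentences(tokens, sentence_end=frozenset({".", "!", "?"}), min_sentence=2):
    """Cluster tokens into ("sentence", tokens) and ("whitespace", tokens) pairs.

    A sentence-ending token only closes a sentence once the sentence holds
    at least `min_sentence` non-whitespace tokens (assumption: the ending
    token itself is counted).
    """
    segments = []
    current = []  # tokens of the sentence being built
    content = 0   # non-whitespace tokens seen in `current`
    for token in tokens:
        if token.isspace() and not current:
            # Whitespace between sentences becomes its own segment.
            segments.append(("whitespace", [token]))
            continue
        current.append(token)
        if not token.isspace():
            content += 1
        if token in sentence_end and content >= min_sentence:
            segments.append(("sentence", current))
            current, content = [], 0
    if current:
        # Trailing tokens without a sentence end still form a segment.
        segments.append(("sentence", current))
    return segments


tokens = toy_tokenize("This comes first. This comes second.")
segments = toy_sentences(tokens)
short = toy_sentences(toy_tokenize("A. B comes."), min_sentence=3)
```

In this sketch, raising min_sentence keeps a very short fragment such as "A." from being split off as its own sentence.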
class deltas.Segmenter
Constructs a token segmentation strategy.
- classmethod from_config(config, name, section_key='segmenters')
  Constructs a segmenter from a configuration doc.
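The general pattern of building an object from a named entry in a configuration doc can be sketched as follows. This is a toy illustration only: ToySegmenter, its default value, and the config layout (keyword arguments stored under config[section_key][name]) are assumptions made for the example, not the documented deltas configuration format.

```python
class ToySegmenter:
    """Hypothetical segmenter used only to illustrate config-driven construction."""

    def __init__(self, *, min_sentence=5):
        # The default of 5 is arbitrary and chosen for this sketch.
        self.min_sentence = min_sentence

    @classmethod
    def from_config(cls, config, name, section_key="segmenters"):
        # Assumed layout: config[section_key][name] maps option names to values.
        options = config[section_key][name]
        return cls(**options)


# A configuration doc with one named segmenter entry (assumed layout).
config = {"segmenters": {"western": {"min_sentence": 3}}}
segmenter = ToySegmenter.from_config(config, "western")
```

Keeping the section key as a parameter lets several named strategies live side by side in one configuration doc.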
Segments
Segments represent subsequences of tokens that have interesting properties. All segments are based on two abstract types:
deltas.Segment
- A segment of text with a start and end index that refers to the original sequence of tokens.
deltas.MatchableSegment
- A segment of text that can be matched with another segment no matter where it appears in a document. Generally, segments of this type represent a substantial collection of tokens.
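The difference between the two abstract types can be sketched with toy classes. ToySegment, ToyMatchableSegment, match_key and find_moved are hypothetical names that do not come from deltas, and the sketch assumes that end is the inclusive index of the last token. The point is only that a matchable segment is identified by its content rather than by its position.

```python
class ToySegment:
    """Hypothetical segment: a start/end window over a token sequence."""

    def __init__(self, source, start, end):
        self.source = source
        self.start = start
        self.end = end  # assumption: inclusive index of the last token

    def tokens(self):
        """Yield the tokens covered by this segment."""
        for i in range(self.start, self.end + 1):
            yield self.source[i]


class ToyMatchableSegment(ToySegment):
    """Hypothetical segment that is compared by content, not by position."""

    def match_key(self):
        return tuple(self.tokens())


def find_moved(old_segments, new_segments):
    """Pair each new segment with an old segment that has the same content."""
    index = {seg.match_key(): seg for seg in old_segments}
    return [
        (index[seg.match_key()], seg)
        for seg in new_segments
        if seg.match_key() in index
    ]


old = ["a", "b", " ", "c", "d"]
new = ["c", "d", " ", "a", "b"]
old_segments = [ToyMatchableSegment(old, 0, 1), ToyMatchableSegment(old, 3, 4)]
new_segments = [ToyMatchableSegment(new, 0, 1), ToyMatchableSegment(new, 3, 4)]
pairs = find_moved(old_segments, new_segments)
```

Here the two blocks swap places between versions, yet each one is still paired with its counterpart because the lookup uses token content as the key.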
Segment Types
class deltas.Segment(*args, **kwargs)
- end
  The index of the last deltas.Token in the segment.
- tokens()
  generator : the tokens in this segment
class deltas.MatchableSegment(*args, **kwargs)
Constructs a segment that can be matched. Segments of this type generally contain important content that might have been copied between different versions of text.