Tokenizers¶
Tokenizers are used to split text content into a sequence of tokens. Extend Tokenizer to implement a custom tokenizer. See also RegexTokenizer for producing a tokenizer based on a lexicon.
- deltas.text_split - a RegexTokenizer that splits text into words, punctuation, symbols and whitespace
- deltas.wikitext_split - a RegexTokenizer that splits text into words, punctuation, symbols and whitespace, as well as wikitext markup elements (e.g. ('dcurly_open', "{{") and ('bold', "'''")); see the usage sketch below
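A minimal usage sketch of the built-in tokenizers, assuming text_split and wikitext_split are importable from the top-level deltas package (as their dotted names above suggest) and that tokenize() returns a sequence of Token objects; the sample strings and the .type attribute name are illustrative assumptions:

    from deltas import text_split, wikitext_split

    # Split plain text into words, punctuation, symbols and whitespace.
    tokens = text_split.tokenize("Hello, world!")

    # wikitext_split additionally recognizes markup such as "{{" and "'''".
    wiki_tokens = wikitext_split.tokenize("'''bold''' and {{template}}")

    for token in wiki_tokens:
        print(repr(token), token.type)  # .type attribute name assumed from Token's signature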
Classes¶
- class deltas.Tokenizer¶
  Constructs a tokenization strategy.
  - tokenize(text, token_class=<class 'deltas.tokenizers.token.Token'>)¶
    Tokenizes a text.
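A minimal sketch of extending Tokenizer to implement a custom tokenizer, assuming a subclass only needs to provide tokenize(); the whitespace-splitting rule and the type names "word" and "whitespace" are purely illustrative:

    import re

    from deltas import Token, Tokenizer

    class WhitespaceTokenizer(Tokenizer):
        """Illustrative subclass: splits text on whitespace boundaries only."""

        def tokenize(self, text, token_class=Token):
            # Emit alternating word/whitespace chunks as typed tokens.
            return [
                token_class(match.group(0),
                            type="whitespace" if match.group(0).isspace() else "word")
                for match in re.finditer(r"\s+|\S+", text)
            ]

    # e.g. WhitespaceTokenizer().tokenize("a b") -> [Token('a'), Token(' '), Token('b')]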
- class deltas.RegexTokenizer(lexicon)¶
  Uses a lexicon of regular expressions and names to tokenize a text string.
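A sketch of constructing a RegexTokenizer from a custom lexicon. The (name, pattern) pair format is an assumption modeled on the description above ("regular expressions and names"), as is the idea that earlier patterns take precedence; treat both as hypothetical rather than the library's documented contract:

    from deltas import RegexTokenizer

    # Hypothetical lexicon: each entry pairs a token type name with a regex.
    LEXICON = [
        ("word", r"[^\W\d_]+"),
        ("number", r"\d+"),
        ("whitespace", r"\s+"),
        ("punctuation", r"[^\w\s]"),
    ]

    tokenizer = RegexTokenizer(LEXICON)
    tokens = tokenizer.tokenize("Split me: 42 times.")
    # Each token is expected to carry its matching rule's name as its type,
    # e.g. Token('Split') typed "word" -- attribute name assumed.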
Token¶
Tokens represent chunks of text that have semantic meaning. A Token class that extends str is provided.
- class deltas.Token(content, type=None)¶
  Constructs a typed sub-string extracted from a text.
  - tokens()¶
    Returns an iterator of self. This method reflects the behavior of deltas.Segment.tokens().
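Because Token extends str, instances behave like ordinary strings while carrying a type. A minimal sketch; the .type attribute name is inferred from the constructor signature rather than stated above:

    from deltas import Token

    t = Token("{{", type="dcurly_open")

    assert isinstance(t, str)   # Token extends str
    assert t == "{{"            # compares like a plain string
    print(t.type)               # 'dcurly_open' -- attribute name assumed
    print(list(t.tokens()))     # iterator of self, mirroring deltas.Segment.tokens()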