Tokenizers

Tokenizers are used to split text content into a sequence of tokens. Extend Tokenizer to implement a custom tokenizer. See RegexTokenizer for building a tokenizer from a lexicon of regular expressions.

deltas.text_split
a RegexTokenizer that splits text into words, punctuation, symbols and whitespace.
deltas.wikitext_split
a RegexTokenizer that splits text into words, punctuation, symbols and whitespace as well as wikitext markup elements (e.g. ('dcurly_open', "{{") and ('bold', "'''")).
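
Both prebuilt tokenizers can be called directly via tokenize(). The snippet below is a minimal sketch; the exact token type names depend on the lexicon each tokenizer was built with, so the commented notes are illustrative rather than exact output.

    from deltas import text_split, wikitext_split

    tokens = text_split.tokenize("Foo bar.")
    # Each token is a Token (a str subclass) carrying a type attribute.
    print([(t.type, str(t)) for t in tokens])

    wiki_tokens = wikitext_split.tokenize("Hello {{template}}")
    # wikitext_split additionally recognizes wikitext markup such as "{{".
    print([(t.type, str(t)) for t in wiki_tokens])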

Classes

class deltas.Tokenizer

Constructs a tokenization strategy.

tokenize(text, token_class=<class 'deltas.tokenizers.token.Token'>)

Tokenizes a text.
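
A custom tokenizer can be written by subclassing Tokenizer and overriding tokenize(). The sketch below is hypothetical (WordTokenizer is not part of deltas) and assumes the base class needs no constructor arguments.

    import re
    from deltas import Tokenizer, Token

    class WordTokenizer(Tokenizer):
        # Hypothetical example: emits alternating word and non-word chunks.
        def tokenize(self, text, token_class=Token):
            return [token_class(m.group(0),
                                type="word" if m.group(0).strip().isalnum() else "other")
                    for m in re.finditer(r"\w+|\W+", text)]

    tokens = WordTokenizer().tokenize("Hello, world")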

class deltas.RegexTokenizer(lexicon)

Uses a lexicon of regular expressions and names to tokenize a text string.
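
For example, a small lexicon can be passed as an ordered list of (name, pattern) pairs. This is a sketch under the assumption that each name becomes the token's type and that earlier patterns take precedence when more than one matches.

    from deltas import RegexTokenizer

    lexicon = [
        ("word",       r"[A-Za-z]+"),
        ("number",     r"[0-9]+"),
        ("whitespace", r"\s+"),
        ("etc",        r".")
    ]

    tokenizer = RegexTokenizer(lexicon)
    tokens = tokenizer.tokenize("Version 42 released")
    print([(t.type, str(t)) for t in tokens])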

Token

Tokens represent chunks of text that have semantic meaning. A Token class that extends str is provided.

class deltas.Token(content, type=None)

Constructs a typed sub-string extracted from a text.

tokens()

Returns an iterator over self. This method mirrors the behavior of deltas.Segment.tokens().
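
The following sketch shows how a Token behaves as both a string and a typed value; it assumes the type passed to the constructor is exposed as the .type attribute.

    from deltas import Token

    t = Token("{{", type="dcurly_open")
    assert isinstance(t, str)      # Token extends str
    assert t == "{{"               # compares like a plain string
    print(t.type)                  # the type assigned at construction
    print(list(t.tokens()))        # tokens() iterates over just this token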