Tokenizers¶
Tokenizers are used to split text content into a sequence of tokens. Extend Tokenizer to implement a custom tokenizer. See also RegexTokenizer for producing a tokenizer based on a lexicon.
- deltas.text_split - a RegexTokenizer that splits text into words, punctuation, symbols and whitespace
- deltas.wikitext_split - a RegexTokenizer that splits text into words, punctuation, symbols and whitespace, as well as wikitext markup elements (e.g. ('dcurly_open', "{{") and ('bold', "'''")); see the usage sketch below
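A minimal usage sketch of the built-in tokenizers, assuming text_split and wikitext_split are importable from the top-level deltas package (as their dotted names above suggest) and that tokenize() returns a sequence of Token objects; the sample strings and the .type attribute name are illustrative assumptions:

    from deltas import text_split, wikitext_split

    # Split plain text into words, punctuation, symbols and whitespace.
    tokens = text_split.tokenize("Hello, world!")

    # wikitext_split additionally recognizes markup such as "{{" and "'''".
    wiki_tokens = wikitext_split.tokenize("'''bold''' and {{template}}")

    for token in wiki_tokens:
        print(repr(token), token.type)  # .type attribute name assumed from Token's signature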
Classes¶
- class deltas.Tokenizer¶
  Constructs a tokenization strategy.
  - tokenize(text, token_class=<class 'deltas.tokenizers.token.Token'>)¶
    Tokenizes a text.
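A minimal sketch of extending Tokenizer to implement a custom tokenizer, assuming a subclass only needs to provide tokenize(); the whitespace-splitting rule and the type names "word" and "whitespace" are purely illustrative:

    import re

    from deltas import Token, Tokenizer

    class WhitespaceTokenizer(Tokenizer):
        """Illustrative subclass: splits text on whitespace boundaries only."""

        def tokenize(self, text, token_class=Token):
            # Emit alternating word/whitespace chunks as typed tokens.
            return [
                token_class(match.group(0),
                            type="whitespace" if match.group(0).isspace() else "word")
                for match in re.finditer(r"\s+|\S+", text)
            ]

    # e.g. WhitespaceTokenizer().tokenize("a b") -> [Token('a'), Token(' '), Token('b')]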
- class deltas.RegexTokenizer(lexicon)¶
  Uses a lexicon of regular expressions and names to tokenize a text string.
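A sketch of constructing a RegexTokenizer from a custom lexicon. The (name, pattern) pair format is an assumption modeled on the description above ("regular expressions and names"), as is the idea that earlier patterns take precedence; treat both as hypothetical rather than the library's documented contract:

    from deltas import RegexTokenizer

    # Hypothetical lexicon: each entry pairs a token type name with a regex.
    LEXICON = [
        ("word", r"[^\W\d_]+"),
        ("number", r"\d+"),
        ("whitespace", r"\s+"),
        ("punctuation", r"[^\w\s]"),
    ]

    tokenizer = RegexTokenizer(LEXICON)
    tokens = tokenizer.tokenize("Split me: 42 times.")
    # Each token is expected to carry its matching rule's name as its type,
    # e.g. Token('Split') typed "word" -- attribute name assumed.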
Token¶
Tokens represent chunks of text that have semantic meaning. A Token class that extends str is provided.
- class deltas.Token(content, type=None)¶
  Constructs a typed sub-string extracted from a text.
  - tokens()¶
    Returns an iterator of self. This method reflects the behavior of deltas.Segment.tokens().
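Because Token extends str, instances behave like ordinary strings while carrying a type. A minimal sketch; the .type attribute name is inferred from the constructor signature rather than stated above:

    from deltas import Token

    t = Token("{{", type="dcurly_open")

    assert isinstance(t, str)   # Token extends str
    assert t == "{{"            # compares like a plain string
    print(t.type)               # 'dcurly_open' -- attribute name assumed
    print(list(t.tokens()))     # iterator of self, mirroring deltas.Segment.tokens()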