Module tknz
The Tokenizer class takes an input stream and parses it into
tokens.
The parsing process is controlled by the character classification
sets:
- blankspace characters: characters that mark a token boundary and are
not part of the token.
- separator characters: characters that mark a token boundary and may
themselves be treated as tokens, depending on the value of a flag (to be
implemented).
- valid characters: any character that is neither a blankspace nor a
separator.
Each byte read from the input stream is regarded as a character in the
range '\u0000' through '\u00FF'.
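To illustrate the three classes, the following sketch classifies characters
against hypothetical blankspace and separator sets; the set values and the
classify() helper are examples only and are not part of tknz.

    # Illustrative only: hypothetical character sets, not values used by tknz.
    blankspaces = set(" \t\n")   # mark a boundary and are discarded
    separators = set("()+,")     # mark a boundary and may be tokens themselves

    def classify(ch: str) -> str:
        """Return the classification of a single character."""
        if ch in blankspaces:
            return "blankspace"
        if ch in separators:
            return "separator"
        return "valid"

    print([classify(c) for c in "a+b c"])
    # ['valid', 'separator', 'valid', 'blankspace', 'valid']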
In addition, an instance has flags that control:
- whether the characters of tokens are converted to lowercase.
- whether separator characters constitute tokens (to be implemented).
A typical application first constructs an instance of this class,
supplying the input stream to be tokenized, the set of blankspaces, and
the set of separators, and then repeatedly calls the next_token() method
as long as has_more_tokens() returns true.
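A minimal usage sketch follows. The constructor signature
Tokenizer(stream, blankspaces, separators) and the set values are assumptions;
has_more_tokens() and next_token() are the methods described above.

    import io
    from tknz import Tokenizer

    # Assumed constructor signature; parameter order is illustrative.
    stream = io.StringIO("alpha beta(gamma)")
    tok = Tokenizer(stream, " \t\n", "()")

    while tok.has_more_tokens():
        print(tok.next_token())
    # Expected output (if separators are not returned as tokens):
    # alpha
    # beta
    # gamma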