Module tknz
The Tokenizer class takes an input stream and parses it into
tokens.
The parsing process is controlled by the character classification
sets:
- blankspace characters: characters that mark a token boundary and are
not part of the token.
- separator characters: characters that mark a token boundary and may
themselves be treated as tokens, depending on the value of a flag (to be
implemented).
- valid characters: any character that is neither a blankspace nor a
separator.
Each byte read from the input stream is regarded as a character in the
range '\u0000' through '\u00FF'.
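To illustrate the three classes, the following sketch classifies characters
against hypothetical blankspace and separator sets; the set values and the
classify() helper are examples only and are not part of tknz.

    # Illustrative only: hypothetical character sets, not values used by tknz.
    blankspaces = set(" \t\n")   # mark a boundary and are discarded
    separators = set("()+,")     # mark a boundary and may be tokens themselves

    def classify(ch: str) -> str:
        """Return the classification of a single character."""
        if ch in blankspaces:
            return "blankspace"
        if ch in separators:
            return "separator"
        return "valid"

    print([classify(c) for c in "a+b c"])
    # ['valid', 'separator', 'valid', 'blankspace', 'valid']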
In addition, an instance has flags that control:
- whether the characters of tokens are converted to lowercase.
- whether separator characters constitute tokens (to be implemented).
A typical application first constructs an instance of this class,
supplying the input stream to be tokenized, the set of blankspaces, and
the set of separators, and then repeatedly calls the next_token() method
as long as has_more_tokens() returns true.
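A minimal usage sketch follows. The constructor signature
Tokenizer(stream, blankspaces, separators) and the set values are assumptions;
has_more_tokens() and next_token() are the methods described above.

    import io
    from tknz import Tokenizer

    # Assumed constructor signature; parameter order is illustrative.
    stream = io.StringIO("alpha beta(gamma)")
    tok = Tokenizer(stream, " \t\n", "()")

    while tok.has_more_tokens():
        print(tok.next_token())
    # Expected output (if separators are not returned as tokens):
    # alpha
    # beta
    # gamma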