Package tipy :: Module tknz :: Class ReverseTokenizer

Class ReverseTokenizer


object --+    
         |    
 Tokenizer --+
             |
            ReverseTokenizer

Tokenize a stream from the end to the beginning.

Class Hierarchy for ReverseTokenizer

Nested Classes
    Inherited from Tokenizer
  __metaclass__
Metaclass for defining Abstract Base Classes (ABCs).
Instance Methods
 
__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...)
Constructor of the ReverseTokenizer class.
int
count_tokens(self)
Check the number of tokens left.
int
count_chars(self)
Count the number of characters in the stream.
bool
has_more_tokens(self)
Test if at least one token remains.
str
next_token(self)
Retrieve the next token in the stream.
float
progress(self)
Return the progress percentage.
 
reset_stream(self)
Reset the offset to the end offset.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

    Inherited from Tokenizer
bool
is_blankspace(self, char)
Test if a character is a blankspace.
bool
is_separator(self, char)
Test if a character is a separator.
Class Variables
  __abstractmethods__ = frozenset([])
    Inherited from Tokenizer
  _abc_cache = <_weakrefset.WeakSet object at 0x7f2a42321710>
  _abc_negative_cache = <_weakrefset.WeakSet object at 0x7f2a423...
  _abc_negative_cache_version = 39
  _abc_registry = <_weakrefset.WeakSet object at 0x7f2a42321690>
Properties

Inherited from object: __class__

Method Details

__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...)
(Constructor)


Constructor of the ReverseTokenizer class.

Parameters:
  • stream (str or io.IOBase) - The stream to tokenize. Can be a filename or any open IO stream.
  • blankspaces (str) - The characters that represent empty spaces.
  • separators (str) - The characters that separate token units (e.g. word boundaries).
Overrides: object.__init__
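
To illustrate how the blankspaces and separators parameters could drive character classification, here is a minimal sketch. It is not tipy's code: the class name CharClassifier and the shortened default character sets are assumptions made for this example, and only the method names is_blankspace and is_separator are taken from this page.

```python
class CharClassifier:
    """Hypothetical stand-in for illustration; not part of tipy."""

    def __init__(self,
                 blankspaces=' \t\n\r\x0b\x0c',
                 separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<'):
        # Assumed defaults: ASCII-only subsets of the defaults listed above.
        self.blankspaces = blankspaces
        self.separators = separators

    def is_blankspace(self, char):
        # Require exactly one character, because '' is "in" every string.
        return len(char) == 1 and char in self.blankspaces

    def is_separator(self, char):
        return len(char) == 1 and char in self.separators


classifier = CharClassifier()
for char in "a ,":
    print(repr(char), classifier.is_blankspace(char), classifier.is_separator(char))
```

Passing custom strings for either parameter changes which characters count as token boundaries in this sketch.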

count_tokens(self)


Check the number of tokens left.

Returns: int
The number of tokens left.
Overrides: Tokenizer.count_tokens

count_chars(self)


Count the number of characters in the stream.

Returns: int
The number of characters in the stream.
Overrides: Tokenizer.count_chars

Note: Should return the same value as the character count reported by the Unix wc command (wc -m).
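
The difference between a character count and a byte count matters for non-ASCII text: wc -m counts characters, while wc -c counts bytes. The sketch below shows the distinction on a temporary file. The helper name count_file_chars and the UTF-8 encoding are assumptions for this example; it does not reproduce how tipy reads its stream.

```python
import os
import tempfile


def count_file_chars(path, encoding="utf-8"):
    """Count decoded characters in a file (hypothetical helper)."""
    total = 0
    # newline="" disables newline translation so "\r\n" still counts as two.
    with open(path, encoding=encoding, newline="") as handle:
        while True:
            chunk = handle.read(4096)
            if not chunk:
                break
            total += len(chunk)
    return total


# Write a small UTF-8 sample: 'é' is one character but two bytes.
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write("héllo\n".encode("utf-8"))
    sample_path = sample.name

char_total = count_file_chars(sample_path)
byte_total = os.path.getsize(sample_path)
os.remove(sample_path)

print(char_total, byte_total)  # 6 7
```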

has_more_tokens(self)


Test if at least one token remains.

Returns: bool
True if at least one token is left in the stream, False otherwise. (Keep in mind that the stream is tokenized from the end to the beginning.)
Overrides: Tokenizer.has_more_tokens

next_token(self)


Retrieve the next token in the stream.

Returns: str
The next token, or '' if there is no next token.
Overrides: Tokenizer.next_token

Note: Because this is a reversed tokenizer, the "next" token is actually what one would call the "previous" token in the stream, but within the tokenizer workflow it is more logical to call it the "next" token.
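
The end-to-beginning behaviour of has_more_tokens, next_token and reset_stream can be sketched with a small in-memory class. This is an illustration, not tipy's implementation. The class name ReverseTokenizerSketch and the short default character sets are assumptions. So is the choice to drop blankspaces and separators rather than return them as tokens.

```python
class ReverseTokenizerSketch:
    """Hypothetical in-memory illustration; not tipy's ReverseTokenizer."""

    def __init__(self, text, blankspaces=" \t\n\r", separators=".,;:!?"):
        self.text = text
        # Assumption: both character sets delimit tokens and are discarded.
        self.delimiters = set(blankspaces) | set(separators)
        self.reset_stream()

    def reset_stream(self):
        # Reading starts at the end of the text.
        self.offset = len(self.text)

    def has_more_tokens(self):
        # A token remains if any non-delimiter lies before the offset.
        return any(c not in self.delimiters for c in self.text[:self.offset])

    def next_token(self):
        i = self.offset
        # Skip delimiters while moving toward the beginning.
        while i > 0 and self.text[i - 1] in self.delimiters:
            i -= 1
        end = i
        # Walk back over the token's characters.
        while i > 0 and self.text[i - 1] not in self.delimiters:
            i -= 1
        self.offset = i
        # Slicing restores left-to-right order; '' means no token is left.
        return self.text[i:end]


tokenizer = ReverseTokenizerSketch("The quick brown fox.")
tokens = []
while tokenizer.has_more_tokens():
    tokens.append(tokenizer.next_token())
print(tokens)  # ['fox', 'brown', 'quick', 'The']
```

Calling reset_stream() on the sketch moves the offset back to the end, so the same text can be tokenized again.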

progress(self)


Return the progress percentage.

Returns: float
The tokenization progress percentage.
Overrides: Tokenizer.progress
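
For a tokenizer that reads backwards, one plausible way to compute a progress percentage is to measure how far the offset has moved from the end of the stream toward the beginning. The function below is a sketch under that assumption. The name reverse_progress, its parameters and the handling of an empty stream are hypothetical, and this page does not confirm the formula tipy uses.

```python
def reverse_progress(offset, total_chars):
    """Percentage of the stream consumed when reading from the end.

    offset is the current position, starting at total_chars and moving
    toward 0 (assumed convention for this sketch).
    """
    if total_chars == 0:
        # Assumption: an empty stream is treated as fully processed.
        return 100.0
    return 100.0 * (total_chars - offset) / total_chars


print(reverse_progress(20, 20))  # 0.0
print(reverse_progress(5, 20))   # 75.0
print(reverse_progress(0, 20))   # 100.0
```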

reset_stream(self)


Reset the offset to the end offset.

Overrides: Tokenizer.reset_stream