Package tipy :: Module tknz :: Class ReverseTokenizer

Class ReverseTokenizer

object --+    
         |    
 Tokenizer --+
             |
            ReverseTokenizer

Tokenize a stream from the end to the beginning.


Nested Classes


Inherited from Tokenizer

__metaclass__
Metaclass for defining Abstract Base Classes (ABCs).

Instance Methods


__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...)
Constructor of the ReverseTokenizer class.

int

count_tokens(self)
Check the number of tokens left.


int

count_chars(self)
Count the number of characters in the stream.


bool

has_more_tokens(self)
Test if at least one token remains.


str

next_token(self)
Retrieve the next token in the stream.


float

progress(self)
Return the progress percentage.


reset_stream(self)
Reset the offset to the end offset.


Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Inherited from Tokenizer

bool

is_blankspace(self, char)
Test if a character is a blankspace.


bool

is_separator(self, char)
Test if a character is a separator.


Class Variables


__abstractmethods__ = frozenset([])

Inherited from Tokenizer

_abc_cache = <_weakrefset.WeakSet object at 0x7f2a42321710>

_abc_negative_cache = <_weakrefset.WeakSet object at 0x7f2a423...

_abc_negative_cache_version = 39

_abc_registry = <_weakrefset.WeakSet object at 0x7f2a42321690>

Properties


Inherited from object: __class__

Method Details


__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...) (Constructor)


Constructor of the ReverseTokenizer class.

Parameters:

stream (str or io.IOBase) - The stream to tokenize. Can be a filename or any open IO stream.
blankspaces (str) - The characters that represent empty spaces.
separators (str) - The characters that separate token units (e.g. word boundaries).

Overrides: object.__init__
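
The role of the blankspaces and separators parameters can be illustrated with a minimal sketch. It does not use the tipy package: classify() and the two constants are hypothetical stand-ins, and the character sets are shortened because the real defaults are truncated in the signature above. It only assumes that every character is treated as a blankspace, a separator, or part of a token.

```python
# Hypothetical stand-ins for the constructor's `blankspaces` and `separators`
# arguments; the real defaults are longer and truncated in the signature.
BLANKSPACES = " \n\r\t"
SEPARATORS = "`~!@#$%^&*()_-+=\\|]}[{\";:/?.>,<"


def classify(char, blankspaces=BLANKSPACES, separators=SEPARATORS):
    """Return 'blankspace', 'separator' or 'token' for a single character."""
    if char in blankspaces:
        return "blankspace"
    if char in separators:
        return "separator"
    return "token"


# Classify every character of a short sample string.
kinds = [classify(c) for c in "hi, you"]
```

Passing different strings for the two character sets changes where token boundaries fall, which is presumably why the constructor exposes both as parameters.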

count_tokens(self)


Check the number of tokens left.

Returns: int: The number of tokens left.
Overrides: Tokenizer.count_tokens

count_chars(self)


Count the number of characters in the stream.

Returns: int: The number of characters in the stream.
Overrides: Tokenizer.count_chars

Note: Should return the same value as the wc Unix command.

has_more_tokens(self)


Test if at least one token remains.

Returns: bool: True if at least one token is left in the stream, False otherwise. (Keep in mind that the stream is tokenized from the end to the beginning.)
Overrides: Tokenizer.has_more_tokens

next_token(self)


Retrieve the next token in the stream.

Returns: str: The next token, or '' if there is no next token.
Overrides: Tokenizer.next_token

Note: As this is a reversed tokenizer, the "next" token is actually what one would call the "previous" token in reading order. Within the tokenizer workflow, however, it is more logical to call it the "next" token.
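
To make the "next means previous" convention concrete, here is a minimal sketch of end-to-beginning tokenization over a plain string. It is not the tipy implementation: reverse_tokens() is a hypothetical helper, the character sets are shortened, and it assumes separators are simply discarded (the real class may handle them differently).

```python
def reverse_tokens(text, blankspaces=" \n\r\t", separators=",.;:!?"):
    """Yield the tokens of `text`, starting from the end of the string."""
    delimiters = blankspaces + separators
    offset = len(text)
    while offset > 0:
        # Skip blankspaces and separators to the left of the offset.
        while offset > 0 and text[offset - 1] in delimiters:
            offset -= 1
        end = offset
        # Walk left until the start of the token is reached.
        while offset > 0 and text[offset - 1] not in delimiters:
            offset -= 1
        if offset < end:
            yield text[offset:end]


tokens = list(reverse_tokens("the quick brown fox."))
# tokens == ['fox', 'brown', 'quick', 'the']
```

Each token keeps its own character order; only the order in which tokens are returned is reversed.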

progress(self)


Return the progress percentage.

Returns: float: The tokenization progress percentage.
Overrides: Tokenizer.progress
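
The page does not state how the percentage is computed. One plausible reading, sketched below under that assumption, is the share of characters already consumed while the offset moves from the end of the stream toward 0. reverse_progress() and its arguments are hypothetical names, not part of tipy.

```python
def reverse_progress(offset, total):
    """Percentage of the stream consumed when reading from the end.

    `offset` is the current position: it starts at `total` and ends at 0.
    """
    if total == 0:
        return 100.0  # assumption: an empty stream counts as fully processed
    return 100.0 * (total - offset) / total


start = reverse_progress(20, 20)   # nothing consumed yet
middle = reverse_progress(5, 20)   # 15 of 20 characters consumed
done = reverse_progress(0, 20)     # offset reached the beginning
```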

reset_stream(self)


Reset the offset to the end offset.

Overrides: Tokenizer.reset_stream
