Package tipy :: Module tknz :: Class ForwardTokenizer

Class ForwardTokenizer

object --+    
         |    
 Tokenizer --+
             |
            ForwardTokenizer

Tokenize a stream from the beginning to the end.

Class Hierarchy for ForwardTokenizer

Nested Classes
    Inherited from Tokenizer
  __metaclass__
Metaclass for defining Abstract Base Classes (ABCs).
Instance Methods
 
__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...)
Constructor of the ForwardTokenizer class.
int
count_tokens(self)
Check the number of tokens left.
int
count_chars(self)
Count the number of characters in the stream.
bool
has_more_tokens(self)
Test if at least one token remains.
str
next_token(self)
Retrieve the next token in the stream.
float
progress(self)
Return the progress percentage.
 
reset_stream(self)
Reset the offset to 0.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

    Inherited from Tokenizer
bool
is_blankspace(self, char)
Test if a character is a blankspace.
bool
is_separator(self, char)
Test if a character is a separator.
Class Variables
  __abstractmethods__ = frozenset([])
    Inherited from Tokenizer
  _abc_cache = <_weakrefset.WeakSet object at 0x7f2a42321710>
  _abc_negative_cache = <_weakrefset.WeakSet object at 0x7f2a423...
  _abc_negative_cache_version = 39
  _abc_registry = <_weakrefset.WeakSet object at 0x7f2a42321690>
Properties

Inherited from object: __class__

Method Details

__init__(self, stream, lowercase=False, blankspaces=' \x0c\n\\c\r\t\x0b\xc2\x85', separators='`~!@#$%^&*()_-+=\\|]}[{";:/?.>,<\xe2\x80\xa0\xe2\x80\x9e\xe2\...)
(Constructor)

Constructor of the ForwardTokenizer class.

Parameters:
  • stream (str or io.IOBase) - The stream to tokenize. Can be a filename or any open IO stream.
  • blankspaces (str) - The characters that represent empty spaces.
  • separators (str) - The characters that separate token units (e.g. word boundaries).
Overrides: object.__init__

Warning: When an io.IOBase object is passed as the stream parameter, its read() method is used to read the stream, which can be time consuming. Avoid passing an io.IOBase object during the prediction process.

count_tokens(self)

Check the number of tokens left.

Returns: int
The number of tokens left.
Overrides: Tokenizer.count_tokens

count_chars(self)

Count the number of characters in the stream.

Returns: int
The number of characters in the stream.
Overrides: Tokenizer.count_chars

Note: Should return the same value as the wc Unix command.

has_more_tokens(self)

Test if at least one token remains.

Returns: bool
True if at least one token is left in the stream, False otherwise.
Overrides: Tokenizer.has_more_tokens

next_token(self)

Retrieve the next token in the stream.

Returns: str
The next token, or '' if no token remains.
Overrides: Tokenizer.next_token
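
Example (sketch): has_more_tokens() and next_token() are naturally used together in a drain loop. The code below is a minimal sketch, not tipy's implementation. SketchTokenizer is a hypothetical stand-in that works on an in-memory string with a reduced set of blankspaces and separators, so the loop can run without the package installed. Its splitting rules are assumptions; only the calling pattern in drain() is meant to carry over to ForwardTokenizer.

```python
class SketchTokenizer:
    """Hypothetical stand-in mimicking the documented interface (not tipy code)."""

    def __init__(self, text, lowercase=False,
                 blankspaces=" \n\r\t", separators=".,;:!?"):
        # Assumption: lowercase=True lowercases the text before tokenizing.
        self.text = text.lower() if lowercase else text
        self.blankspaces = blankspaces
        self.separators = separators
        self.offset = 0

    def _is_boundary(self, char):
        # A boundary is either a blankspace or a separator character.
        return char in self.blankspaces or char in self.separators

    def _skip_boundaries(self):
        # Advance the offset past any run of boundary characters.
        while (self.offset < len(self.text)
               and self._is_boundary(self.text[self.offset])):
            self.offset += 1

    def has_more_tokens(self):
        # True if a non-boundary character remains after the offset.
        self._skip_boundaries()
        return self.offset < len(self.text)

    def next_token(self):
        # Return the next token, or '' if no token remains.
        self._skip_boundaries()
        start = self.offset
        while (self.offset < len(self.text)
               and not self._is_boundary(self.text[self.offset])):
            self.offset += 1
        return self.text[start:self.offset]


def drain(tokenizer):
    """Collect every remaining token using the documented method pair."""
    tokens = []
    while tokenizer.has_more_tokens():
        tokens.append(tokenizer.next_token())
    return tokens


tokens = drain(SketchTokenizer("Hello, world! Bye.", lowercase=True))
print(tokens)  # ['hello', 'world', 'bye']
```

Checking has_more_tokens() before each call avoids having to treat the '' return value of next_token() as an end-of-stream sentinel.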

progress(self)

Return the progress percentage.

Returns: float
The tokenization progress percentage.
Overrides: Tokenizer.progress
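
Example (sketch): this page does not state how the percentage is computed. One plausible reading, given that reset_stream() resets an offset and count_chars() reports the stream length, is the share of characters already consumed. The helper below is a hypothetical illustration of that assumed formula, not the library's code; the name progress_percent and the handling of an empty stream are both assumptions.

```python
def progress_percent(offset, total_chars):
    """Assumed formula: characters consumed so far, as a percentage."""
    if total_chars == 0:
        # Assumption: an empty stream is treated as fully processed.
        return 100.0
    return 100.0 * offset / total_chars


print(progress_percent(5, 20))  # 25.0
```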

reset_stream(self)

Reset the offset to 0.

Overrides: Tokenizer.reset_stream
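
Example (sketch): resetting the offset lets the same tokenizer be read again from the start. The class below is a hypothetical stand-in (whitespace-only splitting over an in-memory string), written only to illustrate that behavior. It is not tipy code, and the attribute name offset is an assumption.

```python
class RewindableWords:
    """Hypothetical stand-in: whitespace-delimited tokens with a resettable offset."""

    def __init__(self, text):
        self.text = text
        self.offset = 0

    def next_token(self):
        # Skip leading whitespace, then read up to the next whitespace.
        end = len(self.text)
        while self.offset < end and self.text[self.offset].isspace():
            self.offset += 1
        start = self.offset
        while self.offset < end and not self.text[self.offset].isspace():
            self.offset += 1
        return self.text[start:self.offset]

    def reset_stream(self):
        # Reset the offset to 0 so tokenization restarts from the beginning.
        self.offset = 0


words = RewindableWords("one two")
first_pass = [words.next_token(), words.next_token()]
exhausted = words.next_token()    # '' because no token remains
words.reset_stream()
after_reset = words.next_token()  # 'one' again
```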