Package tipy :: Module tknz :: Class TextTokenizer

Class TextTokenizer


object --+    
         |    
 Tokenizer --+
             |
            TextTokenizer

Tokenizer to tokenize a text file.

This tokenizer receives a text file and generates n-grams of a given size "n". It is useful to the text miner for generating n-grams to be inserted into a database.


Nested Classes
    Inherited from Tokenizer
  __metaclass__
Metaclass for defining Abstract Base Classes (ABCs).
Instance Methods
 
__init__(self, infile, n, lowercase=False, cutoff=0, callback=None)
TextTokenizer constructor.
 
tknize_text(self)
Tokenize a file and return a dictionary mapping its n-grams.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

    Inherited from Tokenizer
 
count_chars(self)
 
count_tokens(self)
 
has_more_tokens(self)
bool
is_blankspace(self, char)
Test if a character is a blankspace.
bool
is_separator(self, char)
Test if a character is a separator.
 
next_token(self)
 
progress(self)
 
reset_stream(self)
Class Variables
    Inherited from Tokenizer
  __abstractmethods__ = frozenset(['count_chars', 'count_tokens'...
  _abc_cache = <_weakrefset.WeakSet object at 0x7f2a42321710>
  _abc_negative_cache = <_weakrefset.WeakSet object at 0x7f2a423...
  _abc_negative_cache_version = 39
  _abc_registry = <_weakrefset.WeakSet object at 0x7f2a42321690>
Properties

Inherited from object: __class__

Method Details

__init__(self, infile, n, lowercase=False, cutoff=0, callback=None)
(Constructor)


TextTokenizer constructor.

Parameters:
  • infile (str) - Path to the file to tokenize.
  • n (int) - The n in n-gram; specifies the maximum n-gram size to be created.
  • lowercase (bool) - If True, all tokens are converted to lowercase before being added to the dictionary. If False, token case remains untouched.
  • cutoff (int) - The minimum number of token occurrences. If a token doesn't appear more than this number of times, it is removed from the dictionary before it is returned.
Overrides: object.__init__
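The cutoff parameter acts as a frequency filter on the resulting dictionary. A minimal self-contained sketch of that behavior (the helper name and data below are illustrative, not taken from tipy's source):

```python
def apply_cutoff(ngram_counts, cutoff):
    """Drop n-grams that do not occur more than `cutoff` times.

    Hypothetical helper illustrating the cutoff semantics described
    above; tipy's internal implementation may differ.
    """
    return {ngram: count for ngram, count in ngram_counts.items()
            if count > cutoff}

counts = {('to', 'the'): 3, ('of', 'a'): 1}
filtered = apply_cutoff(counts, cutoff=1)
# ('of', 'a') occurs only once, so a cutoff of 1 removes it.
```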

tknize_text(self)


Tokenize a file and return a dictionary mapping its n-grams.

The dictionary looks like:

   { ('in',      'the',    'second'): 4,
     ('right',   'hand',   'of'):     1,
     ('subject', 'to',     'the'):    2,
     ('serious', 'rebuff', 'in'):     1,
     ('spirit',  'is',     'the'):    1 }
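A dictionary of this shape can be produced by sliding a window of size n over the token stream and counting each window as a tuple. The following is a minimal sketch of that idea under the parameters documented above; the function name is hypothetical and this is not tipy's actual implementation:

```python
from collections import Counter

def count_ngrams(tokens, n, lowercase=False, cutoff=0):
    """Count n-grams of size n over a token list (illustrative only)."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    # Slide a window of length n across the token stream.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(ngrams)
    # Apply the cutoff: keep only n-grams seen more than `cutoff` times.
    return {gram: c for gram, c in counts.items() if c > cutoff}

tokens = "subject to the subject to the right hand of".split()
result = count_ngrams(tokens, 3)
# ('subject', 'to', 'the') appears twice; every other 3-gram once.
```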