Package prest :: Module tknz :: Class TextTokenizer

Class TextTokenizer

object --+    
         |    
 Tokenizer --+
             |
            TextTokenizer

Tokenizer to tokenize a text file.

This tokenizer receives a text file and generates n-grams of a given size "n". The text miner uses it to generate the n-grams that are inserted into a database.

Instance Methods

__init__(self, infile, n, lowercase=False, cutoff=0, callback=None)
TextTokenizer creator.

tknize_text(self)
Tokenize a file and return a dictionary mapping its n-grams.

Inherited from object: __delattr__, __format__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __sizeof__, __str__, __subclasshook__

Inherited from Tokenizer

count_chars(self)

count_tokens(self)

has_more_tokens(self)

is_blankspace(self, char)
Test if a character is a blankspace.
Returns: bool

is_separator(self, char)
Test if a character is a separator.
Returns: bool

next_token(self)

progress(self)

reset_stream(self)

Class Variables

Inherited from Tokenizer

__metaclass__ = abc.ABCMeta

Properties

Inherited from object: __class__

Method Details

__init__(self, infile, n, lowercase=False, cutoff=0, callback=None)
(Constructor)

TextTokenizer creator.

Parameters:

infile (str) - Path to the file to tokenize.
n (int) - The n in n-gram. Specifies the maximum n-gram size to be created.
lowercase (bool) - If True, all tokens are converted to lowercase before being added to the dictionary. If False, token case is left untouched.
cutoff (int) - Minimum number of token occurrences. A token that doesn't appear more than this number of times is removed from the dictionary before it is returned.

Overrides: object.__init__
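
The cutoff rule described above can be illustrated with a minimal sketch. It is not the prest implementation: apply_cutoff is a hypothetical helper, and the strict "more than" comparison is an assumption taken from the wording of the cutoff parameter.

```python
def apply_cutoff(counts, cutoff=0):
    """Hypothetical helper: keep only entries counted more than `cutoff` times.

    Assumption: "doesn't appear more than this number" means count <= cutoff
    is dropped, so the comparison below is strict.
    """
    return {ngram: count for ngram, count in counts.items() if count > cutoff}


counts = {("in", "the", "second"): 4, ("right", "hand", "of"): 1}

# With cutoff=1, the n-gram seen only once is removed.
filtered = apply_cutoff(counts, cutoff=1)
```

Under this reading, the default cutoff=0 leaves the dictionary unchanged, since every counted n-gram appears at least once.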

tknize_text(self)

Tokenize a file and return a dictionary mapping its n-grams.

The dictionary looks like:

   { ('in',      'the',    'second'): 4,
     ('right',   'hand',   'of'):     1,
     ('subject', 'to',     'the'):    2,
     ('serious', 'rebuff', 'in'):     1,
     ('spirit',  'is',     'the'):    1 }
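
A dictionary of this shape can be produced by sliding a window of n tokens over the text. The sketch below only illustrates the idea and is not the actual TextTokenizer code: count_ngrams is a hypothetical function, it takes a string rather than a file path, it splits on whitespace only (the real class relies on is_separator and is_blankspace), and it assumes that only n-grams of exactly size n are counted.

```python
from collections import Counter


def count_ngrams(text, n, lowercase=False):
    """Hypothetical helper: map each n-gram (a tuple of tokens) to its count."""
    if lowercase:
        text = text.lower()
    # Assumption: tokens are separated by whitespace only.
    tokens = text.split()
    # Zip n shifted copies of the token list so each tuple is one window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(windows))


ngrams = count_ngrams("in the second in the second half", 3)
# ngrams[("in", "the", "second")] is 2; the other three trigrams occur once.
```

Tuples are used as keys because they are hashable and keep the token order, which matches the dictionary layout shown above.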
