tipy.prdct.DictionaryPredictor

The dictionary is a file containing one word per line. Because this predictor does not use n-grams, it ignores context and is therefore less effective than the n-gram-based predictors.

__init__(self, config, contextMonitor, predictorName)
(Constructor)

DictionaryPredictor constructor.

Parameters:

config (drvr.Configuration) - The config is used to retrieve the predictor settings from the config file.
contextMonitor (ContextMonitor) - The contextMonitor is needed because it allows the predictor to get the tokens of the input buffers.
predictorName (str) - The custom name of the configuration using this predictor.

Overrides: object.__init__

Note: The string.lower() and string.strip() methods have a significant impact on performance: the profile module shows that they require almost 1 second of processing time when calculating suggestions for 10 contexts. The constructor therefore no longer reads the dictionary file directly. A database is created instead: every word of the dictionary is lowercased and stripped, then added to the database. This makes the predictor considerably faster. Profiling a script that queries suggestions for 10 successive contexts shows the improvement:

Calling lower() and strip() on each word of the file on every predict() call:

   ncalls  tottime  percall  cumtime  percall filename:lineno
   690048    0.468    0.000    0.468    0.000 :0(lower)

Creating an improved list upon initialization and using it on each predict() call (previous optimization method):
```
   ncalls  tottime  percall  cumtime  percall filename:lineno
   100046    0.059    0.000    0.059    0.000 :0(lower)
```
This is roughly 8 times faster (0.059 s versus 0.468 s). However, this profile mixes initialization with later computation: most of the time in the profile line above is spent initializing the list, so the gain on each individual predict() call is even larger.

Creating a database and querying it on each predict() call:

   ncalls  tottime  percall  cumtime  percall filename:lineno
   100046    0.059    0.000    0.059    0.000 :0(lower)

It is not faster than the previous method, but the database only has to be created once. Once it exists, the initialization time is close to zero and the querying time on each predict() call is even lower.
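
The idea of normalizing the words once and storing them in a database can be sketched as follows. This is a minimal illustration using the standard sqlite3 module, not the actual minr.DictMiner implementation: the table name `words` and the functions `build_word_database` and `words_with_prefix` are hypothetical names chosen for this example.

```python
import sqlite3


def build_word_database(dictionary_lines, db_path=":memory:"):
    """Normalize each dictionary line once and store it in a SQLite table.

    The `words` table is a hypothetical schema used for illustration only.
    """
    connection = sqlite3.connect(db_path)
    connection.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")
    # strip() and lower() run once per word here, never again at query time.
    normalized = (
        (line.strip().lower(),) for line in dictionary_lines if line.strip()
    )
    # INSERT OR IGNORE silently skips words that are already stored.
    connection.executemany("INSERT OR IGNORE INTO words (word) VALUES (?)", normalized)
    connection.commit()
    return connection


def words_with_prefix(connection, prefix, limit):
    """Return up to `limit` stored words starting with `prefix`, sorted alphabetically."""
    # Escape LIKE wildcards so that the prefix is matched literally.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = connection.execute(
        "SELECT word FROM words WHERE word LIKE ? ESCAPE '\\' ORDER BY word LIMIT ?",
        (escaped + "%", limit),
    )
    return [row[0] for row in rows]
```

With a file-backed `db_path`, the table would only need to be filled the first time; later runs could open the existing file and query it directly, which is the benefit the note describes.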

Change Log:

08/06/15: The method now creates an ordered, optimized list containing the dictionary words upon initialization in order to increase the speed of the predictor.
13/06/15: The method now uses a database containing the dictionary words. See: minr.DictMiner

get_dict_range(self, prefix)

Select the dictionary range where words start with the given prefix.

A suggested word must complete the given token, which means that all suggested words start with this token, here called the prefix. This method creates a list containing the suggested words for the given prefix, i.e. every word of the dictionary list starting with the prefix. This is easy because the dictionary list is ordered. For instance:

If the prefix is:

   'hell'

And the dictionary list is:

   ['bird', 'blue', 'given', 'hair', 'hellish', 'hello', 'red', 'zip']

We first remove the words of the list one by one until we reach a word which actually starts with the prefix 'hell'. We then have:

   ['hellish', 'hello', 'red', 'zip']

Finally, we scan the remaining list. When we reach a word which does not start with the given prefix, we know that none of the following words will start with it either, because the list is ordered. So we have:

   ['hellish', 'hello']
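
The two steps above can be sketched with the standard itertools helpers. This is only an illustration of the described scan, not the original method body; the function name `dict_range` is a hypothetical stand-in.

```python
from itertools import dropwhile, takewhile


def dict_range(sorted_words, prefix):
    """Return the contiguous run of words in sorted_words that start with prefix.

    Hypothetical helper; assumes sorted_words is in alphabetical order.
    """
    # Step 1: skip words until the first one that starts with the prefix.
    tail = dropwhile(lambda word: not word.startswith(prefix), sorted_words)
    # Step 2: keep words while they still start with the prefix. Because the
    # list is sorted, nothing after the first mismatch can match.
    return list(takewhile(lambda word: word.startswith(prefix), tail))


words = ['bird', 'blue', 'given', 'hair', 'hellish', 'hello', 'red', 'zip']
print(dict_range(words, 'hell'))  # ['hellish', 'hello']
```

On a large sorted list, bisect.bisect_left could locate the start of the range in logarithmic time instead of scanning from the beginning.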

Parameters:

prefix (str) - The prefix from which the range of suggested words is computed.

Deprecated: This method is no longer needed since the words are now stored in a database.

predict(self, maxPartialPredictionSize, stopList)

Complete the current word or predict the next word using the dictionary.

Use the input buffers (via contextMonitor) and the word dictionary to predict the most probable suggestions. A suggestion is a word which can:

Predict the end of the word, i.e. complete the current partial word (the user has not finished typing the word, so we try to predict its end).
Predict the next word (the user has typed a separator after a word, so we try to predict the next word before they start typing it).

In order to compute the suggestions, this method:

Retrieves the last token from the left input buffer.
Loops over each word in the dictionary:
- If the word starts with the last token retrieved, it is added to the suggestion list, provided the maximum number of suggestions has not been reached yet. It is not necessary to check whether the word is already in the suggestion list because a word should appear only once in a dictionary. In any case, the merger will merge duplicate suggestions.
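
The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual predict() body: the real method reads the token from contextMonitor and returns a Prediction object, whereas this hypothetical `predict_from_dictionary` function takes the token and an iterable of words directly and returns a plain list.

```python
def predict_from_dictionary(last_token, dictionary_words, max_suggestions, stop_list=()):
    """Collect up to max_suggestions dictionary words that complete last_token.

    Hypothetical helper: token and words are passed in directly instead of
    being read from a context monitor and a database.
    """
    suggestions = []
    if max_suggestions <= 0:
        return suggestions
    stop_words = set(stop_list)  # set membership tests are cheap
    for word in dictionary_words:
        # Skip words that do not complete the token or that are undesirable.
        if not word.startswith(last_token) or word in stop_words:
            continue
        suggestions.append(word)
        # Stop as soon as the requested number of suggestions is reached.
        if len(suggestions) >= max_suggestions:
            break
    return suggestions
```

In this sketch an empty token matches every word, since every Python string starts with the empty string.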

Parameters:

maxPartialPredictionSize (int) - Maximum number of suggestions to compute. If this number is reached, the suggestions list is returned immediately. DatabaseConnector.ngram_table_tp() returns the records in descending order according to their number of occurrences, so the most probable suggestions are added to the list first. This results in no loss of suggestion quality, regardless of the desired number of suggestions.
stopList (list) - The stoplist is a list of undesirable words. Any suggestion which is in the stopList won't be added to the suggestions list.

Returns: Prediction

A list of all possible suggestions (limited to maxPartialPredictionSize).

Overrides: Predictor.predict

Class DictionaryPredictor

__init__(self, config, contextMonitor, predictorName)
(Constructor)

init_database_connector(self)

get_dict_range(self, prefix)

predict(self, maxPartialPredictionSize, stopList)

learn(self, text)
