Calculates the Levenshtein distance between a and b.
Usage:
>>> from web2py_utils import search
>>> leve = search.Levenshtein()
>>> leve.levenshtein('hello', 'hellp')
1
Compares a string to a list of strings and returns list of closest matches. Sorted by lowest distance first.
Usage:
>>> leve.suggestion('hello', ['hello', 'world', 'mello', 'hi', 'hell'],
number_of_matches=3)
['hello', 'hell', 'mello']
Performs an NCD (Normalized Compression Distance) comparison between two keys.
You can read some documentation on the algorithm here: http://www.sophos.com/blogs/sophoslabs/?p=188
These can either be the contents of a file or search terms.
This will be a float value between 0.0 and 1.1, where the lower the value the closer in similarity.
Calculates the similarity between key1 and key2
Usage:
>>> from web2py_utils.search import ncd
>>> key1 = "h3ll0"
>>> key2 = "hello"
>>> key3 = "lehlo"
>>> key4 = "aeiou"
>>> ncd(key1, key2)
0.071428571428571425
>>> ncd(key2, key3)
0.024390243902439025
>>> ncd(key3, key4)
0.11904761904761904
>>> ncd(key1, key3)
0.095238095238095233
>>> ncd(key1, key4)
0.16666666666666666
>>> ncd(key2, key4)
0.095238095238095233
Usage:
# A simple example
>>> base = ['sdafaf', 'asfwef', 'asdfawe', 'adfwe', 'askfjwehiuasdfji']
>>> tg = search.Ngram(base, min_sim = 0.0)
>>> pprint.pprint(tg.getSimilarStrings('askfjwehiuasdfji'))
>>> print
>>> pprint.pprint(Ngram.compare('sdfeff', 'sdfeff'))
>>> print
>>> pprint.pprint(tg.getBestMatch('afadfwe', 2))
Constructor
PARAMETERS:
haystack - String or list of strings. This is where we will look for
matches
OPTIONAL PARAMETERS:
min_sim - minimum similarity score for a string to be considered
worthy. Default = 0.0
warp - If warp > 1 short strings are getting away better, if
warp < 1 they are getting away worse. Default = 1.0
ic - Ignore case? Default = False
only_alnum - Only consider alphanumeric characters? Default = False
ngram_len - n-gram size. Default = 3 (trigram)
padding - padding size. Default = ngram_len - 1
noise - noise characters that should be ignored when comparing
Used to compute similarities between strings. Can easily be used to compare one string to a whole batch of strings or pick the best string out of a list
This code was largely inspired by String::Trigram by Tarek Ahmed. Actually I just translated it to python ;)
You can find the original Perl-Module at http://search.cpan.org/dist/String-Trigram/
Simply compares two strings and returns the similarity score.
This is a class method. It can be called without instantiating the ngram class first. Example:
>> ngram = Ngram() >> ngram.compare(‘sfewefsf’, ‘sdfafwgah’) >> 0.050000000000000003
Computes the similarity score between two sets of n-grams according to the following formula:
(a = all trigrams, d = different trigrams, e = warp)
(a**e - d**e)/a**e
Calculates the similarity of the given some information about n-grams. PARAMETERS:
string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.
(see getSimilarStrings)
Computes the similarity score between two sets of n-grams according to the following formula:
(a = all trigrams, d = different trigrams, e = warp)
(a**e - d**e)/a**e
Calculates the similarity of the given some information about n-grams. PARAMETERS:
string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.
(see getSimilarStrings)
Returns the best match for the given string
Retrieves all strings that have a similarity higher than min_sim with “string”
Dictionary of scoring strings. Example output:
{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}
Returns the best match for the given string
Retrieves all strings that have a similarity higher than min_sim with “string”
Dictionary of scoring strings. Example output:
{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}
Takes list of strings and puts them into an index of ngrams. KEY is a ngram, VALUE a list of strings containing that ngram. VALUE has two KEYS:
grams: count of ngram occurring in string len: length of string with padding (if ignoring non alphanumerics,
this gets determined after those characters have been removed)
N-Gram index
Structure:
}
Reinitialises the search space.
Reinitialises the search space.