Search / Comparison¶

Levenshtein¶

Basic Usage¶

Calculates the Levenshtein distance between a and b.

Usage:

>>> from web2py_utils import search

>>> leve = search.Levenshtein()

>>> leve.levenshtein('hello', 'hellp')
1

Suggestion¶

Compares a string to a list of strings and returns list of closest matches. Sorted by lowest distance first.

Usage:

>>> leve.suggestion('hello', ['hello', 'world', 'mello', 'hi', 'hell'],
                    number_of_matches=3)
['hello', 'hell', 'mello']

NCD¶

Performs an NCD (Normalized Compression Distance) comparison between two keys.

You can read some documentation on the algorithm here: http://www.sophos.com/blogs/sophoslabs/?p=188

These can either be the contents of a file or search terms.

This will be a float value between 0.0 and 1.1, where the lower the value the closer in similarity.

Basic Usage¶

Calculates the similarity between key1 and key2

Usage:

>>> from web2py_utils.search import ncd
>>> key1 = "h3ll0"
>>> key2 = "hello"
>>> key3 = "lehlo"
>>> key4 = "aeiou"

>>> ncd(key1, key2)
0.071428571428571425
>>> ncd(key2, key3)
0.024390243902439025
>>> ncd(key3, key4)
0.11904761904761904
>>> ncd(key1, key3)
0.095238095238095233
>>> ncd(key1, key4)
0.16666666666666666
>>> ncd(key2, key4)
0.095238095238095233

NGram¶

Usage:

# A simple example
>>> base = ['sdafaf', 'asfwef', 'asdfawe', 'adfwe', 'askfjwehiuasdfji']
>>> tg = search.Ngram(base, min_sim = 0.0)

>>> pprint.pprint(tg.getSimilarStrings('askfjwehiuasdfji'))
>>> print
>>> pprint.pprint(Ngram.compare('sdfeff', 'sdfeff'))
>>> print
>>> pprint.pprint(tg.getBestMatch('afadfwe', 2))

Constructor

PARAMETERS:

haystack - String or list of strings. This is where we will look for
           matches

OPTIONAL PARAMETERS:

min_sim     - minimum similarity score for a string to be considered
              worthy. Default = 0.0
warp        - If warp > 1 short strings are getting away better, if
              warp < 1 they are getting away worse. Default = 1.0
ic          - Ignore case? Default = False
only_alnum  - Only consider alphanumeric characters? Default = False
ngram_len   - n-gram size. Default = 3 (trigram)
padding     - padding size. Default = ngram_len - 1
noise       - noise characters that should be ignored when comparing

class web2py_utils.search.Ngram(haystack, **kwargs)¶

Used to compute similarities between strings. Can easily be used to compare one string to a whole batch of strings or pick the best string out of a list

This code was largely inspired by String::Trigram by Tarek Ahmed. Actually I just translated it to python ;)

You can find the original Perl-Module at http://search.cpan.org/dist/String-Trigram/

classmethod compare(s1, s2)¶

Simply compares two strings and returns the similarity score.

This is a class method. It can be called without instantiating the ngram class first. Example:

>> ngram = Ngram() >> ngram.compare(‘sfewefsf’, ‘sdfafwgah’) >> 0.050000000000000003

PARAMETERS:: s1 - First string s2 - Last string

computeActSimOld(samegrams, allgrams, warp=None)¶

Computes the similarity score between two sets of n-grams according to the following formula:

(a = all trigrams, d = different trigrams, e = warp)

(a**e - d**e)/a**e

PARAMETERS:: samegrams - n-grams that were found in the string allgrams - All n-grams in the search space warp - the warp factor. See __init__ for explanation
RETURNS:: Similarity score as float

computeSimilarity(string, siminfo)¶

Calculates the similarity of the given some information about n-grams. PARAMETERS:

string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.

(see getSimilarStrings)

RETURNS:: the score as float

compute_act_sim_old(samegrams, allgrams, warp=None)¶

Computes the similarity score between two sets of n-grams according to the following formula:

(a = all trigrams, d = different trigrams, e = warp)

(a**e - d**e)/a**e

PARAMETERS:: samegrams - n-grams that were found in the string allgrams - All n-grams in the search space warp - the warp factor. See __init__ for explanation
RETURNS:: Similarity score as float

compute_similarity(string, siminfo)¶

Calculates the similarity of the given some information about n-grams. PARAMETERS:

string - This is what we want to get the score of siminfo - A dictionary containing info about n-gram distribution.

(see getSimilarStrings)

RETURNS:: the score as float

getBestMatch(needle, count=1)¶

Returns the best match for the given string

PARAMETERS:: needle: The string to search for count: How many results to return
RETURNS:: String that best matched the supplied parameter and the score with which it matched.

getSimilarStrings(string)¶

Retrieves all strings that have a similarity higher than min_sim with “string”

PARAMETERS:

string: The string to compare with the search space

RETURNS:

Dictionary of scoring strings. Example output:

{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}

get_best_match(needle, count=1)¶

Returns the best match for the given string

PARAMETERS:: needle: The string to search for count: How many results to return
RETURNS:: String that best matched the supplied parameter and the score with which it matched.

get_similar_strings(string)¶

Retrieves all strings that have a similarity higher than min_sim with “string”

PARAMETERS:

string: The string to compare with the search space

RETURNS:

Dictionary of scoring strings. Example output:

{‘askfjwehiuasdfji’: 1.0, ‘asdfawe’: 0.17391304347826086}

ic(*args)¶

min_sim(*args)¶

ngram_len(*args)¶

ngramify(haystack, ic=None, only_alnum=None, padding=None, noise=None)¶

Takes list of strings and puts them into an index of ngrams. KEY is a ngram, VALUE a list of strings containing that ngram. VALUE has two KEYS:

grams: count of ngram occurring in string len: length of string with padding (if ignoring non alphanumerics,

this gets determined after those characters have been removed)

PARAMETERS:

haystack - List of strings to be indexed ic - Ignore Case? only_alnum - Only alphanumeric charactes? padding - Size of the padding noise - Noise characters that should be removed

RETURNS:

N-Gram index

Structure:

{: ‘abc’: ‘string1’:{‘grams’:0, ‘len’:11}, ‘bcd’: ‘string2’:{‘grams’:0, ‘len’:11}, ‘ing’: ‘string3’:{‘grams’:1, ‘len’:11}, ...

}

noise(*args)¶

only_alnum(*args)¶

padding(*args)¶

reInit(base)¶

Reinitialises the search space.

PARAMETERS:: base - The new search space

re_init(base)¶

Reinitialises the search space.

PARAMETERS:: base - The new search space

warp(*args)¶

Search / Comparison¶

Levenshtein¶

Basic Usage¶

Suggestion¶

NCD¶

Basic Usage¶

NGram¶

Table Of Contents

Previous topic

Next topic

This Page

Navigation

Search / Comparison¶

Levenshtein¶

Basic Usage¶

Suggestion¶

NCD¶

Basic Usage¶

NGram¶

Table Of Contents

Previous topic

Next topic

This Page

Quick search

Navigation