Comparing and searching strings

Compare two strings using compare():

>>> import ngram
>>> ngram.NGram.compare('Ham','Spam',N=1)

The NGram class extends the builtin set class with the ability to search for members by n-gram similarity.

Use search() to return similar items in a set, and find() to only return the most similar item:

>>> G = ngram.NGram(['joe','joseph','jon','john','sally'])
>>> G.search('jon')
[('jon', 1.0), ('john', 0.375), ('joe', 0.25), ('joseph', 0.18181818181818182)]
>>> G.search('jon', threshold=0.3)
[('jon', 1.0), ('john', 0.375)]
>>> G.find('jose')

Transforming items

By default no transformation is done, so items must be strings. By passing a key function, similar to the key parameter of the sorted builtin, arbitrary items can be indexed. For non-string items, generally pass str or unicode as the key function.

Below, we define a key function to index the lower-case version of a string, and use the key, pad and ngrams methods to examine the internal representations:

>>> def lower(s):
...     return s.lower()
>>> G = ngram.NGram(key=lower)
>>> G.key('AbC')
>>> G.pad('abc')
>>> list(G.split('abc'))
['$$a', '$ab', 'abc', 'bc$', 'c$$']

Searching with a lowercase query returns results, but there is no match if the query contains capitals:

>>> G.add('AbC')
>>> G.search('abcd')
[('AbC', 0.375)]
>>> G.find('abcd')
>>> G.search('AbC') == []
>>> G.find('AbC') == None

We can either lowercase the query, or use the “searchitem” or “finditem” methods that apply the key function to the query before searching:

>>> G.search('AbCD'.lower())
[('AbC', 0.375)]
>>> G.find(lower('AbCD'))
>>> G.searchitem('AbCD')
[('AbC', 0.375)]
>>> G.finditem('AbCD')

So long as the function can be found in __main__ or imported, NGram instances can be pickled:

>>> import pickle
>>> pickle.dumps(G)  
Traceback (most recent call last):
PicklingError: Can't pickle <function lower at ...>: it's not found as __main__.lower
>>> import string
>>> H1 = ngram.NGram(['ab cd'], key=string.capwords)
>>> text = pickle.dumps(H1)
>>> H2 = pickle.loads(text)
>>> list(H1)
['ab cd']
>>> list(H2)
['ab cd']

The key function can perform complex transformations:

>>> G = ngram.NGram(key=lambda x:(" ".join(x)).lower())
>>> G.add(("Joe","Bloggs"))
>>> G.search("jeo blogger")
[(('Joe', 'Bloggs'), 0.25)]
>>> G.searchitem(("Jeo", "Blogger"))
[(('Joe', 'Bloggs'), 0.25)]

Set Operations

The update, discard, difference_update, intersection_update and symmetric_difference update methods from the builtin set class have been overridden to maintain the integrity of the NGram index when performing them. These take any iterable as argument, including another NGram instance.

>>> G = ngram.NGram(['joe','joseph','jon','john','sally'])
>>> G.update(['jonathan'])
>>> sorted(list(G))
['joe', 'john', 'jon', 'jonathan', 'joseph', 'sally']
>>> G.discard('sally')
>>> sorted(list(G))
['joe', 'john', 'jon', 'jonathan', 'joseph']
>>> G.difference_update(ngram.NGram(['joe']))
>>> sorted(list(G))
['john', 'jon', 'jonathan', 'joseph']
>>> G.intersection_update(['james', 'joseph', 'joe', 'jon'])
>>> sorted(list(G))
['jon', 'joseph']
>>> G.symmetric_difference_update(ngram.NGram(['jimmy', 'jon']))
>>> sorted(list(G))
['jimmy', 'joseph']

Multi-byte characters

When used with byte-strings, NGram will split on byte boundaries which is incorrect if one character uses more than one byte. This is mainly a problem in Python 2 where you often handle encoded byte strings. In Python 3, you will generally be handed a unicode string.

Rule Of Thumb: Use Unicode strings with NGram unless you are certain that your encoded strings are plain ASCII.

In Python 2.x, NGram does work fine with ASCII byte-strings:

>>> index = ngram.NGram(N=3)
>>> list(index.ngrams(index.pad("abc")))
['$$a', '$ab', 'abc', 'bc$', 'c$$']

But, take é (code point 0xE9) for example. As a UTF-8 byte-string this takes up 2-bytes ('\xc3\xa9') and so will be split into 2 characters.

But as a Unicode strings, it is simply u'\xe9' ('\xe9' in Py3) and will be handled correctly as a single character.

>>> index = ngram.NGram(pad_len=1, N=3)
>>> import sys
>>> # Fails in Python 3 (non-ASCII forbidden in literal bytestrings)
>>> # list(index.split('é')) == ['$\xc3\xa9', '\xc3\xa9$']
>>> # Fails in Python 3 (the u'' syntax is removed)
>>> # list(index.split(u'\xe9')) == [u'$\xe9$']