NGram Module Documentation

ngram – A set class that supports lookup by N-gram string similarity

class ngram.NGram(items=None, threshold=0.0, warp=1.0, key=None, N=3, pad_len=None, pad_char=’$’, **kwargs)

A set that supports searching for members by N-gram string similarity.

In Python 2, items should be unicode string or a plain ASCII str (bytestring) - do not use UTF-8 or other multi-byte encodings, because multi-byte characters will be split up.

Parameters:
  • threshold (float in 0.0 ... 1.0) – minimum similarity for a string to be considered a match.
  • warp (float in 1.0 ... 3.0) – use warp greater than 1.0 to increase the similarity of shorter string pairs.
  • items ([item, ..]) – iteration of items to index for N-gram search.
  • N (int >= 2) – number of characters per n-gram.
  • pad_len (int in 0 ... N-1) – how many characters padding to add (defaults to N-1).
  • pad_char (str or unicode) – character to use for padding. Default is ‘$’, but consider using thenon-breaking space character, u'\xa0' (u"\u00A0").
  • key (function(item) -> str/unicode) – Function to convert items into string, default is no conversion. Recommended to use str or unicode for non-string items. Using anonymous function prevents NGram class from being pickled.

Instance variables:

Variables:
  • _grams – For each n-gram, the items containing it and the number of timesthe n-gram occurs in the item as {str:{item:int, ...}, ...}.
  • length – maps items to length of the padded string representations as {item:int, ...}.
add(item)

Add an item to the N-gram index (if it has not already been added).

>>> from ngram import NGram
>>> n = NGram()
>>> n.add("ham")
>>> list(n)
['ham']
>>> n.add("spam")
>>> sorted(list(n))
['ham', 'spam']
clear()

Remove all elements from this set.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> sorted(list(n))
['eggs', 'spam']
>>> n.clear()
>>> list(n)
[]
static compare(s1, s2, **kwargs)

Compares two strings and returns their similarity.

Parameters:
  • s1 – first string
  • s2 – second string
  • kwargs – additional keyword arguments passed to __init__.
Returns:

similarity between 0.0 and 1.0.

>>> from ngram import NGram
>>> NGram.compare('spa', 'spam')
0.375
>>> NGram.compare('ham', 'bam')
0.25
>>> NGram.compare('spam', 'pam') #N=2
0.375
>>> NGram.compare('ham', 'ams', N=1)
0.5
copy(items=None)

Return a new NGram object with the same settings, and referencing the same items. Copy is shallow in that each item is not recursively copied. Optionally specify alternate items to populate the copy.

>>> from ngram import NGram
>>> from copy import deepcopy
>>> n = NGram(['eggs', 'spam'])
>>> m = n.copy()
>>> m.add('ham')
>>> sorted(list(n))
['eggs', 'spam']
>>> sorted(list(m))
['eggs', 'ham', 'spam']
>>> p = n.copy(['foo', 'bar'])
>>> sorted(list(p))
['bar', 'foo']
difference(*others)

Return the difference of two or more sets as a new set.

>>> from ngram import NGram
>>> a = NGram(['spam', 'eggs'])
>>> b = NGram(['spam', 'ham'])
>>> list(a.difference(b))
['eggs']
difference_update(other)

Remove from this set all elements from other set.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> other = set(['spam'])
>>> n.difference_update(other)
>>> list(n)
['eggs']
discard(item)

Remove an element from a set if it is a member.

If the element is not a member, do nothing.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> n.discard('spam')
>>> n.discard('ham')
>>> list(n)
['eggs']
find(query, threshold=None)

Simply return the best match to the query, None on no match.

>>> from ngram import NGram
>>> n = NGram(["Spam","Eggs","Ham"], key=lambda x:x.lower(), N=1)
>>> n.find('Hom')
'Ham'
>>> n.find("Spom")
'Spam'
>>> n.find("Spom", 0.8)
finditem(item, threshold=None)

Return most similar item to the provided one, or None if nothing exceeds the threshold.

>>> from ngram import NGram
>>> n = NGram([(0, "Spam"), (1, "Ham"), (2, "Eggsy"), (3, "Egggsy")],
...     key=lambda x:x[1].lower())
>>> n.finditem((3, 'Hom'))
(1, 'Ham')
>>> n.finditem((4, "Oggsy"))
(2, 'Eggsy')
>>> n.finditem((4, "Oggsy"), 0.8)
intersection(*others)

Return the intersection of two or more sets as a new set.

>>> from ngram import NGram
>>> a = NGram(['spam', 'eggs'])
>>> b = NGram(['spam', 'ham'])
>>> list(a.intersection(b))
['spam']
intersection_update(*others)

Update the set with the intersection of itself and other sets.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> other = set(['spam', 'ham'])
>>> n.intersection_update(other)
>>> list(n)
['spam']
items_sharing_ngrams(query)

Retrieve the subset of items that share n-grams the query string.

Parameters:query – look up items that share N-grams with this string.
Returns:mapping from matched string to the number of shared N-grams.
>>> from ngram import NGram
>>> n = NGram(["ham","spam","eggs"])
>>> sorted(n.items_sharing_ngrams("mam").items())
[('ham', 2), ('spam', 2)]
key(item)

Get the key string for the item.

>>> from ngram import NGram
>>> n = NGram(key=lambda x:x[1])
>>> n.key((3,"ham"))
'ham'
static ngram_similarity(samegrams, allgrams, warp=1.0)

Similarity for two sets of n-grams.

Note:

similarity = (a**e - d**e)/a**e where a is “all n-grams”, d is “different n-grams” and e is the warp.

Parameters:
  • samegrams – number of n-grams shared by the two strings.
  • allgrams – total of the distinct n-grams across the two strings.
Returns:

similarity in the range 0.0 to 1.0.

>>> from ngram import NGram
>>> NGram.ngram_similarity(5, 10)
0.5
>>> NGram.ngram_similarity(5, 10, warp=2)
0.75
>>> NGram.ngram_similarity(5, 10, warp=3)
0.875
>>> NGram.ngram_similarity(2, 4, warp=2)
0.75
>>> NGram.ngram_similarity(3, 4)
0.75
ngrams(string)

Alias for 3.1 compatibility, please set pad_len=0 and use split.

ngrams_pad(string)

Alias for 3.1 compatibility, please use split instead.

pad(string)

Pad a string in preparation for splitting into ngrams.

>>> from ngram import NGram
>>> n = NGram()
>>> n.pad('ham')
'$$ham$$'
pop()

Remove and return an arbitrary set element. Raises KeyError if the set is empty.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> x = n.pop()
>>> len(n)
1
remove(item)

Remove an item from the set. Inverts the add operation.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> n.remove('spam')
>>> list(n)
['eggs']
search(query, threshold=None)

Search the index for items whose key exceeds threshold similarity to the query string.

Parameters:query – returned items will have at least threshold similarity to the query string.
Returns:list of pairs of (item, similarity) by decreasing similarity.
>>> from ngram import NGram
>>> n = NGram([(0, "SPAM"), (1, "SPAN"), (2, "EG")], key=lambda x:x[1])
>>> sorted(n.search("SPA"))
[((0, 'SPAM'), 0.375), ((1, 'SPAN'), 0.375)]
>>> n.search("M")
[((0, 'SPAM'), 0.125)]
>>> n.search("EG")
[((2, 'EG'), 1.0)]
searchitem(item, threshold=None)

Search the index for items whose key exceeds the threshold similarity to the key of the given item.

Returns:list of pairs of (item, similarity) by decreasing similarity.
>>> from ngram import NGram
>>> n = NGram([(0, "SPAM"), (1, "SPAN"), (2, "EG"),
... (3, "SPANN")], key=lambda x:x[1])
>>> sorted(n.searchitem((2, "SPA"), 0.35))
[((0, 'SPAM'), 0.375), ((1, 'SPAN'), 0.375)]
split(string)

Pads a string and iterates over its ngrams.

>>> from ngram import NGram
>>> n = NGram()
>>> list(n.split("ham"))
['$$h', '$ha', 'ham', 'am$', 'm$$']
splititem(item)

Pads the string key of an item and iterates over its ngrams.

>>> from ngram import NGram
>>> n = NGram(key=lambda x:x[1])
>>> item = (3,"ham")
>>> list(n.splititem(item))
['$$h', '$ha', 'ham', 'am$', 'm$$']
symmetric_difference(other)

Return the symmetric difference of two sets as a new set.

>>> from ngram import NGram
>>> a = NGram(['spam', 'eggs'])
>>> b = NGram(['spam', 'ham'])
>>> sorted(list(a.symmetric_difference(b)))
['eggs', 'ham']
symmetric_difference_update(other)

Update the set with the symmetric difference of itself and other.

>>> from ngram import NGram
>>> n = NGram(['spam', 'eggs'])
>>> other = set(['spam', 'ham'])
>>> n.symmetric_difference_update(other)
>>> sorted(list(n))
['eggs', 'ham']
union(*others)

Return the union of two or more sets as a new set.

>>> from ngram import NGram
>>> a = NGram(['spam', 'eggs'])
>>> b = NGram(['spam', 'ham'])
>>> sorted(list(a.union(b)))
['eggs', 'ham', 'spam']
update(items)

Update the set with new items.

>>> from ngram import NGram
>>> n = NGram(["spam"])
>>> n.update(["eggs"])
>>> sorted(list(n))
['eggs', 'spam']