Whoosh provides methods for computing the “key terms” of a set of documents. For these methods, “key terms” basically means terms that are frequent in the given documents, but relatively infrequent in the indexed collection as a whole.
Because this is a purely statistical operation, not a natural language processing or AI function, the quality of the results will vary based on the content, the size of the document collection, and the number of documents for which you extract keywords.
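To make the statistical idea concrete, here is a minimal, hypothetical sketch of such a scorer in plain Python. This is not Whoosh's actual implementation; it only illustrates the principle that a term scores highly when it is frequent in the given documents but rare in the collection as a whole. All names here (`key_terms`, `collection_freqs`, etc.) are invented for the example.

```python
from collections import Counter
from math import log

def key_terms(doc_terms, collection_freqs, total_terms, numterms=5):
    """Score terms that are frequent in doc_terms but rare in the
    collection overall (a TF-IDF-like heuristic, for illustration only)."""
    tf = Counter(doc_terms)
    scores = {}
    for term, freq in tf.items():
        # Relative frequency of the term in the whole collection.
        coll_rel = collection_freqs.get(term, 1) / total_terms
        # High local frequency and low collection frequency -> high score.
        scores[term] = freq * log(1.0 / coll_rel, 2)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:numterms]
```

For example, with a collection where "the" appears 1000 times and "whoosh" twice, a document containing both would rank "whoosh" far above "the", even if "the" occurs more often locally.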
These methods can be useful for providing the following features to users:
Get more documents like a certain search hit. This requires that the field you want to match on is vectored or stored, or that you have access to the original text (such as from a database).
results = mysearcher.search(myquery)
first_hit = results[0]
more_results = first_hit.more_like_this("content")
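Conceptually, "more like this" amounts to taking the hit's key terms and ranking other documents by how strongly they share them. The following is a rough, hypothetical sketch of that idea in plain Python (Whoosh's real implementation builds and runs a query instead; the names `more_like_this_terms` and `candidates` are invented for the example):

```python
def more_like_this_terms(hit_terms, candidates, numterms=5):
    """Rank candidate documents by how many of the hit's top key terms
    they contain. candidates maps a document id to its list of terms."""
    wanted = set(hit_terms[:numterms])
    scored = []
    for docid, terms in candidates.items():
        overlap = len(wanted & set(terms))
        if overlap:
            scored.append((docid, overlap))
    # Documents sharing more key terms rank higher.
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```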
Extract keywords for the top N documents in a whoosh.searching.Results object. This requires that the field is either vectored or stored.
For example, to extract five key terms from the content field of the top ten documents of a results object:
keywords = [keyword for keyword, score
            in results.key_terms("content", docs=10, numterms=5)]
Extract keywords for an arbitrary set of documents. This requires that the field is either vectored or stored.
For example, let’s say you have an index of emails. To extract key terms from the body field of emails whose emailto field contains firstname.lastname@example.org:
with email_index.searcher() as s:
    docnums = s.document_numbers(emailto=u"firstname.lastname@example.org")
    keywords = [keyword for keyword, score
                in s.key_terms(docnums, "body")]
Extract keywords from arbitrary text not in the index.
with email_index.searcher() as s:
    keywords = [keyword for keyword, score
                in s.key_terms_from_text("body", mytext)]
The ExpansionModel subclasses in the whoosh.classify module implement different weighting functions for keywords. These models are translated into Python from the original Java implementations in Terrier.
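As a rough illustration of what such a weighting function looks like, here is a simplified sketch of a Bo1-style ("Bose-Einstein") weight, which compares a term's frequency in the top documents against its expected frequency in the collection. This is an assumption-laden sketch, not the exact code in whoosh.classify, and the parameter names are invented for the example:

```python
from math import log

def bo1_weight(weight_in_top, weight_in_collection, collection_doc_count):
    """Sketch of a Bo1-style expansion weight (not Whoosh's exact code).
    Terms that are frequent in the top documents but rare in the
    collection receive the highest weights."""
    # f: the term's expected frequency under a random model.
    f = weight_in_collection / collection_doc_count
    return weight_in_top * log((1.0 + f) / f, 2) + log(1.0 + f, 2)
```

With this shape of formula, a term seen twice in the top documents but only twice in a 1000-document collection outweighs one seen twice locally but 500 times overall.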