engine
Pythonic wrapper around PyLucene search engine.
Provides high-level interfaces to indexes and documents, abstracting away java lucene primitives.
- TokenFilter, Analyzer, IndexSearcher, MultiSearcher, IndexWriter, Indexer, ParallelIndexer
- Document, Field, MapField, NestedField, DocValuesField, NumericField, DateTimeField
- Query, SortField, TermsFilter
- PointField, PolygonField
indexers
Wrappers for lucene Index{Read,Search,Writ}ers.
The final Indexer class exposes a high-level Searcher and Writer.
TokenStream
TokenFilter
class lupyne.engine.indexers.TokenFilter(input)
Bases: PythonTokenFilter, lupyne.engine.indexers.TokenStream
Create an iterable lucene TokenFilter from a TokenStream. Subclass and override incrementToken() or setattrs().
Analyzer
class lupyne.engine.indexers.Analyzer(tokenizer, *filters)
Return a lucene Analyzer which chains together a tokenizer and filters.
Parameters:
- tokenizer – lucene Analyzer or Tokenizer factory
- filters – lucene TokenFilters
static parse(query, field='', op='', version=None, parser=None, **attrs)
Return parsed lucene Query.
Parameters:
- query – query string
- field – default query field name, sequence of names, or boost mapping
- op – default query operator (‘or’, ‘and’)
- version – lucene Version
- parser – custom PythonQueryParser class
- attrs – additional attributes to set on the parser
IndexReader
class lupyne.engine.indexers.IndexReader(reader)
Delegated lucene IndexReader, with a mapping interface of ids to document objects.
Parameters:
- reader – lucene IndexReader
copy(dest, query=None, exclude=None, merge=0)
Copy the index to the destination directory. Optimized to use hard links if the destination is a file system path.
Parameters:
- dest – destination directory path or lucene Directory
- query – optional lucene Query to select documents
- exclude – optional lucene Query to exclude documents
- merge – optionally merge into maximum number of segments
directory
reader’s lucene Directory
docs(name, value, counts=False)
Generate doc ids which contain given term, optionally with frequency counts.
morelikethis(doc, *fields, **attrs)
Return MoreLikeThis query for document.
Parameters:
- doc – document id or text
- fields – document fields to use, optional for termvectors
- attrs – additional attributes to set on the morelikethis object
names(**attrs)
Return field names, given option description.
Changed in version 1.2: lucene requires FieldInfo filter attributes instead of option
numbers(name, step=0, type=int, counts=False)
Generate decoded numeric term values, optionally with frequency counts.
Parameters:
- name – field name
- step – precision step to select terms
- type – int or float
- counts – include frequency counts
path
FSDirectory path
positions(name, value, payloads=False, offsets=False)
Generate doc ids and positions which contain given term, optionally with offsets, or only ones with payloads.
positionvector(id, field, offsets=False)
Generate terms and positions for given doc id and field, optionally with character offsets.
readers
segment readers
segments
segment filenames with document counts
spans(query, positions=False, payloads=False)
Generate docs with occurrence counts for a span query.
Parameters:
- query – lucene SpanQuery
- positions – optionally include slice positions instead of counts
- payloads – optionally only include slice positions with payloads
terms(name, value='', stop=None, counts=False, distance=0)
Generate a slice of term values, optionally with frequency counts. Supports a range of terms, wildcard terms, or fuzzy terms.
Parameters:
- name – field name
- value – term prefix or lower bound for range terms
- stop – optional upper bound for range terms
- counts – include frequency counts
- distance – maximum edit distance for fuzzy terms
termvector(id, field, counts=False)
Generate terms for given doc id and field, optionally with frequency counts.
timestamp
timestamp of reader’s last commit
IndexSearcher
class lupyne.engine.indexers.IndexSearcher(directory, analyzer=None)
Bases: IndexSearcher, lupyne.engine.indexers.IndexReader
Inherited lucene IndexSearcher, with a mixed-in IndexReader.
Parameters:
- directory – directory path, lucene Directory, or lucene IndexReader
- analyzer – lucene Analyzer, default StandardAnalyzer
filters
Mapping of cached filters by field, also used for facet counts.
sorters
Mapping of cached sorters by field and associated parsers.
spellcheckers
Mapping of cached spellcheckers by field.
termsfilters
Set of registered termsfilters.
comparator(field, type='string', parser=None, multi=False)
Return cache of field values suitable for sorting, using a cached SortField if available. Parsing values into an array is memory optimized. Map values into a list for speed optimization. Comparators are not thread-safe.
Parameters:
- field – field name
- type – type object or name compatible with FieldCache
- parser – lucene FieldCache.Parser or callable applied to field values
- multi – retrieve multi-valued string terms as a tuple
correct(field, text, distance=2)
Generate potential words ordered by increasing edit distance and decreasing frequency. For optimal performance, iterate only the required slice size of corrections.
Parameters:
- distance – the maximum edit distance to consider for enumeration
distances(lng, lat, lngfield, latfield)
Return distance comparator computed from cached lat/lng fields.
facets(query, *keys)
Return mapping of document counts for the intersection with each facet.
Changed in version 1.6: filters are no longer implicitly cached, a GroupingSearch is used instead
Parameters:
- query – query string, lucene Query, or lucene Filter
- keys – field names, term tuples, or any keys to previously cached filters
groupby(field, query, filter=None, count=None, start=0, **attrs)
Return Hits grouped by field using a GroupingSearch.
highlighter(query, field, **kwargs)
Return Highlighter or, if applicable, FastVectorHighlighter specific to searcher and query.
classmethod load(directory, analyzer=None)
Open IndexSearcher with a lucene RAMDirectory, loading index into memory.
match(document, *queries)
Generate scores for all queries against a given document mapping.
reopen(filters=False, sorters=False, spellcheckers=False)
Return current IndexSearcher, only creating a new one if necessary. Any registered termsfilters are also refreshed.
Parameters:
- filters – refresh cached facet filters
- sorters – refresh cached sorters with associated parsers
- spellcheckers – refresh cached spellcheckers
search(query=None, filter=None, count=None, sort=None, reverse=False, scores=False, maxscore=False, timeout=None, **parser)
Run query and return Hits.
Changed in version 1.4: sort param for lucene only; use Hits.sorted with a callable
Parameters:
- query – query string or lucene Query
- filter – lucene Filter
- count – maximum number of hits to retrieve
- sort – lucene Sort parameters
- reverse – reverse flag used with sort
- scores – compute scores for candidate results when sorting
- maxscore – compute maximum score of all results when sorting
- timeout – stop search after elapsed number of seconds
- parser – Analyzer.parse() options
sorter(field, type='string', parser=None, reverse=False)
Return SortField with cached attributes if available.
termsfilter(field, values=())
Return registered TermsFilter, which will be refreshed whenever the searcher is reopened.
New in version 1.7.
Note: This interface is experimental and might change in incompatible ways in the next release.
MultiSearcher
class lupyne.engine.indexers.MultiSearcher(reader, analyzer=None)
Bases: lupyne.engine.indexers.IndexSearcher
IndexSearcher with underlying lucene MultiReader.
Parameters:
- reader – directory paths, Directories, IndexReaders, or a single MultiReader
- analyzer – lucene Analyzer, default StandardAnalyzer
IndexWriter
class lupyne.engine.indexers.IndexWriter(directory=None, mode='a', analyzer=None, version=None, **attrs)
Bases: IndexWriter
Inherited lucene IndexWriter. Supports setting fields parameters explicitly, so documents can be represented as dictionaries.
Parameters:
- directory – directory path or lucene Directory, default RAMDirectory
- mode – file mode (rwa), except updating (+) is implied
- analyzer – lucene Analyzer, default StandardAnalyzer
- version – lucene Version argument passed to IndexWriterConfig or StandardAnalyzer, default is latest
- attrs – additional attributes to set on IndexWriterConfig
fields
Mapping of assigned fields. May be used directly, instead of the set() method, for further customization.
__len__()
add(document=(), **terms)
Add document() to index with optional boost.
classmethod check(directory, fix=False)
Check and optionally fix unlocked index, returning lucene CheckIndex.Status.
delete(*query, **options)
Remove documents which match given query or term.
Parameters:
- query – IndexSearcher.search() compatible query, or optimally a name and value
- options – additional Analyzer.parse() options
document(items=(), **terms)
Return lucene Document from mapping of field names to one or multiple values.
set(name, cls=lupyne.engine.documents.Field, **settings)
Assign settings to field name and return the field.
snapshot(*args, **kwds)
Return context manager of an index commit snapshot.
Changed in version 1.4: lucene identifies snapshots by commit generation
update(name, value='', document=(), **terms)
Atomically delete documents which match given term and add the new document().
Changed in version 1.7: update in-place if only DocValues are given
Indexer
class lupyne.engine.indexers.Indexer(directory=None, mode='a', analyzer=None, version=None, nrt=False, **attrs)
Bases: lupyne.engine.indexers.IndexWriter
An all-purpose interface to an index. Creates an IndexWriter with a delegated IndexSearcher.
Parameters:
- nrt – optionally use a near real-time searcher
commit(merge=False, **caches)
Commit writes and refresh() searcher.
Parameters:
- merge – merge segments with deletes, or optionally specify maximum number of segments
refresh(**caches)
Store refreshed searcher with IndexSearcher.reopen() caches.
ParallelIndexer
New in version 1.2.
class lupyne.engine.indexers.ParallelIndexer(field, *args, **kwargs)
Bases: lupyne.engine.indexers.Indexer
Indexer which tracks a unique identifying field. Handles atomic updates of rapidly changing fields, managing termsfilters. Assign custom settings or cache custom sorter for primary field if necessary.
termsfilters
Mapping of filters to synchronized termsfilters.
refresh(**caches)
Store refreshed searcher and synchronize termsfilters.
termsfilter(filter, *others)
Return TermsFilter synced to given filter and optionally associated with other indexers.
documents
Wrappers for lucene Fields and Documents.
Document
Changed in version 1.5: stored numeric types returned as numbers
class lupyne.engine.documents.Document(doc)
Bases: dict
Multimapping of field names to values, but default getters return the first value.
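The multimapping behavior described above can be pictured with a small pure-Python sketch. It is not lupyne's implementation: `MultiDict` and its `getlist` accessor are hypothetical names used for illustration, assuming each field maps to a list of values and plain lookups return the first one.

```python
class MultiDict(dict):
    """Hypothetical stand-in: maps each field name to a list of values."""

    def __init__(self, pairs=()):
        super().__init__()
        for name, value in pairs:
            # Repeated field names accumulate values instead of overwriting.
            super().setdefault(name, []).append(value)

    def __getitem__(self, name):
        # Default getter: return only the first value of the field.
        return super().__getitem__(name)[0]

    def get(self, name, default=None):
        values = super().get(name)
        return values[0] if values else default

    def getlist(self, name):
        # Explicit accessor (assumed name) for every value of a field.
        return list(super().get(name, ()))


doc = MultiDict([("tag", "python"), ("tag", "lucene"), ("title", "Lupyne")])
first_tag = doc["tag"]          # first value only
all_tags = doc.getlist("tag")   # every value, in insertion order
```

The design choice illustrated here is that single-valued fields stay convenient to read, while multi-valued fields remain reachable through an explicit accessor.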
Hit
class lupyne.engine.documents.Hit(doc, id, score, keys=())
Bases: lupyne.engine.documents.Document
A Document from a search result, with id, score, and optional sort keys.
Hits
class lupyne.engine.documents.Hits(searcher, scoredocs, count=None, maxscore=None, fields=None)
Search results: lazily evaluated and memory efficient. Provides a read-only sequence interface to hit objects.
Parameters:
- searcher – IndexSearcher which can retrieve documents
- scoredocs – lucene ScoreDocs
- count – total number of hits
- maxscore – maximum score
- fields – optional field selectors
Groups
New in version 1.6.
Note: This interface is experimental and might change in incompatible ways in the next release.
GroupingSearch
New in version 1.5.
Note: This interface is experimental and might change in incompatible ways in the next release.
class lupyne.engine.documents.GroupingSearch(field, sort=None, cache=True, **attrs)
Inherited lucene GroupingSearch with optimized faceting.
Parameters:
- field – unique field name to group by
- sort – lucene Sort to order groups and docs
- cache – use unlimited caching
- attrs – additional attributes to set
Field
Changed in version 1.6: lucene Field.{Store,Index,TermVector} dropped in favor of FieldType attributes
class lupyne.engine.documents.Field(name, stored=False, indexed=True, boost=1.0, **settings)
Saved parameters which can generate lucene Fields given values.
Parameters:
- name – name of field
- boost – boost factor
- stored, indexed, settings – lucene FieldType attributes
settings
dict representation of settings
MapField
class lupyne.engine.documents.MapField(name, func, **kwargs)
Bases: lupyne.engine.documents.Field
Field which applies a function across its values.
Parameters:
- func – callable
NestedField
class lupyne.engine.documents.NestedField(name, sep='.', tokenized=False, **kwargs)
Bases: lupyne.engine.documents.Field
Field which indexes every component into its own field. Original value may be stored for convenience.
Parameters:
- sep – field separator used on name and values
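As a rough illustration of how a separated name and value could expand into component fields, the sketch below pairs every leading prefix of the name with the matching prefix of the value. The helper names `nested_components` and `nested_items` are hypothetical, and the pairing rule is an assumption for illustration rather than a statement of lupyne's internals.

```python
def nested_components(text, sep="."):
    """Yield every leading prefix of a separated string (illustrative helper)."""
    parts = text.split(sep)
    for stop in range(1, len(parts) + 1):
        yield sep.join(parts[:stop])


def nested_items(name, value, sep="."):
    """Pair each prefix of the field name with the same-depth prefix of the value."""
    return list(zip(nested_components(name, sep), nested_components(value, sep)))


# A date-like value split on '-' yields one (field, value) pair per component depth.
items = nested_items("year-month-day", "2009-10-11", sep="-")
```

Under this assumption, a search on the coarsest component field matches every document sharing that prefix, which is what makes hierarchical values cheap to filter.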
DocValuesField
New in version 1.6.
class lupyne.engine.documents.DocValuesField(name, type)
Bases: lupyne.engine.documents.Field
Field which stores per-document values, used for efficient sorting.
Parameters:
- name – name of field
- type – lucene DocValuesType string
NumericField
Changed in version 1.5: recommended to specify initial int or float type
Changed in version 1.6: custom step removed in favor of numericPrecisionStep
class lupyne.engine.documents.NumericField(name, type=None, tokenized=False, **kwargs)
Bases: lupyne.engine.documents.Field
Field which indexes numbers in a prefix tree.
Parameters:
- name – name of field
- type – optional int, float, or lucene NumericType string
DateTimeField
class lupyne.engine.documents.DateTimeField(name, **kwargs)
Bases: lupyne.engine.documents.NumericField
Field which indexes datetimes as a NumericField of timestamps. Supports datetimes, dates, and any prefix of time tuples.
duration(date, days=0, **delta)
Return date range query within time span of date.
Parameters:
- date – origin date or tuple
- days, delta – timedelta parameters
within(days=0, weeks=0, utc=True, **delta)
Return date range query within current time and delta. If the delta is an exact number of days, then dates will be used.
Parameters:
- days, weeks – number of days to offset from today
- utc – optionally use utc instead of local time
- delta – additional timedelta parameters
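The range arithmetic behind duration() and within() can be sketched with the standard library alone. `span_bounds` is a hypothetical helper, not part of lupyne; it assumes the range is simply the sorted pair of the origin and the origin plus the delta, truncated to dates when the delta is a whole number of days.

```python
import datetime


def span_bounds(origin, days=0, **delta):
    """Return (start, stop) bounds for a time span around an origin (sketch only)."""
    span = datetime.timedelta(days, **delta)
    whole_days = span.seconds == 0 and span.microseconds == 0
    if whole_days and isinstance(origin, datetime.datetime):
        # An exact number of days: compare dates rather than timestamps.
        origin = origin.date()
    start, stop = sorted([origin, origin + span])
    return start, stop


origin = datetime.datetime(2020, 1, 10, 12, 30)
last_week = span_bounds(origin, days=-7)     # negative deltas look backwards
next_hours = span_bounds(origin, hours=6)    # sub-day deltas keep full datetimes
```

Sorting the two endpoints lets the same helper serve both past and future spans without a separate direction flag.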
queries
Query wrappers and search utilities.
Query
class lupyne.engine.queries.Query(base, *args, **attrs)
Inherited lucene Query, with dynamic base class acquisition. Uses class methods and operator overloading for convenient query construction.
__and__(other)
<BooleanQuery +self +other>
__or__(other)
<BooleanQuery self other>
__sub__(other)
<BooleanQuery self -other>
classmethod all(*queries, **terms)
Return BooleanQuery (AND) from queries and terms.
classmethod any(*queries, **terms)
Return BooleanQuery (OR) from queries and terms.
classmethod disjunct(multiplier, *queries, **terms)
Return lucene DisjunctionMaxQuery from queries and terms.
static filter(cache=True)
Return query as a filter, as specifically matching as possible, but defaulting to QueryWrapperFilter.
classmethod multiphrase(name, *values)
Return lucene MultiPhraseQuery. None may be used as a placeholder.
classmethod near(name, *values, **kwargs)
Return SpanNearQuery from terms. Term values which supply another field name will be masked.
classmethod phrase(name, *values, **attrs)
Return lucene PhraseQuery. None may be used as a placeholder.
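The operator overloading described for Query can be illustrated without lucene. The toy class `Q` below is hypothetical: it only builds strings in the `+required`, `optional`, `-prohibited` notation shown in the operator docstrings, and does not construct real BooleanQuery objects.

```python
class Q:
    """Hypothetical toy query whose string form mimics boolean clause notation."""

    def __init__(self, text, compound=False):
        self.text = text
        self.compound = compound

    def _clause(self):
        # Parenthesize nested boolean queries so clauses stay unambiguous.
        return "({})".format(self.text) if self.compound else self.text

    def __and__(self, other):
        # Both clauses required.
        return Q("+{} +{}".format(self._clause(), other._clause()), compound=True)

    def __or__(self, other):
        # Both clauses optional.
        return Q("{} {}".format(self._clause(), other._clause()), compound=True)

    def __sub__(self, other):
        # Second clause prohibited.
        return Q("{} -{}".format(self._clause(), other._clause()), compound=True)

    def __str__(self):
        return self.text


both = Q("text:lucene") & Q("lang:en")
either = Q("text:lucene") | Q("text:solr")
nested = (Q("text:lucene") & Q("lang:en")) - Q("status:draft")
```

Returning a new object from each operator keeps the operands unchanged, so sub-queries can be reused in several combinations.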
BooleanQuery
class lupyne.engine.queries.BooleanQuery(base, *args, **attrs)
Bases: lupyne.engine.queries.Query
Inherited lucene BooleanQuery with sequence interface to clauses.
SpanQuery
class lupyne.engine.queries.SpanQuery(base, *args, **attrs)
Bases: lupyne.engine.queries.Query
Inherited lucene SpanQuery with additional span constructors.
mask(name)
Return lucene FieldMaskingSpanQuery, which allows combining span queries from different fields.
TermsFilter
New in version 1.2.
class lupyne.engine.queries.TermsFilter(field, values=())
Bases: CachingWrapperFilter
Caching filter based on a unique field and set of matching values. Optimized for many terms and docs, with support for incremental updates. Suitable for searching external metadata associated with indexed identifiers. Call refresh() to cache a new (or reopened) searcher.
Parameters:
- field – field name
- values – initial term values, synchronized with the cached filters
SortField
class lupyne.engine.queries.SortField(name, type='string', parser=None, reverse=False)
Bases: SortField
Inherited lucene SortField used for caching FieldCache parsers.
Parameters:
- name – field name
- type – type object or name compatible with SortField constants
- parser – lucene FieldCache.Parser or callable applied to field values
- reverse – reverse flag used with sort
comparator(searcher, multi=False)
Return indexed values from default FieldCache using the given searcher.
Highlighter
class lupyne.engine.queries.Highlighter(searcher, query, field, terms=False, fields=False, tag='', formatter=None, encoder=None)
Bases: Highlighter
Inherited lucene Highlighter with stored analysis options.
Parameters:
- searcher – IndexSearcher used for analysis, scoring, and optionally text retrieval
- query – lucene Query
- field – field name of text
- terms – highlight any matching term in query regardless of position
- fields – highlight matching terms from any field
- tag – optional html tag name
- formatter – optional lucene Formatter
- encoder – optional lucene Encoder
FastVectorHighlighter
class lupyne.engine.queries.FastVectorHighlighter(searcher, query, field, terms=False, fields=False, tag='', fragListBuilder=None, fragmentsBuilder=None)
Bases: FastVectorHighlighter
Inherited lucene FastVectorHighlighter with stored query. Fields must be stored and have term vectors with offsets and positions.
Parameters:
- searcher – IndexSearcher with stored term vectors
- query – lucene Query
- field – field name of text
- terms – highlight any matching term in query regardless of position
- fields – highlight matching terms from any field
- tag – optional html tag name
- fragListBuilder – optional lucene FragListBuilder
- fragmentsBuilder – optional lucene FragmentsBuilder
SpellChecker
class lupyne.engine.queries.SpellChecker(*args, **kwargs)
Bases: dict
Correct spellings and suggest words for queries. Supply a vocabulary mapping words to (reverse) sort keys, such as document frequencies.
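The ordering used for corrections (increasing edit distance, then decreasing frequency) can be sketched in plain Python. This is an illustration under assumptions, not the library's algorithm: it uses a textbook Levenshtein distance over a small in-memory vocabulary, and the free function `correct` here is a hypothetical stand-in for the searcher method of the same name.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(vocabulary, text, distance=2):
    """Return words within the edit distance, closest first, then most frequent."""
    candidates = []
    for word, frequency in vocabulary.items():
        d = edit_distance(text, word)
        if d <= distance:
            # Negated frequency sorts higher counts first within a distance.
            candidates.append((d, -frequency, word))
    return [word for _, _, word in sorted(candidates)]


vocabulary = {"search": 40, "searcher": 12, "starch": 3, "march": 25, "index": 50}
suggestions = correct(vocabulary, "serch")
```

A brute-force scan like this is only practical for small vocabularies; it is meant to show the sort order, not an efficient enumeration.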
spatial
Geospatial fields.
Latitude/longitude coordinates are encoded into the quadkeys of MS Virtual Earth, which are also compatible with Google Maps and OSGEO Tile Map Service. See http://www.maptiler.org/google-maps-coordinates-tile-bounds-projection/.
The quadkeys are then indexed using a prefix tree, creating a cartesian tier of tiles.
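To make the quadkey idea concrete, here is a sketch that follows the publicly documented Web Mercator tile scheme. The function names (`tile`, `quadkey`, `encode`) are hypothetical, and lupyne's own encoding may differ in details such as axis orientation; the point is only that a coarser key is a prefix of a finer one, which is what makes a prefix tree of tiles possible.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator projection


def tile(lng, lat, precision):
    """Return (x, y) tile indices at the given zoom level."""
    size = 1 << precision
    lat = min(max(lat, -MAX_LAT), MAX_LAT)
    x = (lng + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)

    def clip(value):
        # Keep indices inside the grid even at the map edges.
        return min(max(int(value * size), 0), size - 1)

    return clip(x), clip(y)


def quadkey(x, y, precision):
    """Interleave the bits of x and y into a base-4 string, most significant first."""
    digits = []
    for level in range(precision, 0, -1):
        mask = 1 << (level - 1)
        digits.append(str(bool(x & mask) + 2 * bool(y & mask)))
    return "".join(digits)


def encode(lng, lat, precision):
    """Encode a coordinate as a quadkey of length `precision`."""
    return quadkey(*tile(lng, lat, precision), precision)


coarse = encode(-122.33, 47.6, 10)
fine = encode(-122.33, 47.6, 20)
```

Because each additional digit subdivides the parent tile into four, searching for all points in a tile reduces to a prefix match on the encoded value.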
Point
Tile
PointField
class lupyne.engine.spatial.PointField(name, precision=30, **kwargs)
Bases: lupyne.engine.documents.NumericField
Geospatial points, which create a tiered index of tiles. Points must still be stored if exact distances are required upon retrieval.
Parameters:
- precision – zoom level, i.e., length of encoded value
PolygonField
class lupyne.engine.spatial.PolygonField(name, precision=30, **kwargs)
Bases: lupyne.engine.spatial.PointField
PointField which implicitly supports polygons (technically linear rings of points). Differs from points in that all necessary tiles are included to match the points’ boundary. As with PointField, the tiered tiles are a search optimization, not a distance calculator.