The corpora module is very simple so far and consists of only one class.
The Corpus class is responsible for creating a new corpus and also represents a corpus as an object.
Exception raised when appending a document with a duplicate id.
Exception raised when a document is too big to fit in a chunk file.
Appends a new document to a corpus.
Static method for creating a new corpus in the given path. Additional properties can be given as named arguments.
Gets a document from a corpus by its id (random access).
Gets the document pointed to by an idx structure, which holds offset information for a chunk file.
Getter for a chunk. The default chunk is current_chunk. The method caches opened chunk files.
Returns the tuple (chunk, offset, header len, text len) for the given index.
Returns the index of idx for the given key.
Creates a new chunk with the next sequential chunk number.
Saves the properties of the corpus to the config file.
Saves all indexes to the appropriate files.
Tests whether new_size data will fit into the current chunk.
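The index tuple (chunk, offset, header len, text len) and the chunk-size test listed above can be pictured with a small in-memory model. This is only a sketch under assumed details: the helper names fits_in_chunk, append_document and read_document, the JSON encoding of headers, and the byte layout (headers immediately followed by text) are hypothetical and are not taken from the corpora source.

```python
import io
import json


def fits_in_chunk(current_size, new_size, chunk_size):
    """Return True if new_size more bytes still fit into the current chunk."""
    return current_size + new_size <= chunk_size


def append_document(chunk, chunk_number, headers, text):
    """Append one document to a chunk and return its index tuple."""
    header_bytes = json.dumps(headers).encode("utf-8")
    text_bytes = text.encode("utf-8")
    offset = chunk.seek(0, io.SEEK_END)  # new data starts at the current end
    chunk.write(header_bytes + text_bytes)
    return (chunk_number, offset, len(header_bytes), len(text_bytes))


def read_document(chunks, idx):
    """Read (headers, text) back using only the offset information in idx."""
    chunk_number, offset, header_len, text_len = idx
    chunk = chunks[chunk_number]
    chunk.seek(offset)
    headers = json.loads(chunk.read(header_len).decode("utf-8"))
    text = chunk.read(text_len).decode("utf-8")
    return headers, text


# Hypothetical usage: one in-memory chunk holding two documents.
chunks = {0: io.BytesIO()}
idx_a = append_document(chunks[0], 0, {"is_interesting": True}, u"This is some text")
idx_b = append_document(chunks[0], 0, {"is_interesting": False}, u"Other text")
```

Because each index tuple stores the offset and both lengths, a document can be read back with a single seek, without scanning the chunk from the start.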
The only argument, path, should point to the directory containing the corpus.
To create a new corpus, use the static method Corpus.create, which creates a new corpus in the given path. Additional properties can be given as named arguments.
Example:
Corpus.create('/tmp/test_corpus', name="My test corpus", chunk_size=1024*1024*10)
This creates an empty corpus named "My test corpus" with a 10 MB chunk size in the directory /tmp/test_corpus.
To append a new document to a corpus, use the add method.
Warning
You should assume that ident will be converted to a string, so 1 and '1' are the same ident and are not unique.
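The collision described in this warning can be illustrated with a minimal sketch. It assumes that identifiers are normalised with str() before being stored as index keys; the names DuplicateIdError and add_ident are hypothetical and only stand in for whatever the library actually does.

```python
class DuplicateIdError(Exception):
    """Hypothetical stand-in for the duplicate-id exception."""


def add_ident(index, ident, position):
    """Register ident in index, normalising it to a string first."""
    key = str(ident)
    if key in index:
        raise DuplicateIdError("duplicate id: %r" % key)
    index[key] = position


index = {}
add_ident(index, 1, 0)
try:
    add_ident(index, "1", 1)  # same key as the integer 1 after str()
    collided = False
except DuplicateIdError:
    collided = True
```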
Example:
c = Corpus('/tmp/test_corpus')
c.add(u'This is some text', 1, fetched_from_url="http://some.test/place", related_documents=[1,2,3])
c.add(u'Other text', 2, is_interesting=False)
Note
Documents are saved in the order in which they are appended. If you add three documents with ids 2, 1, 3, they will be served in that same order when accessed sequentially.
Warning
As you can see, you can add any header to a document. There is no pre-configuration of what can be set as a document header. This is very flexible, but at the same time it can lead to problems with the consistency of headers across the document collection. Be sure to append the same headers to every document in the corpus, or write your code so that it handles the KeyError raised by missing headers.
After adding new documents to a corpus, you need to sync the indexes to the filesystem. The save_indexes method saves all indexes to the appropriate files.
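Two common ways of coping with inconsistent headers are sketched below. The sketch assumes that headers behave like a plain Python dict (which the KeyError mentioned above suggests); the helper is_interesting and the sample documents are made up for illustration.

```python
def is_interesting(headers):
    """Read an optional header, falling back to False when it is absent."""
    try:
        return headers["is_interesting"]
    except KeyError:
        return False


# Sample (headers, text) tuples with deliberately different header sets.
documents = [
    ({"fetched_from_url": "http://some.test/place"}, u"This is some text"),
    ({"is_interesting": True}, u"Other text"),
]

# Option 1: catch KeyError inside a helper.
flags = [is_interesting(headers) for headers, text in documents]

# Option 2: use dict.get, which returns None for a missing header.
urls = [headers.get("fetched_from_url") for headers, text in documents]
```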
c.save_indexes()
The typical use of a corpus is to access all documents sequentially, one by one. Corpora supports iteration with generators.
Example:
c = Corpus('/tmp/test_corpus')
for (headers, text) in c:
    pass  # ... some processing
This reads the chunk files sequentially, which should be as fast as possible.
It is also possible to access a specific document by its id.
The get method provides this random access: it returns the document with the given id.
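How generator-based iteration preserves append order can be sketched with a toy class. ToyCorpus below is a hypothetical in-memory stand-in, not the real Corpus class, and storing the identifier under an "id" header is an assumption made only for this example.

```python
class ToyCorpus(object):
    """Hypothetical in-memory stand-in for a corpus."""

    def __init__(self):
        self._documents = []  # kept in append order

    def add(self, text, ident, **headers):
        headers["id"] = str(ident)  # assumed header name, for illustration
        self._documents.append((headers, text))

    def __iter__(self):
        # A generator: documents are yielded lazily, one at a time,
        # in the order in which they were appended.
        for headers, text in self._documents:
            yield headers, text


c = ToyCorpus()
for ident in (2, 1, 3):
    c.add(u"text %d" % ident, ident)

order = [headers["id"] for headers, text in c]
```

Yielding documents one at a time means the caller never needs to hold the whole collection in memory, which matters when chunk files are large.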
Examples:
c = Corpus('/tmp/test_corpus')
print(c[1])
print(c.get(1))
Both lines print the same document tuple (if it exists).
The standard Python len function is supported.
It returns the size of the document collection.
Example:
c = Corpus('/tmp/test_corpus')
print(len(c))
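The container-style access used in the examples above (c[1], c.get(1) and len(c)) relies on Python's special methods. The sketch below shows one way these could fit together; ToyLookup is a hypothetical class, and raising KeyError for an unknown id is an assumption, since the behaviour of the real Corpus for missing documents is not described here.

```python
class ToyLookup(object):
    """Hypothetical id-based lookup; not the real Corpus class."""

    def __init__(self, documents):
        # Keys are normalised to strings, so 1 and '1' address the same document.
        self._by_id = dict((str(k), v) for k, v in documents.items())

    def get(self, ident):
        # Assumption: an unknown id raises KeyError.
        return self._by_id[str(ident)]

    def __getitem__(self, ident):
        # c[ident] simply delegates to c.get(ident).
        return self.get(ident)

    def __len__(self):
        # len(c) is the number of stored documents.
        return len(self._by_id)


c = ToyLookup({
    1: ({"fetched_from_url": "http://some.test/place"}, u"This is some text"),
    2: ({"is_interesting": False}, u"Other text"),
})
```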