Storage

Storages are the main components of ELeVE. They store the language model (a trie of n-grams) and answer queries on it.

ELeVE provides several storages. For reference documentation of the storage API, see eleve.memory.MemoryStorage: it is a pure-Python implementation that defines the reference API that all other storages follow.

Two types of storage are available, depending on your needs:

  • Memory storages, which are fast but limited by RAM size,
  • Disk storages, which are slower but limited only by disk space.

Memory Storage

Note

To use a memory storage you should import eleve.MemoryStorage:

>>> from eleve import MemoryStorage
>>> storage = MemoryStorage()

It is an alias to the best available memory storage (this should be eleve.c_memory.cmemory.MemoryStorage if the C++ extensions compiled successfully).

Construction

The memory storage constructor takes a single parameter, default_ngram_length, which is the maximum length of the n-grams that are stored:

>>> from eleve import MemoryStorage
>>> storage = MemoryStorage(default_ngram_length=3)

Training

To build a storage from a corpus, use the add_sentence() method:

>>> storage.add_sentence(["very", "small", "black", "cat"])
>>> storage.add_sentence(["super", "small", "red", "cat"])
>>> storage.add_sentence(["big", "black", "cat"])
>>> storage.add_sentence(["crazy", "dog"])
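When the corpus is raw text rather than token lists, each line has to be tokenized before it is passed to add_sentence(). The following is a minimal sketch, assuming plain whitespace tokenization is good enough. The train_from_lines helper and the RecordingStorage stand-in are hypothetical names introduced for illustration, not part of ELeVE; any object with an add_sentence() method can be passed instead.

```python
def train_from_lines(storage, lines):
    """Feed whitespace-tokenized lines to storage.add_sentence().

    Returns the number of sentences that were added.
    """
    added = 0
    for line in lines:
        tokens = line.split()
        if tokens:  # skip blank lines
            storage.add_sentence(tokens)
            added += 1
    return added


class RecordingStorage:
    """Hypothetical stand-in that only records sentences, so the sketch runs without ELeVE."""

    def __init__(self):
        self.sentences = []

    def add_sentence(self, tokens):
        self.sentences.append(list(tokens))


corpus = ["very small black cat", "", "crazy dog"]
demo = RecordingStorage()
count = train_from_lines(demo, corpus)
```

With a real storage, the same helper could be called on an open text file, since iterating over a file yields its lines.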

Querying

Storages provide a method to query the count of an n-gram:

>>> storage.query_count(["cat"])
3
>>> storage.query_count(["black", "cat"])
2

Counts are available for n-grams of every size up to the storage order. N-grams longer than the storage order have a count of zero:

>>> storage.query_count(["very", "small", "black", "cat"])
0
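To make the effect of the storage order concrete, here is a toy n-gram trie written for this page. It is a sketch under simplifying assumptions, not ELeVE's implementation: the NgramTrie class and its max_length parameter are hypothetical, and only forward counts are kept. It does reproduce the counts shown above for the four training sentences.

```python
class NgramTrie:
    """Toy n-gram counter (hypothetical; not ELeVE's implementation)."""

    def __init__(self, max_length=3):
        self.max_length = max_length
        self.root = {"count": 0, "children": {}}

    def add_sentence(self, tokens):
        # From each position, walk down the trie for at most max_length tokens,
        # incrementing the count of every prefix (i.e. every n-gram) on the way.
        for start in range(len(tokens)):
            node = self.root
            for token in tokens[start:start + self.max_length]:
                node = node["children"].setdefault(
                    token, {"count": 0, "children": {}}
                )
                node["count"] += 1

    def query_count(self, ngram):
        node = self.root
        for token in ngram:
            node = node["children"].get(token)
            if node is None:
                return 0  # never stored: unseen, or longer than max_length
        return node["count"]


trie = NgramTrie(max_length=3)
trie.add_sentence(["very", "small", "black", "cat"])
trie.add_sentence(["super", "small", "red", "cat"])
trie.add_sentence(["big", "black", "cat"])
trie.add_sentence(["crazy", "dog"])
```

Because no path deeper than max_length is ever created, a 4-gram query falls off the trie and returns zero even though the sequence occurs in the corpus.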

You can also query the autonomy of an n-gram:

>>> storage.query_autonomy(["black", "cat"])
1.9537...
>>> storage.query_autonomy(["small", "black"])
-0.1965...

Note that with an order of N, autonomy can only be computed for n-grams of size N-1 at most:

>>> storage.query_autonomy(["small", "black", "cat"])
nan
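One way to see why the limit is N-1: an entropy-based measure needs the counts of the tokens that follow an n-gram, and those counts live in n-grams one token longer. The sketch below computes a plain forward branching entropy under that assumption. The function names are hypothetical, sentence boundaries and any normalization are ignored, and it is not ELeVE's autonomy formula, so it will not reproduce the values shown above.

```python
import math
from collections import Counter


def count_ngrams(sentences, max_length):
    """Count every n-gram of length 1..max_length in the given sentences."""
    counts = Counter()
    for tokens in sentences:
        for start in range(len(tokens)):
            last = min(start + max_length, len(tokens))
            for end in range(start + 1, last + 1):
                counts[tuple(tokens[start:end])] += 1
    return counts


def forward_branching_entropy(counts, ngram):
    """Entropy (in bits) of the tokens seen right after `ngram`.

    Returns nan when no successor is stored, which is always the case
    for n-grams that already have the maximum stored length.
    """
    prefix = tuple(ngram)
    successors = [
        c for key, c in counts.items()
        if len(key) == len(prefix) + 1 and key[:len(prefix)] == prefix
    ]
    total = sum(successors)
    if total == 0:
        return float("nan")
    return -sum((c / total) * math.log2(c / total) for c in successors)


sentences = [
    ["very", "small", "black", "cat"],
    ["super", "small", "red", "cat"],
    ["big", "black", "cat"],
    ["crazy", "dog"],
]
counts = count_ngrams(sentences, max_length=3)
```

In this toy model "small" is followed once by "black" and once by "red", so its entropy is one bit, while a trigram has no stored successors and yields nan.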

Saving and loading

Warning

For now it is not possible to save or restore a memory storage. It must be re-trained each time. This will change in the near future!

Python and C++ implementations

Two memory storages are provided: eleve.memory.MemoryStorage and eleve.c_memory.cmemory.MemoryStorage. The former is pure Python and provides the reference API; the latter is written in C++ and is much more efficient.

In practice, only the C++ one should be used. The best practice is to use eleve.MemoryStorage, which is an alias to the C++ implementation and falls back to the pure-Python one if the C++ extension failed to compile.
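The fallback behaviour described here is a common Python pattern: try the compiled module first and use the pure-Python one if the import fails. The sketch below shows the general idea. The first_available helper is hypothetical and is not how ELeVE defines its alias; standard-library names are used in the call so that it runs without ELeVE installed.

```python
import importlib


def first_available(candidates):
    """Return the first importable attribute from (module, attribute) pairs."""
    for module_name, attribute_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attribute_name)
        except (ImportError, AttributeError):
            continue  # try the next candidate in the list
    raise ImportError("no candidate implementation could be imported")


# With ELeVE, the pairs would be, in order of preference:
#   ("eleve.c_memory.cmemory", "MemoryStorage"), ("eleve.memory", "MemoryStorage")
# Stand-in names are used below so the sketch runs anywhere.
chosen = first_available([
    ("module_that_does_not_exist", "MemoryStorage"),
    ("collections", "OrderedDict"),
])
```

Here the first candidate cannot be imported, so the second one is returned, just as the alias returns the pure-Python class when the C++ build is missing.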

Note

If you want to explicitly import and use the Python or the C++ memory storage, you can import them with the following aliases:

>>> from eleve import PyMemoryStorage, CMemoryStorage
>>> PyMemoryStorage
<class 'eleve.memory.MemoryStorage'>
>>> CMemoryStorage
<class 'eleve.c_memory.cmemory.MemoryStorage'>

Disk Storage (Leveldb)

ELeVE provides on-disk storages. They are much slower than the memory ones but are not limited by memory size. As everything is stored on disk, they are persistent and can be reopened later without re-training. The on-disk storage internally uses LevelDB to store the model.

Note

To use a disk storage you should import eleve.LeveldbStorage:

>>> from eleve import LeveldbStorage
>>> hdd_storage = LeveldbStorage(path="./tmp_storage")

It is an alias to the best available disk storage (should be eleve.c_leveldb.cleveldb.LeveldbStorage).

Use this storage in two cases:

  • If you want to create a model from a huge training corpus that doesn’t fit in RAM.
  • If you don’t want to re-train your model on a corpus every time you use it. Be aware that it may still be faster to re-train it in RAM each time, because query time is higher for the LevelDB storage.

Warning

You can’t create more than one instance of a storage for a given path. LevelDB uses locking, so if two processes try to access the same database, the second will fail.

The API is the same as for the Memory storage. Only the constructor changes.

Construction, save, load and clear

The disk storage constructor takes a default_ngram_length parameter, like the memory storage. It also needs a path where the model data will be stored on disk:

>>> from eleve import LeveldbStorage
>>> hdd_storage = LeveldbStorage("./tmp_storage", default_ngram_length=3)

Then, everything works the same as with the memory storage:

>>> hdd_storage.add_sentence(["very", "small", "black", "cat"])
>>> hdd_storage.add_sentence(["super", "small", "red", "cat"])
>>> hdd_storage.add_sentence(["big", "black", "cat"])
>>> hdd_storage.add_sentence(["crazy", "dog"])
>>> hdd_storage.query_count(["black", "cat"])
2
>>> hdd_storage.query_count(["very", "small", "black", "cat"])
0
>>> hdd_storage.query_autonomy(["black", "cat"])
1.9537...
>>> hdd_storage.query_autonomy(["small", "black"])
-0.1965...
>>> hdd_storage.query_autonomy(["small", "black", "cat"])
nan

It is possible to open a storage from an existing path on the disk:

>>> hdd_storage.close() # a path cannot be opened twice, so close it first
>>>
>>> hdd_storage2 = LeveldbStorage("./tmp_storage")
>>> hdd_storage2.query_autonomy(["black", "cat"])
1.9537...
>>> hdd_storage2.query_autonomy(["small", "black"])
-0.1965...

Note that there is no method for saving, as everything is saved and modified on the fly.
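Since a path can only be held by one open storage at a time, it helps to guarantee that close() runs even when an error occurs. A minimal sketch using contextlib.closing from the standard library follows. StubStorage is a hypothetical stand-in that exposes only close(), so that the example runs without ELeVE; with a real disk storage you would wrap the LeveldbStorage instance instead.

```python
from contextlib import closing


class StubStorage:
    """Hypothetical stand-in exposing only close(), so the sketch runs without ELeVE."""

    def __init__(self, path):
        self.path = path
        self.closed = False

    def close(self):
        self.closed = True


stub = StubStorage("./tmp_storage")
try:
    # closing() calls stub.close() when the block exits, even on an exception.
    with closing(stub) as storage:
        raise RuntimeError("simulated failure while querying")
except RuntimeError:
    pass
```

After the block, the stand-in has been closed despite the exception, so a second instance could then open the same path.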

Finally, if you want to remove everything that was added to a storage (that is, clear it):

>>> hdd_storage2.clear()
>>> hdd_storage2.query_autonomy(["black", "cat"])
nan

Python and C++ implementations

Two implementations of the disk storage are provided: eleve.leveldb.LeveldbStorage and eleve.c_leveldb.cleveldb.LeveldbStorage. The former is written in Python (for the ELeVE parts) and uses plyvel, a generic LevelDB wrapper; the latter is fully written in C++ and directly uses the LevelDB C++ API.

The C++ version is a bit faster and more efficient than the Python version.

Note

If you want to explicitly import and use the Python or the C++ implementation of the disk storage, you can import them with the following aliases:

>>> from eleve import PyLeveldbStorage, CLeveldbStorage
>>> PyLeveldbStorage
<class 'eleve.leveldb.LeveldbStorage'>
>>> CLeveldbStorage
<class 'eleve.c_leveldb.cleveldb.LeveldbStorage'>