ELeVE documentation
Welcome to ELeVE’s documentation. After the table of contents that follows, you will find documentation to help you get started (installation, basic usage, ...).
What is ELeVE?
ELeVE is a library for calculating a specialized language model from a corpus of text.
It lets you use statistics from the training corpus to calculate the branching entropy and autonomy measures of n-grams of text. See [MagistrySagot2012] for a definition of these terms (autonomy is also called “nVBE”, for “normalized variation of branching entropy”).
It was mainly developed for the segmentation of Mandarin Chinese, but has also been used successfully in research on other tasks such as keyphrase extraction.
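To make the notion of branching entropy more concrete, here is a minimal pure-Python sketch that computes the entropy of the tokens following a given n-gram in a toy corpus. It does not use ELeVE and is not its implementation: the function name `right_branching_entropy` is hypothetical, and treating the end of a sentence as one more possible successor is an assumption of this sketch. The autonomy measure builds on this quantity; see [MagistrySagot2012] for how it is derived and normalized.

```python
import math
from collections import Counter


def right_branching_entropy(sentences, ngram):
    """Shannon entropy (in bits) of the token that follows `ngram`.

    Hypothetical helper, not part of ELeVE. The end of a sentence is
    counted as a possible successor, represented by None (an assumption).
    """
    ngram = tuple(ngram)
    n = len(ngram)
    successors = Counter()
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            if tuple(sentence[i:i + n]) == ngram:
                nxt = sentence[i + n] if i + n < len(sentence) else None
                successors[nxt] += 1
    total = sum(successors.values())
    if total == 0:
        # The n-gram never occurs in the corpus: no successor distribution.
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in successors.values())


corpus = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]

# "New York" is followed once by "city" and once by "is": two equally
# likely successors, so the entropy is high.
print(right_branching_entropy(corpus, ["New", "York"]))
# "New" is always followed by "York": the successor is fully predictable.
print(right_branching_entropy(corpus, ["New"]))
```

Intuitively, a low entropy inside a candidate unit ("New" is always followed by "York") combined with a high entropy at its boundary (many things can follow "New York") suggests that the n-gram behaves like an autonomous unit.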
In a nutshell
Here is a simple “getting started”. First you have to train a model:
>>> from eleve import MemoryStorage
>>> storage = MemoryStorage()
>>> # Then the training itself:
>>> storage.add_sentence(["I", "like", "New", "York", "city"])
>>> storage.add_sentence(["I", "like", "potatoes"])
>>> storage.add_sentence(["potatoes", "are", "fine"])
>>> storage.add_sentence(["New", "York", "is", "a", "fine", "city"])
And then you can query it:
>>> storage.query_autonomy(["New", "York"])
>>> storage.query_autonomy(["like", "potatoes"])
ELeVE also stores n-gram occurrence counts:
>>> storage.query_count(["New", "York"])
>>> storage.query_count(["New", "potatoes"])
>>> storage.query_count(["I", "like", "potatoes"])
>>> storage.query_count(["potatoes"])
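As an illustration of what an n-gram occurrence count means, the sketch below counts every contiguous n-gram of the same four training sentences with the standard library only. It is not how ELeVE stores its counts: `count_ngrams` and its `max_n` parameter are hypothetical names, and the real storage may treat sentence boundaries differently.

```python
from collections import Counter


def count_ngrams(sentences, max_n):
    """Count every contiguous n-gram of length 1..max_n (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[tuple(sentence[i:i + n])] += 1
    return counts


corpus = [
    ["I", "like", "New", "York", "city"],
    ["I", "like", "potatoes"],
    ["potatoes", "are", "fine"],
    ["New", "York", "is", "a", "fine", "city"],
]

counts = count_ngrams(corpus, max_n=3)
print(counts[("New", "York")])           # 2: appears in two sentences
print(counts[("New", "potatoes")])       # 0: never appears contiguously
print(counts[("I", "like", "potatoes")])  # 1
print(counts[("potatoes",)])             # 2
```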
Then, you can use it for segmentation:
>>> from eleve import Segmenter
>>> s = Segmenter(storage)
>>> # segment up to 4-grams, if we used the same storage as before.
>>> s.segment(["What", "do", "you", "know", "about", "New", "York"])
[['What'], ['do'], ['you'], ['know'], ['about'], ['New', 'York']]
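One common way to turn per-n-gram scores into a segmentation is dynamic programming over all ways of cutting the sentence into chunks of bounded length. The sketch below illustrates that general idea only. Whether ELeVE's `Segmenter` uses exactly this objective is an assumption; `best_segmentation`, `toy_score` and the scores in `toy_scores` are hypothetical and are not values returned by `query_autonomy`.

```python
def best_segmentation(tokens, score, max_len=4):
    """Split `tokens` into chunks of at most `max_len` tokens so that the
    sum of `score(chunk)` is maximal (hypothetical helper)."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for tokens[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # back[i]: start of the last chunk ending at i
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(tokens[start:end])
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the chunks.
    chunks = []
    end = n
    while end > 0:
        start = back[end]
        chunks.append(tokens[start:end])
        end = start
    chunks.reverse()
    return chunks


# Made-up scores for illustration: one strongly autonomous bigram,
# neutral single tokens, and a penalty for any other multi-token chunk.
toy_scores = {("New", "York"): 2.0}


def toy_score(chunk):
    default = 0.0 if len(chunk) == 1 else -1.0
    return toy_scores.get(tuple(chunk), default)


sentence = ["What", "do", "you", "know", "about", "New", "York"]
print(best_segmentation(sentence, toy_score))
```

With these toy scores the sketch keeps "New" and "York" together and leaves every other token on its own, which is the same grouping as in the segmentation shown above.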
You will need some dependencies. On Ubuntu:
$ sudo apt-get install libboost-python-dev libboost-filesystem-dev libleveldb-dev
Then to install eleve:
$ pip install eleve
or, if you have a local clone of the source folder:
$ python setup.py install
Install the development environment:
$ git clone https://github.com/kodexlab/eleve
$ cd eleve
$ virtualenv ENV -p /usr/bin/python3
$ source ENV/bin/activate
$ pip install -r requirements.txt
$ pip install -r requirements.dev.txt
Pull requests are welcome!
To run tests:
$ make testall
To build the doc:
$ make doc
then open: docs/_build/html/index.html
Warning: You need to have eleve accessible in the Python path to run the tests (and to build the doc).
To do that, you can install eleve as a link in your local virtualenv:
$ pip install -e .
(Note: this is what pytest's good practices recommend.)
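A quick way to check that eleve is accessible in the Python path before running the tests is to ask the import machinery whether it can locate the package. This is only a convenience sketch using the standard library; the helper name `is_importable` is hypothetical and not part of the project.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be found on the Python path
    (hypothetical helper; it locates the module without importing it)."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once eleve is installed in the active environment,
# for example through the editable install described above.
print(is_importable("eleve"))
```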
If you use eleve for an academic publication, please cite this paper:
[MagistrySagot2012] Magistry, P., & Sagot, B. (2012, July). Unsupervized word segmentation: the case for Mandarin Chinese. In Proceedings of the 50th Annual Meeting of the ACL: Short Papers, Volume 2 (pp. 383-387). http://www.aclweb.org/anthology/P12-2075
Copyright, license and authors
Copyright (C) 2014-2015 Kodex⋅Lab.
eleve is available under the LGPL Version 3 license.
eleve was originally designed and prototyped by Pierre Magistry during his PhD. It was then completely rewritten by Korantin Auguste and Emmanuel Navarro (with Pierre's help).