ArabicLightStemmer: a class which provides a configurable stemmer and segmenter for Arabic text.
Contact: taha dot zerrouki at gmail dot com
Copyright: Arabtechies, Arabeyes, Taha Zerrouki
License: GPL
Date: 2017/02/15
Version: 0.3
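
| Return type | Attribute Functions |
|---|---|
| unicode | Return the prefixation letters. |
| | Set the prefixation letters. |
| unicode | Return the suffixation letters. |
| | Set the suffixation letters. |
| unicode | Return the infixation letters. |
| | Set the infixation letters. |
| unicode | Return the joker letter. |
| | Set the joker letter. |
| integer | Return the maximum prefix length used by the stemmer. |
| | Set the maximum prefix length used by the stemmer. |
| integer | Return the maximum suffix length used by the stemmer. |
| | Set the maximum suffix length used by the stemmer. |
| integer | Return the minimum stem length used by the stemmer. |
| | Set the minimum stem length used by the stemmer. |
| set() | Return the prefix list used by the stemmer. |
| | Set the prefix list used by the stemmer. |
| set() | Return the suffix list used by the stemmer. |
| | Set the suffix list used by the stemmer. |
| | Set the word to be treated by the stemmer. |
| unicode | Return the last word treated by the stemmer. |

| Return type | Calculated Attribute Functions |
|---|---|
| unicode | Return the star form of the treated word. |
| unicode | Return the root of the treated word. |
| unicode | Return the normalized form of the treated word. |
| unicode | Return the unvocalized form of the treated word. |
| integer | Return the left stemming position (prefix end position). |
| integer | Return the right stemming position (suffix start position). |
| unicode | Return the stem of the treated word. |
| unicode | Return the star form of the stem of the treated word. |
| unicode | Return the prefix of the treated word. |
| unicode | Return the suffix of the treated word. |
| unicode | Return the affix of the treated word. |
| dict | Return the affix tuple of the treated word. |

| Return type | Stemming Functions |
|---|---|
| unicode | Stem an Arabic word and return its stem. |
| tuple | Transform all non-affixation letters into a star. |
| unicode | Return the root of the treated word. |
| Tree structure | Create a prefix tree from a given list of prefixes. |
| Tree structure | Create a suffix tree from a given list of suffixes. |
| list of int | Look up prefixes in the word. |
| list of int | Look up suffixes in the word. |

| Return type | Segmentation Functions |
|---|---|
| set of tuple of integer | Generate all possible segmentation positions (left, right) of the treated word. |
| set of tuple of integer | Return the segmentation positions (left, right) of the treated word. |
| list of dict | Return a list of affix tuples of the treated word. |

| Return type | General Functions |
|---|---|
| unicode | Normalize a word. |
| list | Tokenize text into words. |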

Return the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Set the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Return the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Set the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Return the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Set the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Return the joker letter. The default value is DEFAULT_JOKER.

Set the joker letter. The default value is DEFAULT_JOKER.

Return the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Set the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Return the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Set the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Return the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Set the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Return the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Set the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Return the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Set the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Set the word to be treated by the stemmer.

Return the last word treated by the stemmer.

Return the star form of the word treated by the stemmer. All non-affix letters are converted to a joker. The joker is DEFAULT_JOKER = "*" by default. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضربونني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_starword()
    أفت***ونني

Return the root of the word treated by the stemmer. All non-affix letters are converted to a joker, and all letters in the joker positions are part of the root. The joker is DEFAULT_JOKER = "*" by default. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضربونني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_starword()
    أفت***ونني
    >>> print ArListem.get_root()
    ضرب

Return the normalized form of the word treated by the stemmer. Some letters, such as the hamza forms, are converted to a normalized form. Example:

    >>> word = u"استؤجرُ"
    >>> ArListem = ArabicLightStemmer()
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_normalized()
    استءجر

Return the unvocalized form of the word treated by the stemmer. Harakat (diacritics) are stripped. Example:

    >>> word = u"الْعَرَبِيّةُ"
    >>> ArListem = ArabicLightStemmer()
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_unvocalized()
    العربية
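
The two conversions above can be illustrated without the library. The following is a minimal sketch, not the stemmer's own code: the helper names `strip_harakat` and `normalize_hamza` are hypothetical, and the hamza mapping is an assumption that covers only hamza on waw (as in the example above) and hamza on yeh.

```python
import re

# Harakat (tanwin, short vowels, shadda, sukun) occupy U+064B..U+0652.
HARAKAT = re.compile("[\u064B-\u0652]")

# Assumed mapping: hamza carried on waw or yeh becomes a bare hamza.
HAMZA_FORMS = {"\u0624": "\u0621", "\u0626": "\u0621"}


def strip_harakat(word):
    """Remove diacritics, leaving only the base letters."""
    return HARAKAT.sub("", word)


def normalize_hamza(word):
    """Strip harakat, then unify the mapped hamza forms."""
    bare = strip_harakat(word)
    return "".join(HAMZA_FORMS.get(ch, ch) for ch in bare)


unvocalized = strip_harakat("الْعَرَبِيّةُ")   # العربية
normalized = normalize_hamza("استؤجرُ")       # استءجر
```

The real normalization may unify more letter forms than this sketch does; only the two mappings listed in `HAMZA_FORMS` are handled here.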

Return the left stemming position (the end position of the prefix) in the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضربونني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_starword()
    أفت***ونني
    >>> print ArListem.get_left()
    3

Return the right stemming position (the start position of the suffix) in the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضربونني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_starword()
    أفت***ونني
    >>> print ArListem.get_right()
    6

Return the stem of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتكاتبانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_stem()
    كاتب

Return the star form of the stem of the word treated by the stemmer. All non-affix letters are converted to a joker. The joker is DEFAULT_JOKER = "*" by default. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتكاتبانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_stem()
    كاتب
    >>> print ArListem.get_starstem()
    *ات*

Return the prefix of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتكاتبانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_prefix()
    أفت

Return the suffix of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتكاتبانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_suffix()
    انني

Return the affix of the word treated by the stemmer, in the form prefix-suffix. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتكاتبانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_affix()
    أفت-انني

Return the affix tuple of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضاربانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_affix_tuple()
    {'prefix': u'أفت', 'root': u'ضرب', 'suffix': u'انني', 'stem': u'ضارب'}

Stem an Arabic word and return its stem. The stemming positions (left, right) are stored in the instance, so other calculated attributes such as the stem, prefix, suffix, and root can be retrieved afterwards. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضاربانني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_stem()
    ضارب
    >>> print ArListem.get_starstem()
    *ا**
    >>> print ArListem.get_left()
    3
    >>> print ArListem.get_right()
    6
    >>> print ArListem.get_root()
    ضرب

Transform all non-affixation letters into a star. The star is a joker ('*' by default) indicating that the corresponding letter is an original letter. The stemmer uses this function to identify original letters. It returns the starred form together with the stemming positions (left, right). Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضاربانني'
    >>> starword, left, right = ArListem.transformToStrars(word)
    (أفت*ا**انني, 3, 6)
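
To make the star form concrete, here is a simplified sketch rather than the stemmer's actual algorithm. It assumes the stemming positions are already known, whereas the real function computes them itself. The helper name `star_word` and the `infix_letters` values are hypothetical.

```python
def star_word(word, left, right, infix_letters="", joker="*"):
    """Mask the original letters of the stem with a joker.

    word[:left] is the prefix, word[left:right] the stem and
    word[right:] the suffix. Stem letters listed in `infix_letters`
    are kept, every other stem letter becomes the joker.
    """
    stem = word[left:right]
    masked = "".join(ch if ch in infix_letters else joker for ch in stem)
    return word[:left] + masked + word[right:]


starred = star_word("أفتضربونني", 3, 6)                    # أفت***ونني
star_stem = star_word("كاتب", 0, 4, infix_letters="ات")    # *ات*
```

Passing the whole stem with `left=0` and `right=len(stem)` gives the star form of the stem alone, as in the second call.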

Return the root of the word treated by the stemmer. All non-affix letters are converted to a joker, and all letters in the joker positions are part of the root. The joker is DEFAULT_JOKER = "*" by default. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'أفتضربونني'
    >>> stem = ArListem.light_stem(word)
    >>> print ArListem.get_starword()
    أفت***ونني
    >>> print ArListem.get_root()
    ضرب
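
The rule that every letter in a joker position belongs to the root can be sketched in a few lines. This is an illustration under the assumption that the word and its star form have the same length (that is, the word is unvocalized); `root_from_star` is a hypothetical helper, not a library function.

```python
def root_from_star(word, starword, joker="*"):
    """Collect the letters of `word` that sit at joker positions."""
    return "".join(ch for ch, mark in zip(word, starword) if mark == joker)


root = root_from_star("أفتضاربانني", "أفت*ا**انني")   # ضرب
```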

Create a prefix tree from a given list of prefixes.

Create a suffix tree from a given list of suffixes.

Look up prefixes in the word.

Look up suffixes in the word.
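
One common way to implement such trees and lookups is a trie stored as nested dictionaries. The sketch below shows that technique only; it is not taken from the stemmer's source. All function names and the affix lists are hypothetical, and the suffix lookup assumes the tree was built from reversed suffixes.

```python
def build_tree(affixes):
    """Build a nested-dict trie; the key None marks the end of an affix."""
    tree = {}
    for affix in affixes:
        node = tree
        for ch in affix:
            node = node.setdefault(ch, {})
        node[None] = True
    return tree


def lookup_prefixes(word, tree):
    """Return every position where a known prefix of `word` ends."""
    positions = []
    node = tree
    for i, ch in enumerate(word):
        if ch not in node:
            break
        node = node[ch]
        if None in node:
            positions.append(i + 1)
    return positions


def lookup_suffixes(word, tree):
    """Return every position where a known suffix of `word` starts.

    `tree` must have been built from reversed suffixes.
    """
    ends = lookup_prefixes(word[::-1], tree)
    return [len(word) - end for end in ends]


prefix_tree = build_tree(["ف", "فت"])
suffix_tree = build_tree([s[::-1] for s in ["ين", "ن"]])
lefts = lookup_prefixes("فتضربين", prefix_tree)     # [1, 2]
rights = lookup_suffixes("فتضربين", suffix_tree)    # [6, 5]
```

The empty prefix and empty suffix (positions 0 and `len(word)`) are not reported by this sketch; a caller that wants them has to add them explicitly.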

Generate all possible segmentation positions (left, right) of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'فتضربين'
    >>> print ArListem.segment(word)
    set([(1, 5), (2, 5), (0, 7)])

Return the segmentation positions (left, right) of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'فتضربين'
    >>> ArListem.segment(word)
    >>> print ArListem.get_segment_list()
    set([(1, 5), (2, 5), (0, 7)])

Return a list of affix tuples of the word treated by the stemmer. Example:

    >>> ArListem = ArabicLightStemmer()
    >>> word = u'فتضربين'
    >>> ArListem.segment(word)
    >>> print ArListem.get_affix_list()
    [{'prefix': u'ف', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'تضرب'},
     {'prefix': u'فت', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'ضرب'},
     {'prefix': u'', 'root': u'فضربن', 'suffix': u'', 'stem': u'فتضربين'}]
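
The relation between the segmentation positions and the affix tuples in the example above can be reproduced by plain slicing. This is a sketch with a hypothetical helper `affix_list`; it leaves out the 'root' entry, which the stemmer derives separately from the star form.

```python
def affix_list(word, segments):
    """Split `word` at each (left, right) pair into prefix, stem, suffix."""
    return [
        {"prefix": word[:left], "stem": word[left:right], "suffix": word[right:]}
        for left, right in segments
    ]


entries = affix_list("فتضربين", [(1, 5), (2, 5), (0, 7)])
# entries[0] == {'prefix': 'ف', 'stem': 'تضرب', 'suffix': 'ين'}
# entries[1] == {'prefix': 'فت', 'stem': 'ضرب', 'suffix': 'ين'}
# entries[2] == {'prefix': '', 'stem': 'فتضربين', 'suffix': ''}
```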

Normalize a word. Convert some letter forms into a unified form.

Tokenize text into words.
Generated by Epydoc 3.0.1 on Wed Feb 15 13:46:35 2017 | http://epydoc.sourceforge.net |