Package tashaphyne :: Module stemming :: Class ArabicLightStemmer

Class ArabicLightStemmer

ArabicLightStemmer: a class which provides a configurable stemmer and segmenter for Arabic text.

Features:


Authors:
Taha Zerrouki <taha_zerrouki at gmail dot com>

Contact: taha dot zerrouki at gmail dot com

Copyright: Arabtechies, Arabeyes, Taha Zerrouki

License: GPL

Date: 2017/02/15

Version: 0.3

Instance Methods

__init__(self)
    Attribute Functions
get_prefix_letters(self)
    Return the prefixation letters. Returns: unicode.
set_prefix_letters(self, new_prefix_letters)
    Set the prefixation letters.
get_suffix_letters(self)
    Return the suffixation letters. Returns: unicode.
set_suffix_letters(self, new_suffix_letters)
    Set the suffixation letters.
get_infix_letters(self)
    Return the infixation letters. Returns: unicode.
set_infix_letters(self, new_infix_letters)
    Set the infixation letters.
get_joker(self)
    Return the joker letter. Returns: unicode.
set_joker(self, new_joker)
    Set the joker letter.
get_max_prefix_length(self)
    Return the maximum prefix length used by the stemmer. Returns: integer.
set_max_prefix_length(self, new_max_prefix_length)
    Set the maximum prefix length used by the stemmer.
get_max_suffix_length(self)
    Return the maximum suffix length used by the stemmer. Returns: integer.
set_max_suffix_length(self, new_max_suffix_length)
    Set the maximum suffix length used by the stemmer.
get_min_stem_length(self)
    Return the minimum stem length used by the stemmer. Returns: integer.
set_min_stem_length(self, new_min_stem_length)
    Set the minimum stem length used by the stemmer.
get_prefix_list(self)
    Return the prefix list used by the stemmer. Returns: set().
set_prefix_list(self, new_prefix_list)
    Set the prefix list used by the stemmer.
get_suffix_list(self)
    Return the suffix list used by the stemmer. Returns: set().
set_suffix_list(self, new_suffix_list)
    Set the suffix list used by the stemmer.
set_word(self, new_word)
    Set the word to be treated by the stemmer.
get_word(self)
    Return the last word treated by the stemmer. Returns: unicode.
    Calculated Attribute Functions
get_starword(self)
    Return the star form of the word treated by the stemmer. Returns: unicode.
get_root(self, prefix_index=-1, suffix_index=-1)
    Return the root of the word treated by the stemmer. Returns: unicode.
get_normalized(self)
    Return the normalized form of the word treated by the stemmer. Returns: unicode.
get_unvocalized(self)
    Return the unvocalized form of the word treated by the stemmer. Returns: unicode.
get_left(self)
    Return the left stemming position (where the prefix ends) in the word treated by the stemmer. Returns: integer.
get_right(self)
    Return the right stemming position (where the suffix starts) in the word treated by the stemmer. Returns: integer.
get_stem(self, prefix_index=-1, suffix_index=-1)
    Return the stem of the word treated by the stemmer. Returns: unicode.
get_starstem(self, prefix_index=-1, suffix_index=-1)
    Return the star form of the stem of the word treated by the stemmer. Returns: unicode.
get_prefix(self, prefix_index=-1)
    Return the prefix of the word treated by the stemmer. Returns: unicode.
get_suffix(self, suffix_index=-1)
    Return the suffix of the word treated by the stemmer. Returns: unicode.
get_affix(self, prefix_index=-1, suffix_index=-1)
    Return the affix of the word treated by the stemmer. Returns: unicode.
get_affix_tuple(self, prefix_index=-1, suffix_index=0)
    Return the affix tuple of the word treated by the stemmer. Returns: dict.
    Stemming Functions
light_stem(self, word)
    Stem an Arabic word and return its stem. Returns: unicode.
transform2stars(self, word)
    Transform all non-affixation letters into a star. Returns: tuple.
extract_root(self, prefix_index=-1, suffix_index=-1)
    Return the root of the word treated by the stemmer. Returns: unicode.
_create_prefix_tree(self, prefixes)
    Create a prefix tree from the given list of prefixes. Returns: tree structure.
_create_suffix_tree(self, suffixes)
    Create a suffix tree from the given list of suffixes. Returns: tree structure.
lookup_prefixes(self, word)
    Look up prefixes in the word. Returns: list of int.
lookup_suffixes(self, word)
    Look up suffixes in the word. Returns: list of int.
    Segmentation Functions
segment(self, word)
    Generate all possible segmentation positions (left, right) of the word treated by the stemmer. Returns: set of tuple of integer.
get_segment_list(self)
    Return the segmentation positions (left, right) of the word treated by the stemmer. Returns: set of tuple of integer.
get_affix_list(self)
    Return a list of affix tuples of the word treated by the stemmer. Returns: list of dict.
    General Functions
normalize(self, word=u'')
    Normalize a word. Returns: unicode.
tokenize(self, text=u'')
    Tokenize text into words. Returns: list.
Method Details

get_prefix_letters(self)

Return the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Returns: unicode.
the prefixation letters.

set_prefix_letters(self, new_prefix_letters)

Set the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Parameters:
  • new_prefix_letters (unicode) - letters that may be stripped from the start of a word, e.g. new_prefix_letters = u"وف".

get_suffix_letters(self)

Return the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Returns: unicode.
the suffixation letters.

set_suffix_letters(self, new_suffix_letters)

Set the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Parameters:
  • new_suffix_letters (unicode) - letters that may be stripped from the end of a word, e.g. new_suffix_letters = u"ةون".

get_infix_letters(self)

Return the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Returns: unicode.
the infixation letters.

set_infix_letters(self, new_infix_letters)

Set the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Parameters:
  • new_infix_letters (unicode) - letters that may be stripped from the middle of a word, e.g. new_infix_letters = u"أوي".

get_joker(self)

Return the joker letter. The default value is DEFAULT_JOKER.

Returns: unicode.
the joker letter.

set_joker(self, new_joker)

Set the joker letter. The default value is DEFAULT_JOKER.

Parameters:
  • new_joker (unicode) - the joker letter.

get_max_prefix_length(self)

Return the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Returns: integer.
the maximum prefix length.

set_max_prefix_length(self, new_max_prefix_length)

Set the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Parameters:
  • new_max_prefix_length (integer) - the new maximum prefix length.

get_max_suffix_length(self)

Return the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Returns: integer.
the maximum suffix length.

set_max_suffix_length(self, new_max_suffix_length)

Set the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Parameters:
  • new_max_suffix_length (integer) - the new maximum suffix length.

get_min_stem_length(self)

Return the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Returns: integer.
the minimum stem length.

set_min_stem_length(self, new_min_stem_length)

Set the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Parameters:
  • new_min_stem_length (integer) - the new minimum stem length.

get_prefix_list(self)

Return the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Returns: set().
the prefix list.

set_prefix_list(self, new_prefix_list)

Set the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Parameters:
  • new_prefix_list (set of unicode strings) - a set of prefixes.

get_suffix_list(self)

Return the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Returns: set().
the suffix list.

set_suffix_list(self, new_suffix_list)

Set the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Parameters:
  • new_suffix_list (set of unicode strings) - a set of suffixes.

set_word(self, new_word)

Set the word to be treated by the stemmer.

Parameters:
  • new_word (unicode) - the new word.

get_word(self)

Return the last word treated by the stemmer.

Returns: unicode.
the word.

get_starword(self)

Return the star form of the word treated by the stemmer. All non-affix letters are replaced by the joker, which defaults to DEFAULT_JOKER = "*".

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
Returns: unicode.
the star form of the word.

get_root(self, prefix_index=-1, suffix_index=-1)

Return the root of the word treated by the stemmer. All non-affix letters are replaced by the joker, and every letter standing in a joker position is part of the root. The joker defaults to DEFAULT_JOKER = "*".

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_root())
ضرب
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the root.

get_normalized(self)

Return the normalized form of the word treated by the stemmer. Some letters, such as the hamza forms, are converted into a unified form.

Example:

>>> word = u"استؤجرُ"
>>> ArListem = ArabicLightStemmer()
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_normalized())
استءجر
Returns: unicode.
the normalized word.
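
The kind of conversion described above can be illustrated with a character table. This is a minimal Python 3 sketch, not the library's normalization: the table below is an assumption chosen only to reproduce the single example shown (hamza carried on waw or yeh becomes a bare hamza, and harakat are dropped), and normalize_sketch is a hypothetical name.

```python
# Sketch only; this is NOT the library's normalization table.
# Assumption: hamza on waw (U+0624) or yeh (U+0626) maps to a bare hamza
# (U+0621), and the marks U+064B..U+0652 (harakat) are deleted.
NORMALIZE_TABLE = {0x0624: 0x0621, 0x0626: 0x0621}
NORMALIZE_TABLE.update({cp: None for cp in range(0x064B, 0x0653)})


def normalize_sketch(word):
    """Apply the assumed character table to word (hypothetical helper)."""
    return word.translate(NORMALIZE_TABLE)


normalized = normalize_sketch("استؤجرُ")
```

A real normalizer would likely cover more letter forms than these two; the table is kept small so that each entry can be checked by eye.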

get_unvocalized(self)

Return the unvocalized form of the word treated by the stemmer. Harakat are stripped.

Example:

>>> word = u"الْعَرَبِيّةُ"
>>> ArListem = ArabicLightStemmer()
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_unvocalized())
العربية
Returns: unicode.
the unvocalized word.
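
Stripping harakat amounts to filtering out a small set of combining marks. A minimal Python 3 sketch follows; the code point range U+064B to U+0652 is an assumption about which marks count as harakat, and strip_harakat is a hypothetical helper rather than a library function.

```python
# Sketch only: removes Arabic vowel marks by code point.
# Assumption: "harakat" means the marks U+064B..U+0652
# (tanween, fatha, damma, kasra, shadda, sukun).
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_harakat(word):
    """Return word with every character in HARAKAT removed."""
    return "".join(ch for ch in word if ch not in HARAKAT)


unvocalized = strip_harakat("الْعَرَبِيّةُ")
```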

get_left(self)

Return the left stemming position (the position where the prefix ends) in the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_left())
3
Returns: integer.
the left stemming position.

get_right(self)

Return the right stemming position (the position where the suffix starts) in the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_right())
6
Returns: integer.
the right stemming position.
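
The left and right positions can be read as slice boundaries. The Python 3 sketch below assumes they are ordinary zero-based slice indices, which fits the values 3 and 6 in the examples above but is not confirmed by this page; split_by_positions is a hypothetical helper, not part of the class.

```python
def split_by_positions(word, left, right):
    """Split word into (prefix, stem, suffix) at the two stemming positions.

    Assumes left and right are zero-based slice boundaries.
    """
    return word[:left], word[left:right], word[right:]


# Ten-letter word with a three-letter prefix and a four-letter suffix.
prefix, stem, suffix = split_by_positions("أفتضربونني", 3, 6)
```

Under that assumption the stem is always word[left:right], so its length is right - left.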

get_stem(self, prefix_index=-1, suffix_index=-1)

Return the stem of the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
كاتب
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the stem.

get_starstem(self, prefix_index=-1, suffix_index=-1)

Return the star form of the stem of the word treated by the stemmer. All non-affix letters are replaced by the joker, which defaults to DEFAULT_JOKER = "*".

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
كاتب
>>> print(ArListem.get_starstem())
*ات*
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the starred form of the stem.

get_prefix(self, prefix_index=-1)

Return the prefix of the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_prefix())
أفت
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
Returns: unicode.
the prefix.

get_suffix(self, suffix_index=-1)

Return the suffix of the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_suffix())
انني
Parameters:
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the suffix.

get_affix(self, prefix_index=-1, suffix_index=-1)

Return the affix of the word treated by the stemmer, written as the prefix and the suffix joined by a hyphen.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_affix())
أفت-انني
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the affix.

get_affix_tuple(self, prefix_index=-1, suffix_index=0)

Return the affix tuple of the word treated by the stemmer: a dict holding the prefix, root, suffix and stem.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_affix_tuple())
{'prefix': u'أفت', 'root': u'ضرب', 'suffix': u'انني', 'stem': u'ضارب'}
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: dict.
the affix tuple.

light_stem(self, word)

Stem an Arabic word and return its stem. This function stores the stemming positions (left, right) in the instance, so that other calculated attributes such as the stem, prefix, suffix and root can be retrieved afterwards.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
ضارب
>>> print(ArListem.get_starstem())
*ا**
>>> print(ArListem.get_left())
3
>>> print(ArListem.get_right())
6
>>> print(ArListem.get_root())
ضرب
Parameters:
  • word (unicode) - the input word.
Returns: unicode.
the stem.

transform2stars(self, word)

Transform all non-affixation letters into a star. The star is a joker ('*' by default) which indicates that the corresponding letter is an original letter. This function is used by the stemmer to identify original letters; it returns the starred form and the stemming positions (left, right).

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> starword, left, right = ArListem.transform2stars(word)
>>> print(starword)
أفت*ا**انني
>>> print((left, right))
(3, 6)
Parameters:
  • word (unicode) - the input word.
Returns: tuple.
(starword, left, right):
  • starword : the word with all original letters converted into a star.
  • left : the greatest possible left stemming position.
  • right : the greatest possible right stemming position.
extract_root(self, prefix_index=-1, suffix_index=-1)

Return the root of the word treated by the stemmer. All non-affix letters are replaced by the joker, and every letter standing in a joker position is part of the root. The joker defaults to DEFAULT_JOKER = "*".

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_root())
ضرب
Parameters:
  • prefix_index (integer) - the left stemming position. If -1, it is ignored and the default prefix length of the word is used.
  • suffix_index (integer) - the right stemming position. If -1, it is ignored and the default suffix position of the word is used.
Returns: unicode.
the root.
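
The rule that letters in joker positions form the root can be sketched by reading a stem and its star form side by side. This is a minimal Python 3 illustration; root_from_stars is a hypothetical helper, and the sample values are the stem and star stem shown in the light_stem example.

```python
def root_from_stars(stem, starstem, joker="*"):
    """Keep the letters of stem whose position in starstem holds the joker."""
    return "".join(ch for ch, mark in zip(stem, starstem) if mark == joker)


# Stem and star stem taken from the light_stem example on this page.
root = root_from_stars("ضارب", "*ا**")
```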

_create_prefix_tree(self, prefixes)

Create a prefix tree from the given list of prefixes.

Parameters:
  • prefixes (list of unicode) - list of prefixes.
Returns: tree structure.
the prefix tree.

_create_suffix_tree(self, suffixes)

Create a suffix tree from the given list of suffixes.

Parameters:
  • suffixes (list of unicode) - list of suffixes.
Returns: tree structure.
the suffix tree.

lookup_prefixes(self, word)

Look up prefixes in the word.

Parameters:
  • word (unicode) - the given word.
Returns: list of int.
list of prefix start positions.
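
A prefix tree makes this lookup a single walk over the word. The Python 3 sketch below uses a nested dict as the tree; the page does not specify the library's actual tree structure or the exact meaning of the returned positions, so build_trie, lookup_prefix_ends and the END marker are hypothetical, and the sketch reports the positions at which a stored prefix ends.

```python
END = "end"  # Multi-character marker key, so it never collides with a letter.


def build_trie(affixes):
    """Build a nested-dict trie; END marks the end of a stored affix."""
    root = {}
    for affix in affixes:
        node = root
        for ch in affix:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def lookup_prefix_ends(trie, word):
    """Return every position in word at which a stored prefix ends."""
    ends = []
    node = trie
    for i, ch in enumerate(word):
        if ch not in node:
            break
        node = node[ch]
        if END in node:
            ends.append(i + 1)
    return ends


trie = build_trie(["ف", "فت"])
ends = lookup_prefix_ends(trie, "فتضربين")
```

A suffix lookup could reuse the same two helpers on reversed strings.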

lookup_suffixes(self, word)

Look up suffixes in the word.

Parameters:
  • word (unicode) - the given word.
Returns: list of int.
list of suffix start positions.

segment(self, word)

Generate all possible segmentation positions (left, right) of the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> print(ArListem.segment(word))
set([(1, 5), (2, 5), (0, 7)])
Returns: set of tuple of integer.
the segmentation positions.
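
Segmentation can be pictured as pairing every prefix that matches the start of the word with every suffix that matches its end. The Python 3 sketch below is a simplification under stated assumptions: segment_sketch is a hypothetical helper, the minimum stem length of 2 is an illustrative value, and it enumerates every pairing, so it yields more candidates than the three shown in the example above. The library presumably applies further checks that this sketch does not attempt.

```python
def segment_sketch(word, prefixes, suffixes, min_stem_length=2):
    """Return the set of (left, right) pairs allowed by the affix lists.

    Hypothetical helper; the empty prefix and suffix are always tried.
    """
    positions = set()
    for prefix in set(prefixes) | {""}:
        if not word.startswith(prefix):
            continue
        for suffix in set(suffixes) | {""}:
            if not word.endswith(suffix):
                continue
            left = len(prefix)
            right = len(word) - len(suffix)
            # Discard splits that leave too short a stem.
            if right - left >= min_stem_length:
                positions.add((left, right))
    return positions


candidates = segment_sketch("فتضربين", {"ف", "فت"}, {"ين"})
```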

get_segment_list(self)

Return the segmentation positions (left, right) of the word treated by the stemmer.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> segments = ArListem.segment(word)
>>> print(ArListem.get_segment_list())
set([(1, 5), (2, 5), (0, 7)])
Returns: set of tuple of integer.
the segmentation positions.

get_affix_list(self)

Return a list of affix tuples of the word treated by the stemmer, one for each segmentation.

Example:

>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> segments = ArListem.segment(word)
>>> print(ArListem.get_affix_list())
[{'prefix': u'ف', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'تضرب'},
{'prefix': u'فت', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'ضرب'},
{'prefix': u'', 'root': u'فضربن', 'suffix': u'', 'stem': u'فتضربين'}]
Returns: list of dict.
list of affix tuples.

normalize(self, word=u'')

Normalize a word. Convert some letter forms into a unified form.

Parameters:
  • word (unicode) - the input word. If word is empty, the word stored in the instance is normalized.
Returns: unicode.
the normalized word.

tokenize(self, text=u'')

Tokenize text into words.

Parameters:
  • text (unicode) - the input text.
Returns: list.
list of words.