Class ArabicLightStemmer

Package tashaphyne, module stemming.

ArabicLightStemmer: a class that provides a configurable stemmer and segmenter for Arabic text.

Features:

Arabic word light stemming.
Root extraction.
Word segmentation.
Word normalization.
Default Arabic affix list.
A customizable light stemmer: stemmer options and data can be changed.
Data-independent stemmer.

Author: Taha Zerrouki <taha_zerrouki at gmail dot com>

Contact: taha dot zerrouki at gmail dot com

Copyright: Arabtechies, Arabeyes, Taha Zerrouki

License: GPL

Date: 2017/02/15

Version: 0.3

Instance Methods

__init__(self)

Attribute Functions

get_prefix_letters(self)
Return the prefixation letters. Returns unicode.

set_prefix_letters(self, new_prefix_letters)
Set the prefixation letters.

get_suffix_letters(self)
Return the suffixation letters. Returns unicode.

set_suffix_letters(self, new_suffix_letters)
Set the suffixation letters.

get_infix_letters(self)
Return the infixation letters. Returns unicode.

set_infix_letters(self, new_infix_letters)
Set the infixation letters.

get_joker(self)
Return the joker letter. Returns unicode.

set_joker(self, new_joker)
Set the joker letter.

get_max_prefix_length(self)
Return the maximum prefix length used by the stemmer. Returns integer.

set_max_prefix_length(self, new_max_prefix_length)
Set the maximum prefix length used by the stemmer.

get_max_suffix_length(self)
Return the maximum suffix length used by the stemmer. Returns integer.

set_max_suffix_length(self, new_max_suffix_length)
Set the maximum suffix length used by the stemmer.

get_min_stem_length(self)
Return the minimum stem length used by the stemmer. Returns integer.

set_min_stem_length(self, new_min_stem_length)
Set the minimum stem length used by the stemmer.

get_prefix_list(self)
Return the prefix list used by the stemmer. Returns set.

set_prefix_list(self, new_prefix_list)
Set the prefix list used by the stemmer.

get_suffix_list(self)
Return the suffix list used by the stemmer. Returns set.

set_suffix_list(self, new_suffix_list)
Set the suffix list used by the stemmer.

set_word(self, new_word)
Set the word to be treated by the stemmer.

get_word(self)
Return the last word treated by the stemmer. Returns unicode.

Calculated Attribute Functions

get_starword(self)
Return the star-like form of the word treated by the stemmer. Returns unicode.

get_root(self, prefix_index=-1, suffix_index=-1)
Return the root of the word treated by the stemmer. Returns unicode.

get_normalized(self)
Return the normalized form of the word treated by the stemmer. Returns unicode.

get_unvocalized(self)
Return the unvocalized form of the word treated by the stemmer. Returns unicode.

get_left(self)
Return the left stemming position (end of the prefix) in the word treated by the stemmer. Returns integer.

get_right(self)
Return the right stemming position (start of the suffix) in the word treated by the stemmer. Returns integer.

get_stem(self, prefix_index=-1, suffix_index=-1)
Return the stem of the word treated by the stemmer. Returns unicode.

get_starstem(self, prefix_index=-1, suffix_index=-1)
Return the star form of the stem of the word treated by the stemmer. Returns unicode.

get_prefix(self, prefix_index=-1)
Return the prefix of the word treated by the stemmer. Returns unicode.

get_suffix(self, suffix_index=-1)
Return the suffix of the word treated by the stemmer. Returns unicode.

get_affix(self, prefix_index=-1, suffix_index=-1)
Return the affix of the word treated by the stemmer. Returns unicode.

get_affix_tuple(self, prefix_index=-1, suffix_index=0)
Return the affix tuple of the word treated by the stemmer. Returns dict.

Stemming Functions

light_stem(self, word)
Stem an Arabic word and return its stem. Returns unicode.

transform2stars(self, word)
Transform all non-affixation letters into a star. Returns tuple.

extract_root(self, prefix_index=-1, suffix_index=-1)
Return the root of the word treated by the stemmer. Returns unicode.

_create_prefix_tree(self, prefixes)
Create a prefix tree from the given list of prefixes. Returns a tree structure.

_create_suffix_tree(self, suffixes)
Create a suffix tree from the given list of suffixes. Returns a tree structure.

lookup_prefixes(self, word)
Look up prefixes in the word. Returns list of int.

lookup_suffixes(self, word)
Look up suffixes in the word. Returns list of int.

Segmentation Functions

segment(self, word)
Generate all possible segmentation positions (left, right) of the word treated by the stemmer. Returns set of tuple of integer.

get_segment_list(self)
Return the segmentation positions (left, right) of the word treated by the stemmer. Returns set of tuple of integer.

get_affix_list(self)
Return a list of affix tuples of the word treated by the stemmer. Returns list of dict.

General Functions

normalize(self, word=u'')
Normalize a word. Returns unicode.

tokenize(self, text=u'')
Tokenize text into words. Returns list.

Method Details


get_prefix_letters(self)

Return the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Returns: unicode. The prefixation letters.

set_prefix_letters(self, new_prefix_letters)

Set the prefixation letters. The default value is DEFAULT_PREFIX_LETTERS.

Parameters:

new_prefix_letters (unicode) - letters that may be stripped from the start of a word, e.g. new_prefix_letters = u"وف".

get_suffix_letters(self)

Return the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Returns: unicode. The suffixation letters.

set_suffix_letters(self, new_suffix_letters)

Set the suffixation letters. The default value is DEFAULT_SUFFIX_LETTERS.

Parameters:

new_suffix_letters (unicode) - letters that may be stripped from the end of a word, e.g. new_suffix_letters = u"ةون".

get_infix_letters(self)

Return the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Returns: unicode. The infixation letters.

set_infix_letters(self, new_infix_letters)

Set the infixation letters. The default value is DEFAULT_INFIX_LETTERS.

Parameters:

new_infix_letters (unicode) - letters that may be stripped from the middle of a word, e.g. new_infix_letters = u"أوي".

get_joker(self)

Return the joker letter. The default value is DEFAULT_JOKER.

Returns: unicode. The joker letter.

set_joker(self, new_joker)

Set the joker letter. The default value is DEFAULT_JOKER.

Parameters:

new_joker (unicode) - the joker letter.

get_max_prefix_length(self)

Return the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Returns: integer. The maximum prefix length.

set_max_prefix_length(self, new_max_prefix_length)

Set the maximum prefix length used by the stemmer. The default value is DEFAULT_MAX_PREFIX_LENGTH.

Parameters:

new_max_prefix_length (integer) - the new maximum prefix length.

get_max_suffix_length(self)

Return the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Returns: integer. The maximum suffix length.

set_max_suffix_length(self, new_max_suffix_length)

Set the maximum suffix length used by the stemmer. The default value is DEFAULT_MAX_SUFFIX_LENGTH.

Parameters:

new_max_suffix_length (integer) - the new maximum suffix length.

get_min_stem_length(self)

Return the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Returns: integer. The minimum stem length.

set_min_stem_length(self, new_min_stem_length)

Set the minimum stem length used by the stemmer. The default value is DEFAULT_MIN_STEM_LENGTH.

Parameters:

new_min_stem_length (integer) - the new minimum stem length.
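
As a rough illustration of how the three length constants could interact, the sketch below checks whether a candidate (left, right) split respects them. The function name is_valid_segmentation and the numeric defaults are assumptions made for this example only; they are not taken from the library, whose real defaults are the DEFAULT_* constants named above.

```python
def is_valid_segmentation(word_length, left, right,
                          max_prefix_length=4,
                          max_suffix_length=6,
                          min_stem_length=2):
    """Check a (left, right) split against the length constraints.

    The default values are placeholders for this sketch only.
    """
    prefix_length = left
    suffix_length = word_length - right
    stem_length = right - left
    return (prefix_length <= max_prefix_length
            and suffix_length <= max_suffix_length
            and stem_length >= min_stem_length)


# A 10-letter word split into a 3-letter prefix, 3-letter stem, 4-letter suffix.
accepted = is_valid_segmentation(10, 3, 6)
# The same word with a 5-letter prefix, which leaves a 1-letter stem.
rejected = is_valid_segmentation(10, 5, 6)
```

Raising the minimum stem length or lowering the maximum affix lengths makes such a check stricter, so fewer splits would survive.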

get_prefix_list(self)

Return the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Returns: set. The prefix list.

set_prefix_list(self, new_prefix_list)

Set the prefix list used by the stemmer. The default value is DEFAULT_PREFIX_LIST.

Parameters:

new_prefix_list (set of unicode strings) - a set of prefixes.

get_suffix_list(self)

Return the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Returns: set. The suffix list.

set_suffix_list(self, new_suffix_list)

Set the suffix list used by the stemmer. The default value is DEFAULT_SUFFIX_LIST.

Parameters:

new_suffix_list (set of unicode strings) - a set of suffixes.

set_word(self, new_word)

Set the word to be treated by the stemmer.

Parameters:

new_word (unicode) - the new word.

get_word(self)

Return the last word treated by the stemmer.

Returns: unicode. The word.

get_starword(self)

Return the star-like form of the word treated by the stemmer. All non-affix letters are converted to a joker. The default joker is DEFAULT_JOKER = "*".

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني

Returns: unicode. The star form of the word.

get_root(self, prefix_index=-1, suffix_index=-1)

Return the root of the word treated by the stemmer. All non-affix letters are converted to a joker, and every letter in a joker position is part of the root. The default joker is DEFAULT_JOKER = "*".

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_root())
ضرب

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The root.

get_normalized(self)

Return the normalized form of the word treated by the stemmer. Some letters, such as the hamza forms, are converted to a normalized form.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> word = u"استؤجرُ"
>>> ArListem = ArabicLightStemmer()
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_normalized())
استءجر

Returns: unicode. The normalized word.
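
The kind of normalization shown in this example can be sketched in a few lines. This is not the library's implementation: the function name normalize_sketch and the hamza mapping are assumptions, chosen only so that the sketch reproduces the documented output for this one word.

```python
import re

# Arabic vowel marks, from U+064B (fathatan) to U+0652 (sukun).
HARAKAT = re.compile(u"[\u064B-\u0652]")

# Assumed mapping for this sketch: hamza carried on waw or yeh -> bare hamza.
HAMZA_FORMS = {
    u"\u0624": u"\u0621",  # WAW WITH HAMZA ABOVE -> HAMZA
    u"\u0626": u"\u0621",  # YEH WITH HAMZA ABOVE -> HAMZA
}


def normalize_sketch(word):
    """Strip the vowel marks, then unify the hamza forms listed above."""
    unvocalized = HARAKAT.sub(u"", word)
    return u"".join(HAMZA_FORMS.get(letter, letter) for letter in unvocalized)


# The word from the example, written with escapes to avoid ambiguity.
word = u"\u0627\u0633\u062A\u0624\u062C\u0631\u064F"
normalized = normalize_sketch(word)
```

The library may unify further letter forms; the mapping here covers only what the example needs.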

get_unvocalized(self)

Return the unvocalized form of the word treated by the stemmer. Harakat (vowel marks) are stripped.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> word = u"الْعَرَبِيّةُ"
>>> ArListem = ArabicLightStemmer()
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_unvocalized())
العربية

Returns: unicode. The unvocalized word.
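
One generic way to strip harakat, independent of this library, relies on the fact that they are Unicode combining marks. The sketch below uses the standard unicodedata module; the function name strip_harakat is an assumption, and the approach removes every combining mark, which may be broader than what the stemmer itself does.

```python
import unicodedata


def strip_harakat(word):
    """Drop every character that has a non-zero Unicode combining class."""
    return u"".join(ch for ch in word if unicodedata.combining(ch) == 0)


# The vocalized word from the example, written with escapes.
vocalized = (u"\u0627\u0644\u0652\u0639\u064E\u0631\u064E"
             u"\u0628\u0650\u064A\u0651\u0629\u064F")
unvocalized = strip_harakat(vocalized)
```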

get_left(self)

Return the left stemming position (the end position of the prefix) in the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_left())
3

Returns: integer. The left stemming position.

get_right(self)

Return the right stemming position (the start position of the suffix) in the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_right())
6

Returns: integer. The right stemming position.

get_stem(self, prefix_index=-1, suffix_index=-1)

Return the stem of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
كاتب

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The stem.

get_starstem(self, prefix_index=-1, suffix_index=-1)

Return the star form of the stem of the word treated by the stemmer. All non-affix letters are converted to a joker. The default joker is DEFAULT_JOKER = "*".

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
كاتب
>>> print(ArListem.get_starstem())
*ات*

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The starred form of the stem.

get_prefix(self, prefix_index=-1)

Return the prefix of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_prefix())
أفت

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.

Returns: unicode. The prefix.

get_suffix(self, suffix_index=-1)

Return the suffix of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_suffix())
انني

Parameters:

suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The suffix.

get_affix(self, prefix_index=-1, suffix_index=-1)

Return the affix of the word treated by the stemmer, written as the prefix and the suffix joined by a hyphen.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتكاتبانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_affix())
أفت-انني

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The affix.
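
The relation between the two stemming positions and the prefix, stem, suffix and affix strings can be pictured with plain slicing. This is a sketch under assumptions: split_word is a hypothetical helper, and the positions 3 and 7 are chosen by hand to match the documented prefix and suffix of this word rather than computed by the stemmer.

```python
def split_word(word, left, right):
    """Cut a word into (prefix, stem, suffix) at the two stemming positions."""
    return word[:left], word[left:right], word[right:]


word = u"أفتكاتبانني"
# Hand-picked positions for this illustration: 3-letter prefix, 4-letter suffix.
prefix, stem, suffix = split_word(word, 3, 7)
# The affix string joins the prefix and the suffix with a hyphen.
affix = prefix + u"-" + suffix
```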

get_affix_tuple(self, prefix_index=-1, suffix_index=0)

Return the affix tuple of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_affix_tuple())
{'prefix': u'أفت', 'root': u'ضرب', 'suffix': u'انني', 'stem': u'ضارب'}

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: dict. The affix tuple, with the keys prefix, root, suffix and stem.
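
A dictionary of this shape can be built from a pair of stemming positions, as the following sketch suggests. It is not the library's code: affix_tuple_sketch and the INFIX_LETTERS value are hypothetical, and the infix letters were picked only so that the result matches the roots shown in this documentation.

```python
# Hypothetical infix letters for this sketch; the library's
# DEFAULT_INFIX_LETTERS may well differ.
INFIX_LETTERS = u"اتوي"


def affix_tuple_sketch(word, left, right, infix_letters=INFIX_LETTERS):
    """Build a prefix/stem/suffix/root dictionary from two positions."""
    stem = word[left:right]
    # Approximate the root by dropping the infix letters from the stem.
    root = u"".join(letter for letter in stem if letter not in infix_letters)
    return {
        "prefix": word[:left],
        "root": root,
        "suffix": word[right:],
        "stem": stem,
    }


word = u"أفتضاربانني"
# Positions chosen by hand so that the stem is the 4 letters after the prefix.
result = affix_tuple_sketch(word, 3, 7)
```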

light_stem(self, word)

Stemming function: stem an Arabic word and return its stem. This function stores the stemming positions (left, right) in the instance, so that other calculated attributes such as the stem, prefix, suffix and root can be retrieved afterwards.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_stem())
ضارب
>>> print(ArListem.get_starstem())
*ا**
>>> print(ArListem.get_left())
3
>>> print(ArListem.get_right())
6
>>> print(ArListem.get_root())
ضرب

Parameters:

word (unicode) - the input word.

Returns: unicode. The stem.

transform2stars(self, word)

Transform all non-affixation letters into a star. The star is a joker (by default '*') which indicates that the corresponding letter is an original letter. This function is used by the stemmer to identify original letters; it returns the starred form and the stemming positions (left, right).

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضاربانني'
>>> starword, left, right = ArListem.transform2stars(word)
(أفت*ا**انني, 3, 6)

Parameters:

word (unicode) - the input word.

Returns: tuple.

(starword, left, right):

starword: the word with all original letters converted into a star.
left: the greatest possible left stemming position.
right: the greatest possible right stemming position.
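
To make the idea concrete, here is a minimal sketch of a star transformation. It is a simplification under stated assumptions, not the library's algorithm: transform_to_stars_sketch is a hypothetical name, the three letter sets passed in are invented for the example, and the maximum prefix and suffix lengths are ignored.

```python
def transform_to_stars_sketch(word, prefix_letters, suffix_letters,
                              infix_letters, joker=u"*"):
    """Return (starword, left, right) for a word.

    left grows while the leading letters are prefix letters, right shrinks
    while the trailing letters are suffix letters, and every letter in
    between that is not an infix letter is replaced by the joker.
    """
    left = 0
    while left < len(word) and word[left] in prefix_letters:
        left += 1
    right = len(word)
    while right > left and word[right - 1] in suffix_letters:
        right -= 1
    middle = u"".join(letter if letter in infix_letters else joker
                      for letter in word[left:right])
    return word[:left] + middle + word[right:], left, right


word = u"أفتضربونني"
# Hypothetical letter sets, chosen only for this illustration.
starword, left, right = transform_to_stars_sketch(word, u"أفت", u"وني", u"ا")
```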

extract_root(self, prefix_index=-1, suffix_index=-1)

Return the root of the word treated by the stemmer. All non-affix letters are converted to a joker, and every letter in a joker position is part of the root. The default joker is DEFAULT_JOKER = "*".

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'أفتضربونني'
>>> stem = ArListem.light_stem(word)
>>> print(ArListem.get_starword())
أفت***ونني
>>> print(ArListem.get_root())
ضرب

Parameters:

prefix_index (integer) - the left stemming position; if -1, it is ignored and the default prefix length of the word is used.
suffix_index (integer) - the right stemming position; if -1, it is ignored and the default suffix position of the word is used.

Returns: unicode. The root.

_create_prefix_tree(self, prefixes)

Create a prefix tree from the given list of prefixes.

Parameters:

prefixes (list of unicode) - list of prefixes.

Returns: Tree structure. The prefix tree.

_create_suffix_tree(self, suffixes)

Create a suffix tree from the given list of suffixes.

Parameters:

suffixes (list of unicode) - list of suffixes.

Returns: Tree structure. The suffix tree.

lookup_prefixes(self, word)

Look up prefixes in the word.

Parameters:

word (unicode) - the given word.

Returns: list of int. The list of prefix start positions.

lookup_suffixes(self, word)

Look up suffixes in the word.

Parameters:

word (unicode) - the given word.

Returns: list of int. The list of suffix start positions.
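
A prefix tree of this kind is commonly stored as nested dictionaries. The sketch below shows one such layout and a lookup that walks it; the END marker, both function names, the sample prefixes and the choice to report the position just past each matched prefix are all assumptions, since this page does not describe the library's internal tree format.

```python
# Hypothetical marker key meaning "a complete prefix ends at this node".
END = None


def create_prefix_tree_sketch(prefixes):
    """Build a nested-dict tree with one level per letter."""
    tree = {}
    for prefix in prefixes:
        node = tree
        for letter in prefix:
            node = node.setdefault(letter, {})
        node[END] = True
    return tree


def lookup_prefixes_sketch(tree, word):
    """Return the positions in word where a stored prefix ends."""
    positions = []
    node = tree
    for index, letter in enumerate(word):
        if letter not in node:
            break
        node = node[letter]
        if END in node:
            positions.append(index + 1)
    return positions


tree = create_prefix_tree_sketch([u"ف", u"فت", u"ال"])
positions = lookup_prefixes_sketch(tree, u"فتضربين")
```

A suffix lookup could reuse the same structure by inserting the suffixes reversed and walking the word from its last letter.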

segment(self, word)

Generate all possible segmentation positions (left, right) of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> print(ArListem.segment(word))
set([(1, 5), (2, 5), (0, 7)])

Returns: set of tuple of integer. The segmentation positions.

get_segment_list(self)

Return the segmentation positions (left, right) of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> ArListem.segment(word)
>>> print(ArListem.get_segment_list())
set([(1, 5), (2, 5), (0, 7)])

Returns: set of tuple of integer. The segmentation positions.
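
One simple way to enumerate such positions is to pair every matching prefix with every matching suffix. The sketch below does that under assumptions: segment_sketch, the sample affix lists and the minimum stem length of 2 are invented for the illustration, and the library may apply additional rules.

```python
def segment_sketch(word, prefixes, suffixes, min_stem_length=2):
    """Return a set of (left, right) positions for a word."""
    lefts = [len(p) for p in prefixes if word.startswith(p)]
    rights = [len(word) - len(s) for s in suffixes if word.endswith(s)]
    # The unsegmented word is always kept as a candidate.
    segments = {(0, len(word))}
    for left in lefts:
        for right in rights:
            if right - left >= min_stem_length:
                segments.add((left, right))
    return segments


# Hypothetical affix lists, chosen for this illustration only.
segments = segment_sketch(u"فتضربين", [u"ف", u"فت", u"ال"], [u"ين", u"ون"])
```

With these sample lists the sketch yields the same three positions as the example above.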

get_affix_list(self)

Return a list of affix tuples of the word treated by the stemmer.

Example:

>>> from tashaphyne.stemming import ArabicLightStemmer
>>> ArListem = ArabicLightStemmer()
>>> word = u'فتضربين'
>>> ArListem.segment(word)
>>> print(ArListem.get_affix_list())
[{'prefix': u'ف', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'تضرب'},
{'prefix': u'فت', 'root': u'ضرب', 'suffix': u'ين', 'stem': u'ضرب'},
{'prefix': u'', 'root': u'فضربن', 'suffix': u'', 'stem': u'فتضربين'}]

Returns: list of dict. The list of affix tuples.

normalize(self, word=u'')

Normalize a word. Some letter forms are converted into a unified form.

Parameters:

word (unicode) - the input word; if it is empty, the word stored in the instance is normalized.

Returns: unicode. The normalized word.

tokenize(self, text=u'')

Tokenize text into words.

Parameters:

text (unicode) - the input text.

Returns: list. The list of words.
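
For readers who want a feel for what tokenization involves, here is a minimal regular-expression sketch. It is an assumption-laden stand-in, not the library's tokenizer: tokenize_sketch and the separator set (whitespace, common Latin punctuation, and the Arabic comma, semicolon and question mark) are choices made for this example.

```python
import re

# Whitespace plus a small, assumed set of punctuation marks, including the
# Arabic comma (U+060C), semicolon (U+061B) and question mark (U+061F).
SEPARATORS = re.compile(u"[\\s\u060C\u061B\u061F.,;:!?]+")


def tokenize_sketch(text):
    """Split text on the separators and drop empty pieces."""
    return [token for token in SEPARATORS.split(text) if token]


text = u"ذهب الولد\u060C إلى المدرسة."
tokens = tokenize_sketch(text)
```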
