Module normalize

Utility functions used to prepare Arabic text for searching and indexing.

Functions

Individual Functions

strip_tashkeel(text)
Strip vowel marks (tashkeel) from a text and return the resulting text.
Returns: unicode

strip_tatweel(text)
Strip tatweel (the elongation character) from a text and return the resulting text.
Returns: unicode

normalize_hamza(text)
Normalize Hamza forms into one form and return the resulting text.
Returns: unicode

normalize_lamalef(text)
Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text.
Returns: unicode

normalize_spellerrors(text)
Normalize some spelling variants by converting TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text.
Returns: unicode

All-in-One Normalization Function

normalize_searchtext(text)
Normalize input text and return the resulting text.
Returns: unicode

Variables

__package__ = 'tashaphyne'

Function Details

strip_tashkeel(text)

Strip vowel marks (tashkeel) from a text and return the resulting text. The stripped marks are:

FATHA, DAMMA, KASRA
SUKUN
SHADDA
FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text=u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
العربية

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The stripped text.
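
The behaviour described above can be approximated with a regular expression over the listed marks. This is a minimal sketch, not the module's actual source: it assumes Python 3 (where `str` is Unicode), and the names `TASHKEEL` and `strip_tashkeel_sketch` are hypothetical. The code points are those assigned to these marks in the Unicode Arabic block.

```python
import re

# Code points of the eight marks listed above (Unicode Arabic block).
TASHKEEL = (
    "\u064B"  # FATHATAN
    "\u064C"  # DAMMATAN
    "\u064D"  # KASRATAN
    "\u064E"  # FATHA
    "\u064F"  # DAMMA
    "\u0650"  # KASRA
    "\u0651"  # SHADDA
    "\u0652"  # SUKUN
)

# Character class matching any single mark from the list.
_TASHKEEL_RE = re.compile("[" + TASHKEEL + "]")


def strip_tashkeel_sketch(text):
    """Hypothetical helper: remove every mark in TASHKEEL from text."""
    return _TASHKEEL_RE.sub("", text)


# The vocalized word from the example above, spelled with escapes so the
# exact code points are visible.
vocalized = (
    "\u0627\u0644\u0652\u0639\u064E\u0631\u064E"
    "\u0628\u0650\u064A\u0651\u0629\u064F"
)
plain = strip_tashkeel_sketch(vocalized)
```

Only the marks are removed; base letters, spaces, and any non-Arabic characters pass through unchanged.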

strip_tatweel(text)

Strip tatweel (the elongation character) from a text and return the resulting text.

Example:

>>> text=u"العـــــربية"
>>> strip_tatweel(text)
العربية

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The stripped text.
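
Since tatweel is a single character (U+0640 in Unicode), the same effect can be sketched with a plain string replacement. The name `strip_tatweel_sketch` is hypothetical and the snippet assumes Python 3; it illustrates the idea rather than reproducing the module's code.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, used only to stretch a word visually


def strip_tatweel_sketch(text):
    """Hypothetical helper: drop every tatweel character from text."""
    return text.replace(TATWEEL, "")


# The stretched word from the example above, with five tatweel characters.
stretched = "\u0627\u0644\u0639" + TATWEEL * 5 + "\u0631\u0628\u064A\u0629"
compact = strip_tatweel_sketch(stretched)
```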

normalize_hamza(text)

Normalize Hamza forms into one form and return the resulting text. The conversions are:

Letters converted into HAMZA: WAW_HAMZA, YEH_HAMZA
Letters converted into ALEF: ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, HAMZA_ABOVE, HAMZA_BELOW

Example:

>>> text=u"أهؤلاء من أولئكُ"
>>> normalize_hamza(text)
اهءلاء من اولءكُ

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The converted text.
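
The two conversion groups listed above map naturally onto a translation table. The following is a sketch under assumptions: Python 3, standard Unicode code points for the named letters, and a hypothetical helper name `normalize_hamza_sketch`. It follows the documented mapping literally and is not taken from the module's source.

```python
HAMZA = "\u0621"
ALEF = "\u0627"

# Keys are code points, as required by str.translate.
HAMZA_TABLE = {
    0x0624: HAMZA,  # WAW_HAMZA
    0x0626: HAMZA,  # YEH_HAMZA
    0x0622: ALEF,   # ALEF_MADDA
    0x0623: ALEF,   # ALEF_HAMZA_ABOVE
    0x0625: ALEF,   # ALEF_HAMZA_BELOW
    0x0654: ALEF,   # HAMZA_ABOVE
    0x0655: ALEF,   # HAMZA_BELOW
}


def normalize_hamza_sketch(text):
    """Hypothetical helper: fold Hamza variants as listed above."""
    return text.translate(HAMZA_TABLE)


# First word of the example above: ALEF_HAMZA_ABOVE, HEH, WAW_HAMZA,
# LAM, ALEF, HAMZA.
word = "\u0623\u0647\u0624\u0644\u0627\u0621"
folded = normalize_hamza_sketch(word)
```

As in the example above, characters outside the table, including vowel marks such as DAMMA, are left in place.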

normalize_lamalef(text)

Normalize Lam Alef ligatures into two letters (LAM and ALEF) and return the resulting text. Some systems represent the Lam Alef ligature as a single character; this function converts it into two letters. The ligatures converted into LAM and ALEF are:

LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE

Example:

>>> text=u"لانها لالئ الاسلام"
>>> normalize_lamalef(text)
لانها لالئ الاسلام

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The converted text.
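
A translation table can also expand one character into two, which is what this conversion needs. The sketch below assumes Python 3 and assumes that the four names above refer to the isolated presentation forms in the Unicode Arabic Presentation Forms-B block; whether the module also handles the final forms is not stated here. `normalize_lamalef_sketch` is a hypothetical name.

```python
LAM_PLUS_ALEF = "\u0644\u0627"  # LAM followed by ALEF

# Every ligature is expanded to plain LAM + ALEF, as described above.
LAMALEF_TABLE = {
    0xFEFB: LAM_PLUS_ALEF,  # LAM_ALEF
    0xFEF7: LAM_PLUS_ALEF,  # LAM_ALEF_HAMZA_ABOVE
    0xFEF9: LAM_PLUS_ALEF,  # LAM_ALEF_HAMZA_BELOW
    0xFEF5: LAM_PLUS_ALEF,  # LAM_ALEF_MADDA_ABOVE
}


def normalize_lamalef_sketch(text):
    """Hypothetical helper: expand single-character Lam Alef ligatures."""
    return text.translate(LAMALEF_TABLE)


# One ligature character followed by NOON, HEH, ALEF.
with_ligature = "\uFEFB\u0646\u0647\u0627"
expanded = normalize_lamalef_sketch(with_ligature)
```

The result is one character longer than the input for each ligature replaced; text that already uses separate LAM and ALEF letters is returned unchanged.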

normalize_spellerrors(text)

Normalize some spelling variants by converting TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text. In some contexts users do not distinguish between TEH_MARBUTA and HEH, or between ALEF_MAKSURA and YEH. The conversions are:

TEH_MARBUTA into HEH
ALEF_MAKSURA into YEH

Example:

>>> text=u"اشترت سلمى دمية وحلوى"
>>> normalize_spellerrors(text)
اشترت سلمي دميه وحلوي

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The converted text.
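
The two conversions above are one-to-one character substitutions, so a small translation table is enough to illustrate them. This sketch assumes Python 3 and the standard Unicode code points for the four letters; `normalize_spellerrors_sketch` is a hypothetical name, not the module's function.

```python
SPELL_TABLE = {
    0x0629: "\u0647",  # TEH_MARBUTA  -> HEH
    0x0649: "\u064A",  # ALEF_MAKSURA -> YEH
}


def normalize_spellerrors_sketch(text):
    """Hypothetical helper: apply the two substitutions listed above."""
    return text.translate(SPELL_TABLE)


# Two words from the example above: one ending in ALEF_MAKSURA,
# one ending in TEH_MARBUTA.
name = "\u0633\u0644\u0645\u0649"
doll = "\u062F\u0645\u064A\u0629"
normalized = [normalize_spellerrors_sketch(w) for w in (name, doll)]
```

Note that the substitution is applied wherever the letters occur, which is acceptable for building a search index but loses information that cannot be recovered from the normalized text.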

normalize_searchtext(text)

Normalize input text and return the resulting text. The text is normalized by:

stripping tashkeel
stripping tatweel
normalizing Hamza
normalizing Lam Alef
normalizing Teh Marbuta and Alef Maksura

Example:

>>> text=u'أستشتري دمـــى آلية لأبنائك قبل الإغلاق'
>>> normalize_searchtext(text)
استشتري دمي اليه لابناءك قبل الاغلاق

Parameters:

text (unicode) - Arabic text.

Returns: unicode

The normalized text.
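
The five steps above can be sketched as a sequence of translation tables applied in the listed order. This is an illustration under assumptions, not the module's implementation: it assumes Python 3, standard Unicode code points for the named characters, and isolated-form Lam Alef ligatures; `STEPS` and `normalize_searchtext_sketch` are hypothetical names.

```python
HAMZA, ALEF, LAM_ALEF = "\u0621", "\u0627", "\u0644\u0627"

# Each step is a str.translate table; mapping to None deletes the character.
STRIP_TASHKEEL = {cp: None for cp in range(0x064B, 0x0653)}  # FATHATAN..SUKUN
STRIP_TATWEEL = {0x0640: None}
FOLD_HAMZA = {
    0x0624: HAMZA, 0x0626: HAMZA,             # WAW_HAMZA, YEH_HAMZA
    0x0622: ALEF, 0x0623: ALEF, 0x0625: ALEF,  # ALEF with madda or hamza
    0x0654: ALEF, 0x0655: ALEF,               # HAMZA_ABOVE, HAMZA_BELOW
}
EXPAND_LAMALEF = {cp: LAM_ALEF for cp in (0xFEFB, 0xFEF7, 0xFEF9, 0xFEF5)}
FOLD_SPELLING = {0x0629: "\u0647", 0x0649: "\u064A"}  # TEH_MARBUTA, ALEF_MAKSURA

# Same order as the list above.
STEPS = [STRIP_TASHKEEL, STRIP_TATWEEL, FOLD_HAMZA, EXPAND_LAMALEF, FOLD_SPELLING]


def normalize_searchtext_sketch(text):
    """Hypothetical helper: apply every normalization step in order."""
    for table in STEPS:
        text = text.translate(table)
    return text


# Second word of the example above: DAL, MEEM, three tatweels, ALEF_MAKSURA.
stretched = "\u062F\u0645\u0640\u0640\u0640\u0649"
result = normalize_searchtext_sketch(stretched)
```

In this sketch no step produces a character that a later step consumes, so the tables could also be merged into one and applied in a single pass; keeping them separate mirrors the list above and makes each step easy to test on its own.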