Package tashaphyne :: Module normalize

Module normalize

Utility functions used to prepare Arabic text for searching and indexing.

Functions
    Individual Functions
strip_tashkeel(text) -> unicode
Strip vowel marks (tashkeel) from a text and return the resulting text.
strip_tatweel(text) -> unicode
Strip tatweel from a text and return the resulting text.
normalize_hamza(text) -> unicode
Normalize Hamza forms into one form, and return the resulting text.
normalize_lamalef(text) -> unicode
Normalize Lam Alef ligatures into two letters (LAM and ALEF), and return the resulting text.
normalize_spellerrors(text) -> unicode
Normalize some spelling errors, such as TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text.
    Normalize One Function
normalize_searchtext(text) -> unicode
Normalize input text and return the resulting text.
Variables
  __package__ = 'tashaphyne'
Function Details

strip_tashkeel(text)

Strip vowel marks (tashkeel) from a text and return the resulting text. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • SHADDA
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text=u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
العربية
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The stripped text.
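
The behaviour described for strip_tashkeel can be approximated in a few lines. The following is a minimal sketch, not the library's own source: it assumes Python 3, the name strip_tashkeel_sketch is hypothetical, and it assumes the listed marks correspond to the Unicode range U+064B through U+0652.

```python
import re

# Assumption: the marks listed above map to the Unicode range U+064B..U+0652
# (FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SHADDA, SUKUN).
TASHKEEL_PATTERN = re.compile("[\u064b-\u0652]")


def strip_tashkeel_sketch(text):
    """Hypothetical stand-in: delete every mark matched by TASHKEEL_PATTERN."""
    return TASHKEEL_PATTERN.sub("", text)


# The vowelled word from the example above, written with escapes so the
# code points are unambiguous.
word = "\u0627\u0644\u0652\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651\u0629\u064f"
stripped = strip_tashkeel_sketch(word)
print(ascii(stripped))
```

A character class keeps the sketch to a single pass over the text. Marks outside this range are left untouched.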

strip_tatweel(text)

Strip tatweel (the elongation character) from a text and return the resulting text.

Example:

>>> text=u"العـــــربية"
>>> strip_tatweel(text)
العربية
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The stripped text.
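
Removing tatweel amounts to deleting a single character. Here is a minimal sketch under the same assumptions (Python 3, hypothetical name strip_tatweel_sketch, not the library's source), taking tatweel to be U+0640.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, the elongation character


def strip_tatweel_sketch(text):
    """Hypothetical stand-in: remove every tatweel character from text."""
    return text.replace(TATWEEL, "")


# The stretched word from the example above: five tatweels after the third letter.
stretched = "\u0627\u0644\u0639" + TATWEEL * 5 + "\u0631\u0628\u064a\u0629"
plain = strip_tatweel_sketch(stretched)
print(ascii(plain))
```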

normalize_hamza(text)

Normalize Hamza forms into one form, and return the resulting text. The conversions are:

  • Letters converted into HAMZA: WAW_HAMZA, YEH_HAMZA
  • Letters converted into ALEF: ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, HAMZA_ABOVE, HAMZA_BELOW

Example:

>>> text=u"أهؤلاء من أولئكُ"
>>> normalize_hamza(text)
اهءلاء من اولءكُ
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The converted text.
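
The two conversion groups above can be expressed as one character translation table. This is a minimal sketch, assuming Python 3; normalize_hamza_sketch and the table name are hypothetical, and the code points are an assumed mapping of the names in the list, not taken from the library's source.

```python
HAMZA = "\u0621"
ALEF = "\u0627"

# Assumed code points for the letter names listed above.
HAMZA_TABLE = str.maketrans({
    "\u0624": HAMZA,  # WAW_HAMZA
    "\u0626": HAMZA,  # YEH_HAMZA
    "\u0622": ALEF,   # ALEF_MADDA
    "\u0623": ALEF,   # ALEF_HAMZA_ABOVE
    "\u0625": ALEF,   # ALEF_HAMZA_BELOW
    "\u0654": ALEF,   # HAMZA_ABOVE (combining mark)
    "\u0655": ALEF,   # HAMZA_BELOW (combining mark)
})


def normalize_hamza_sketch(text):
    """Hypothetical stand-in: map each Hamza form to HAMZA or ALEF."""
    return text.translate(HAMZA_TABLE)


# First and last words of the example above, written with escapes.
first = normalize_hamza_sketch("\u0623\u0647\u0624\u0644\u0627\u0621")
last = normalize_hamza_sketch("\u0623\u0648\u0644\u0626\u0643\u064f")
print(ascii(first), ascii(last))
```

As in the example output, vowel marks such as the final DAMMA are left in place; only Hamza forms change.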

normalize_lamalef(text)

Normalize Lam Alef ligatures into two letters (LAM and ALEF), and return the resulting text. Some systems represent the Lam Alef ligature as a single character; this function converts it into two letters. The ligatures converted into LAM and ALEF are:

  • LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE

Example:

>>> text=u"لانها لالئ الاسلام"
>>> normalize_lamalef(text)
لانها لالئ الاسلام
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The converted text.
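
Because the ligature and the two-letter sequence look identical when rendered, the conversion is easier to see with explicit code points. The following is a minimal sketch, assuming Python 3 and assuming the four ligature names refer to the isolated forms in the Arabic Presentation Forms-B block; normalize_lamalef_sketch is a hypothetical name, not the library's source.

```python
LAM_ALEF_PAIR = "\u0644\u0627"  # LAM followed by ALEF

# Assumed code points (isolated presentation forms) for the listed ligatures.
LAMALEF_TABLE = str.maketrans({
    "\ufefb": LAM_ALEF_PAIR,  # LAM_ALEF
    "\ufef7": LAM_ALEF_PAIR,  # LAM_ALEF_HAMZA_ABOVE
    "\ufef9": LAM_ALEF_PAIR,  # LAM_ALEF_HAMZA_BELOW
    "\ufef5": LAM_ALEF_PAIR,  # LAM_ALEF_MADDA_ABOVE
})


def normalize_lamalef_sketch(text):
    """Hypothetical stand-in: expand each ligature into LAM + ALEF."""
    return text.translate(LAMALEF_TABLE)


# A word that starts with the single-character LAM_ALEF ligature.
ligated = "\ufefb\u0646\u0647\u0627"
expanded = normalize_lamalef_sketch(ligated)
print(len(ligated), len(expanded))
```

The expanded string is one character longer than the input for each ligature it contained.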

normalize_spellerrors(text)

Normalize some spelling errors, such as TEH_MARBUTA into HEH and ALEF_MAKSURA into YEH, and return the resulting text. In some contexts, users do not distinguish between TEH_MARBUTA and HEH, or between ALEF_MAKSURA and YEH. The conversions are:

  • TEH_MARBUTA into HEH
  • ALEF_MAKSURA into YEH

Example:

>>> text=u"اشترت سلمى دمية وحلوى"
>>> normalize_spellerrors(text)
اشترت سلمي دميه وحلوي
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The converted text.
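
The two conversions above are one-to-one character substitutions. Here is a minimal sketch, assuming Python 3; normalize_spellerrors_sketch is a hypothetical name and the code points are an assumed mapping of the letter names, not the library's source.

```python
# Assumed code points for the two conversions listed above.
SPELL_TABLE = str.maketrans({
    "\u0629": "\u0647",  # TEH_MARBUTA -> HEH
    "\u0649": "\u064a",  # ALEF_MAKSURA -> YEH
})


def normalize_spellerrors_sketch(text):
    """Hypothetical stand-in: fold TEH_MARBUTA and ALEF_MAKSURA variants."""
    return text.translate(SPELL_TABLE)


# Two words from the example above, written with escapes:
# one ends in ALEF_MAKSURA, the other in TEH_MARBUTA.
name = normalize_spellerrors_sketch("\u0633\u0644\u0645\u0649")
doll = normalize_spellerrors_sketch("\u062f\u0645\u064a\u0629")
print(ascii(name), ascii(doll))
```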

normalize_searchtext(text)

Normalize input text and return the resulting text. The text is normalized by:

  • stripping tashkeel
  • stripping tatweel
  • normalizing Hamza
  • normalizing Lam Alef
  • normalizing Teh Marbuta and Alef Maksura

Example:

>>> text=u'أستشتري دمـــى آلية لأبنائك قبل الإغلاق'
>>> normalize_searchtext(text)
استشتري دمي اليه لابناءك قبل الاغلاق
Parameters:
  • text (unicode) - Arabic text.
Returns: unicode
The normalized text.
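
The five steps listed for normalize_searchtext can be chained in the order given. The following is a self-contained sketch, assuming Python 3; every name in it is hypothetical, the code points are an assumed mapping of the letter names used on this page, and it does not reproduce the library's source.

```python
import re

TASHKEEL = re.compile("[\u064b-\u0652]")  # vowel marks, SHADDA, SUKUN
TATWEEL = "\u0640"
HAMZA, ALEF, LAM = "\u0621", "\u0627", "\u0644"

# Step 3: Hamza forms -> HAMZA or ALEF.
HAMZA_TABLE = str.maketrans({
    "\u0624": HAMZA, "\u0626": HAMZA,
    "\u0622": ALEF, "\u0623": ALEF, "\u0625": ALEF,
    "\u0654": ALEF, "\u0655": ALEF,
})
# Step 4: Lam Alef ligatures -> LAM + ALEF.
LAMALEF_TABLE = str.maketrans(
    {lig: LAM + ALEF for lig in "\ufefb\ufef7\ufef9\ufef5"}
)
# Step 5: TEH_MARBUTA -> HEH, ALEF_MAKSURA -> YEH.
SPELL_TABLE = str.maketrans({"\u0629": "\u0647", "\u0649": "\u064a"})


def normalize_searchtext_sketch(text):
    """Hypothetical stand-in: apply the five steps in the listed order."""
    text = TASHKEEL.sub("", text)        # 1. strip tashkeel
    text = text.replace(TATWEEL, "")     # 2. strip tatweel
    text = text.translate(HAMZA_TABLE)   # 3. normalize Hamza
    text = text.translate(LAMALEF_TABLE) # 4. normalize Lam Alef
    return text.translate(SPELL_TABLE)   # 5. normalize spelling variants


# Three words from the example above, written with escapes.
dolls = normalize_searchtext_sketch("\u062f\u0645\u0640\u0640\u0640\u0649")
robotic = normalize_searchtext_sketch("\u0622\u0644\u064a\u0629")
sons = normalize_searchtext_sketch("\u0644\u0623\u0628\u0646\u0627\u0626\u0643")
print(ascii(dolls), ascii(robotic), ascii(sons))
```

Applying the same function to both the indexed text and the query makes spelling variants of a word compare equal, which is the purpose stated in the module description.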