The trans module

This module translates national characters into similar-sounding Latin characters (transliteration). At the moment, the Greek, Turkish, Russian, Ukrainian, Czech, Polish and Latvian alphabets are supported (this covers 99% of needs).

Contents

Simple usage
Define user tables
Finally

Simple usage

It's very easy to use:

>>> # coding: utf-8
>>> import trans
>>> u'Hello World!'.encode('trans')
u'Hello World!'
>>> u'Привет, Мир!'.encode('trans')
u'Privet, Mir!'

The codec works only with unicode strings:

>>> 'Hello World!'.encode('trans')
Traceback (most recent call last):
    ...
TypeError: trans codec support only unicode string, <type 'str'> given.

The output stays readable:

>>> s = u'''\
...    -- Раскудрить твою через коромысло в бога душу мать
...             триста тысяч раз едрену вошь тебе в крыло
...             и кактус в глотку! -- взревел разъяренный Никодим.
...    -- Аминь, -- робко добавил из склепа папа Пий.
...                 (c) Г. Л. Олди, "Сказки дедушки вампира".'''
>>>
>>> print s.encode('trans')
   -- Raskudrit tvoyu cherez koromyslo v boga dushu mat
            trista tysyach raz edrenu vosh tebe v krylo
            i kaktus v glotku! -- vzrevel razyarennyy Nikodim.
   -- Amin, -- robko dobavil iz sklepa papa Piy.
                (c) G. L. Oldi, "Skazki dedushki vampira".

Table "id"

Use the table "id" to keep only Latin characters, digits and underscores:

>>> print u'1 2 3 4 5 \n6 7 8 9 0'.encode('trans')
1 2 3 4 5
6 7 8 9 0
>>> print u'1 2 3 4 5 \n6 7 8 9 0'.encode('trans/id')
1_2_3_4_5__6_7_8_9_0
>>> s.encode('trans/id')[-42:-1]
u'_c__G__L__Oldi___Skazki_dedushki_vampira_'
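The filtering step shown above can be approximated in plain Python. This is only a sketch and not the module's implementation: to_identifier is a hypothetical helper, and it assumes the text has already been transliterated to Latin characters, whereas 'trans/id' does both steps at once.

```python
import re

# Hypothetical helper, not part of the trans module.
def to_identifier(text):
    """Replace every character other than A-Z, a-z, 0-9 and "_" with "_"."""
    return re.sub(r'[^A-Za-z0-9_]', '_', text)

print(to_identifier('1 2 3 4 5 \n6 7 8 9 0'))  # 1_2_3_4_5__6_7_8_9_0
```

Spaces, newlines and punctuation each become a single underscore, which is why the line break in the example yields two underscores in a row.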

Define user tables

Simple variant

>>> u'1 2 3 4 5 6 7 8 9 0'.encode('trans/my')
Traceback (most recent call last):
    ...
ValueError: Table "my" not found in tables!
>>> trans.tables['my'] = {u'1': u'A', u'2': u'B'}
>>> u'1 2 3 4 5 6 7 8 9 0'.encode('trans/my')
u'A_B________________'
>>>

A little harder

A table can consist of two parts: a map of diphthongs and a map of characters. Diphthongs are processed first, by simple substring replacement. Then each character of the string is replaced by its mapping from the character map. If a character is absent from the character map, the key None is looked up; if that key is absent too, the default character u'_' is used.

>>> diphthongs = {u'11': u'AA', u'22': u'BB'}
>>> characters = {u'a': u'z', u'b': u'y', u'c': u'x',
...               u'A': u'A', u'B': u'B', None: u'-'}
>>> trans.tables['test'] = (diphthongs, characters)
>>> u'11abc22cbaCC'.encode('trans/test')
u'AAzyxBBxyz--'

The characters produced by diphthong replacement are also processed by the character map:

>>> diphthongs = {u'11': u'AA', u'22': u'BB'}
>>> characters = {u'a': u'z', u'b': u'y', u'c': u'x', None: u'-'}
>>> trans.tables['test'] = (diphthongs, characters)
>>> u'11abc22cbaCC'.encode('trans/test')
u'--zyx--xyz--'

Without the diphthongs

These two tables are equivalent:

>>> characters = {u'a': u'z', u'b': u'y', u'c': u'x', None: u'-'}
>>> trans.tables['t1'] = characters
>>> trans.tables['t2'] = ({}, characters)
>>> u'11abc22cbaCC'.encode('trans/t1') == u'11abc22cbaCC'.encode('trans/t2')
True
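The lookup rules described in this section can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the module's actual code: apply_table is a hypothetical helper, and the order in which diphthongs are replaced is simply the dictionary's iteration order here.

```python
# Hypothetical helper, not part of the trans module.
def apply_table(text, table):
    """Apply a user table: diphthongs first, then per-character mapping."""
    # A bare dict is treated as a character map with no diphthongs.
    if isinstance(table, dict):
        diphthongs, characters = {}, table
    else:
        diphthongs, characters = table

    # Step 1: plain substring replacement for every diphthong.
    for source, target in diphthongs.items():
        text = text.replace(source, target)

    # Step 2: map each character; fall back to the None key, then to '_'.
    default = characters.get(None, '_')
    return ''.join(characters.get(ch, default) for ch in text)


diphthongs = {'11': 'AA', '22': 'BB'}
characters = {'a': 'z', 'b': 'y', 'c': 'x', None: '-'}

print(apply_table('11abc22cbaCC', (diphthongs, characters)))  # --zyx--xyz--
print(apply_table('11abc22cbaCC', characters))                # ----zyx----xyz--
```

In the first call the diphthongs produce 'AA' and 'BB', which the character map then turns into '-' because it has no entries for 'A' or 'B'. In the second call there are no diphthongs, so every digit falls straight through to the None key.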

Finally

Special thanks to Yuri Yurevich aka j2a for the kick in the right direction.
- http://www.python.su/forum/viewtopic.php?pid=28965
- http://code.djangoproject.com/browser/django/trunk/django/contrib/admin/media/js/urlify.js
Please forgive my bad English. I promise to improve it.