ltchinese documentation¶

ltchinese - A library of utilities for the Chinese language (pinyin, zhuyin, encodings, phonetics, etc.) from http://lost-theory.org.

ocrat - represents the data from the Ocrat Mirror Project (stroke order GIFs, etc.) at http://lost-theory.org/ocrat

conversion - conversion functions between Chinese encodings / representations

annotate - annotate and romanize Chinese characters / words with pinyin, zhuyin, etc.

phonetics - represents the data from the Mandarin Phonetics table (valid Mandarin syllables, male / female pronunciation clips)

Ocrat¶

Represents the data from the Ocrat Mirror Project (stroke order GIFs, etc.) at http://lost-theory.org/ocrat.

ltchinese.ocrat.get_bigchar(uni_char)¶: Returns the URL of the big character example of the Unicode character uni_char

ltchinese.ocrat.get_calig(uni_char)¶: Returns the URL of the calligraphy example of the Unicode character uni_char

ltchinese.ocrat.get_sod(uni_char)¶: Returns the URL of the stroke order diagram of the Unicode character uni_char

Conversion¶

Conversion functions between various Chinese encodings and representations.

The following terms name the encodings and representations used by the conversion functions. The samples on the right are for the character U+4E00 (yi1, “one”):

GB2312 (Kuten/Quwei form): "5027" [used in the "GB2312" field of Unihan.txt]
GB2312 (ISO-2022 form):    "523B" [the "internal representation" of GB code]
EUC-CN:                    "D2BB" [this is the "external encoding" of GB2312-
                                   ISO2022's "internal representation"; also
                                   the form that Ocrat uses]
UTF-8:                     "E4 B8 80" [used in the "UTF-8" field in Unihan.txt]
-------------------------------------------------------------------------------
Unihan UCN:                "U+4E00"   [used by Unicode Inc.]
Unihan NCR (decimal):      "&#19968;" [Numerical Character Reference ...
Unihan NCR (hex):          "&#x4E00;"  ... used in XML/HTML/SGML]
-------------------------------------------------------------------------------
internal Python unicode:   u"\u4e00"  [this is the most useful form!]
internal Python 'utf8':    "\xe4\xb8\x80"
internal Python 'gb2312':  "\xd2\xbb"
internal Python 'euc-cn':  "\xd2\xbb"
internal Python 'gb18030': "\xd2\xbb"
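
The byte-level rows of the table can be checked with Python's standard codecs. The following is a small sketch that uses only the standard library, not ltchinese itself. It assumes Python 3, where every str is Unicode and the u prefix is optional; hex_upper is a hypothetical helper name.

```python
import binascii

yi = "\u4e00"  # the character "one" (yi1)

# Encode the same character with each codec named in the table.
encoded = {name: yi.encode(name) for name in ("utf-8", "gb2312", "euc-cn", "gb18030")}


def hex_upper(data):
    """Format a bytes object as upper-case hex digits without separators."""
    return binascii.hexlify(data).decode("ascii").upper()


for name, data in encoded.items():
    print(name, hex_upper(data))
```

For this character the 'gb2312', 'euc-cn' and 'gb18030' codecs all produce the same two bytes, which is why the last three rows of the table are identical.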

See these resources for more information:

Wikipedia “Extended_Unix_Code” article
- “EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters ... the ISO-2022 form of GB2312 is not normally used”
Wikipedia “HZ_(encoding)” article (the example conversion)
Wikipedia “Numeric_character_reference” article
Unihan (look for “Encoding forms”, “Mappings to Major Standards”)
- e.g. http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=4E00

ltchinese.conversion.euc_to_python(hexstr)¶: Convert an EUC-CN (GB2312) hex string (e.g. “d2bb”) to a Python unicode string

ltchinese.conversion.euc_to_utf8(euchex)¶: Convert EUC hex (e.g. “d2bb”) to UTF8 hex (e.g. “e4 b8 80”)

ltchinese.conversion.gb2312_to_euc(gb2312hex)¶: Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”)

ltchinese.conversion.kuten_to_gb2312(kuten)¶: Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex (internal representation)
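
The chain from kuten form to EUC-CN to Unicode is plain byte arithmetic, which the sketch below illustrates. It is a standalone approximation for Python 3 that uses only the standard library. The function names are hypothetical and are not the ltchinese implementations, whose input validation and output casing may differ.

```python
def kuten_to_iso2022(kuten):
    # "5027" means zone 50, point 27; the ISO-2022 form adds 0x20 to each.
    zone, point = int(kuten[:2]), int(kuten[2:])
    return "%02X%02X" % (zone + 0x20, point + 0x20)


def iso2022_to_euc(gbhex):
    # EUC-CN sets the high bit (0x80) on both bytes of the ISO-2022 form.
    first, second = int(gbhex[:2], 16), int(gbhex[2:], 16)
    return "%02X%02X" % (first | 0x80, second | 0x80)


def euc_hex_to_text(euchex):
    # Turn the hex digits into two bytes and decode them as GB2312.
    return bytes.fromhex(euchex).decode("gb2312")


def text_to_utf8_hex(text):
    # Space-separated UTF-8 bytes, matching the "E4 B8 80" style in the table.
    return " ".join("%02X" % byte for byte in text.encode("utf-8"))


char = euc_hex_to_text(iso2022_to_euc(kuten_to_iso2022("5027")))
print(text_to_utf8_hex(char))
```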

ltchinese.conversion.ncr_to_python(ncr)¶: Convert a Unicode Numerical Character Reference (e.g. “19968”, “&#19968;”, or “&#x4E00;”) to native Python Unicode (u'\u4e00')

ltchinese.conversion.ncrstring_to_python(ncr_string)¶: Convert a string of Unicode NCRs (e.g. “&#19968;&#x4E00;”) to native Python Unicode (u'\u4e00\u4e00')
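
Decoding NCRs amounts to reading the number between the delimiters and mapping it to a code point. The sketch below shows one way to do that in Python 3 with a regular expression. The names ncr_to_text and ncr_string_to_text are hypothetical, and unlike ncr_to_python this sketch does not accept a bare number such as “19968”.

```python
import re

# Hex form: &#x4E00;   Decimal form: &#19968;
NCR_PATTERN = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def _decode(match):
    """Turn one regex match into the character it references."""
    hex_digits, dec_digits = match.groups()
    return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))


def ncr_to_text(ncr):
    """Decode exactly one NCR; anything else raises ValueError."""
    match = NCR_PATTERN.fullmatch(ncr)
    if match is None:
        raise ValueError("not a numeric character reference: %r" % ncr)
    return _decode(match)


def ncr_string_to_text(ncr_string):
    """Decode every NCR in a longer string and leave other text untouched."""
    return NCR_PATTERN.sub(_decode, ncr_string)
```

The standard library's html.unescape performs a similar replacement and also handles named entities.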

ltchinese.conversion.python_to_euc(uni_char)¶: Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex ('d2bb')

ltchinese.conversion.python_to_ncr(uni_char, **options)¶

Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode NCR ('&#x4E00;').

Change the output format by passing the following parameters:

no parameters - default behavior: hex=True, xml=True
decimal=True - output the decimal value instead of hex
hex=False - (same as decimal=True)
xml=False - return only the decimal or hex value, i.e. strip the leading ‘&#’ or ‘&#x’ and the trailing ‘;’
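
The option combinations above can be sketched as a small standalone function. This is an illustration for Python 3, not the library's code: to_ncr is a hypothetical name, it takes explicit keyword arguments instead of **options, and it assumes the standard '&#x' prefix for hex references.

```python
def to_ncr(char, hex=True, decimal=False, xml=True):
    """Format one character as a numeric character reference.

    The parameter names mirror the documented options, so `hex` shadows
    the built-in of the same name inside this function.
    """
    codepoint = ord(char)
    if decimal or not hex:
        # decimal=True and hex=False are treated as the same request.
        prefix, digits = "&#", str(codepoint)
    else:
        prefix, digits = "&#x", "%X" % codepoint
    # xml=False drops the surrounding markup and keeps only the number.
    return prefix + digits + ";" if xml else digits


for kwargs in ({}, {"decimal": True}, {"hex": False}, {"xml": False}):
    print(kwargs, to_ncr("\u4e00", **kwargs))
```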

ltchinese.conversion.python_to_ucn(uni_char)¶: Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN ('U+4E00')

ltchinese.conversion.string_to_ncr(uni_string, **options)¶: Converts a Python unicode string (e.g. u'p\u012bn y\u012bn') to the corresponding Unicode NCRs. See python_to_ncr for formatting options.

ltchinese.conversion.ucn_to_python(ucn)¶: Convert a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00')
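
The UCN form is the code point written as at least four upper-case hex digits after “U+”. A minimal round-trip sketch for Python 3 follows. The names to_ucn and from_ucn are hypothetical stand-ins, not the library functions.

```python
def to_ucn(char):
    """One character -> 'U+XXXX' (zero-padded to at least four hex digits)."""
    return "U+%04X" % ord(char)


def from_ucn(ucn):
    """'U+4E00' or bare '4E00' -> the corresponding one-character string."""
    text = ucn.strip()
    if text.upper().startswith("U+"):
        text = text[2:]
    return chr(int(text, 16))


print(to_ucn("\u4e00"), from_ucn("U+4E00") == from_ucn("4E00"))
```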

Annotation¶

Functions to annotate and romanize Chinese text. Input is either numeric pinyin (e.g. pin1yin1) or hanzi; output is pinyin with tone marks or zhuyin, as Unicode.

ltchinese.annotate.lastvowel(s)¶: Find the index of the last pinyin vowel (aeiouv) in string s.

ltchinese.annotate.pinyin(pinyin_text)¶: Convert a numeric pinyin string (e.g. ‘pin1yin1’) to the proper Unicode characters

ltchinese.annotate.syllable_to_pinyin(syl)¶: Returns proper unicode string for one pinyin syllable + tone number (e.g. ‘yi1’)
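
Converting numeric pinyin to tone marks comes down to choosing which vowel carries the mark. The sketch below applies the usual placement convention for Python 3: “a” or “e” takes the mark if present, otherwise the “o” of “ou”, otherwise the last vowel. It is an assumption that ltchinese follows the same convention. The names TONE_MARKS, syllable_to_marked and numeric_to_marked are hypothetical, and only lower-case input is handled.

```python
import re

# Marked forms for tones 1-4; "v" in the input stands for "ü".
TONE_MARKS = {
    "a": "āáǎà", "e": "ēéěè", "i": "īíǐì",
    "o": "ōóǒò", "u": "ūúǔù", "ü": "ǖǘǚǜ",
}


def syllable_to_marked(syllable):
    """Convert one syllable such as 'yi1'; tone 5 or no digit stays unmarked."""
    body, tone = syllable, 5
    if syllable and syllable[-1].isdigit():
        body, tone = syllable[:-1], int(syllable[-1])
    body = body.replace("v", "ü")
    if not 1 <= tone <= 4:
        return body
    if "a" in body:
        index = body.index("a")
    elif "e" in body:
        index = body.index("e")
    elif "ou" in body:
        index = body.index("ou")  # the mark goes on the "o"
    else:
        index = max(body.rfind(vowel) for vowel in "iouü")
        if index < 0:
            return body  # no vowel found, nothing to mark
    return body[:index] + TONE_MARKS[body[index]][tone - 1] + body[index + 1:]


def numeric_to_marked(text):
    """Convert every letters-plus-tone-digit run in a longer string."""
    return re.sub(r"[a-z]+[1-5]", lambda m: syllable_to_marked(m.group(0)), text)


print(numeric_to_marked("pin1yin1"))
```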

ltchinese.annotate.syllable_to_zhuyin(syllable)¶: Convert a single numeric pinyin syllable (e.g. “yi1”) to a Zhuyin unicode string

ltchinese.annotate.zh_annotate_pinyin(zhtext, encoding='utf-8')¶: Returns a list of annotated characters (with possible pinyin romanizations) for the input Chinese text

ltchinese.annotate.zhuyin(pinyin_text)¶: Convert a numeric pinyin string (e.g. ‘pin1yin1’) to a Zhuyin unicode string

Phonetics¶

class ltchinese.Phonetics¶

Represents the Phonetics table at http://lost-theory.org/chinese/phonetics

getsound_bpmf(syllable)¶: Returns the URL of the sound file for syllable (e.g. “yi1” or “yi”) for the zhuyin / BPMF pronunciation

getsound_female(syllable)¶: Returns the URL of the sound file for syllable (e.g. “yi1”) for the female voice

getsound_male(syllable)¶: Returns the URL of the sound file for syllable (e.g. “yi1”) for the male voice

getsounds(syllable)¶: Returns the URLs (male, female, bpmf) for the given syllable (e.g. “yi1”)

valid(syllable)¶: Returns True if the syllable is recognized in any of the phonetics tables
