ltchinese documentation

ltchinese - A library of utilities for the Chinese language (pinyin, zhuyin, encodings, phonetics, etc.) from http://lost-theory.org.

  • ocrat - represents the data from the Ocrat Mirror Project (stroke order GIFs, etc.) at http://lost-theory.org/ocrat

  • conversion - conversion functions between Chinese encodings / representations

  • annotate - annotate and romanize Chinese characters / words with pinyin, zhuyin, etc.

  • phonetics - represents the data from the Mandarin Phonetics table (valid Mandarin syllables, male / female pronunciation clips)

Ocrat

Represents the data from the Ocrat Mirror Project (stroke order GIFs, etc.) at http://lost-theory.org/ocrat.

ltchinese.ocrat.get_bigchar(uni_char)
Returns the URL of the big character example of the Unicode character uni_char
ltchinese.ocrat.get_calig(uni_char)
Returns the URL of the calligraphy example of the Unicode character uni_char
ltchinese.ocrat.get_sod(uni_char)
Returns the URL of the stroke order diagram of the Unicode character uni_char

Conversion

Conversion functions between various Chinese encodings and representations.

The following terms name the encodings and representations used by the conversion functions. The sample values on the right are for the character U+4E00 (yi1; “one”):

GB2312 (Kuten/Quwei form): "5027" [used in the "GB2312" field of Unihan.txt]
GB2312 (ISO-2022 form):    "523B" [the "internal representation" of GB code]
EUC-CN:                    "D2BB" [this is the "external encoding" of GB2312-
                                   ISO2022's "internal representation"; also
                                   the form that Ocrat uses]
UTF-8:                     "E4 B8 80" [used in the "UTF-8" field in Unihan.txt]
-------------------------------------------------------------------------------
Unihan UCN:                "U+4E00"   [used by Unicode Inc.]
Unihan NCR (decimal):      "&#19968;" [Numerical Character Reference ...
Unihan NCR (hex):          "&#x4E00;"  ... used in XML/HTML/SGML]
-------------------------------------------------------------------------------
internal Python unicode:   u"\u4e00"  [this is the most useful form!]
internal Python 'utf8':    "\xe4\xb8\x80"
internal Python 'gb2312':  "\xd2\xbb"
internal Python 'euc-cn':  "\xd2\xbb"
internal Python 'gb18030': "\xd2\xbb"
See these resources for more information:
  • Wikipedia “Extended_Unix_Code” article
    • “EUC-CN is the usual way to use the GB2312 standard for simplified Chinese characters ... the ISO-2022 form of GB2312 is not normally used”
  • Wikipedia “HZ_(encoding)” article (the example conversion)
  • Wikipedia “Numeric_character_reference” article
  • Unihan (look for “Encoding forms”, “Mappings to Major Standards”)
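The sample values in the table above can be reproduced with Python's built-in codecs. The following is a minimal sketch that does not use ltchinese at all; it assumes Python 3, where every str is already Unicode, and the name forms is purely illustrative.

```python
# Reproduce some rows of the table for U+4E00 using only the standard library.
ch = "\u4e00"

def hex_bytes(data, sep=""):
    """Format a bytes object as upper-case hex pairs joined by sep."""
    return sep.join("%02X" % b for b in data)

forms = {
    "UTF-8": hex_bytes(ch.encode("utf-8"), " "),   # space-separated bytes
    "EUC-CN": hex_bytes(ch.encode("gb2312")),      # Python's gb2312 codec emits EUC-CN bytes
    "GB18030": hex_bytes(ch.encode("gb18030")),    # identical to EUC-CN for this character
    "UCN": "U+%04X" % ord(ch),
}

for name, value in forms.items():
    print("%-8s %s" % (name, value))
```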
ltchinese.conversion.euc_to_python(hexstr)
Convert an EUC-CN (GB2312) hex string (e.g. “d2bb”) to a Python unicode string
ltchinese.conversion.euc_to_utf8(euchex)
Convert EUC hex (e.g. “d2bb”) to UTF8 hex (e.g. “e4 b8 80”)
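As an illustration of what the two EUC functions above describe, here is a rough standard-library equivalent. It is a sketch under the assumption of Python 3; the helper names euc_hex_to_text and euc_hex_to_utf8_hex are hypothetical and are not part of ltchinese.

```python
def euc_hex_to_text(euc_hex):
    """Decode an EUC-CN hex string such as 'd2bb' into a Unicode string."""
    return bytes.fromhex(euc_hex).decode("gb2312")

def euc_hex_to_utf8_hex(euc_hex):
    """Re-encode an EUC-CN hex string as space-separated UTF-8 hex bytes."""
    utf8 = euc_hex_to_text(euc_hex).encode("utf-8")
    return " ".join("%02x" % b for b in utf8)

print(euc_hex_to_text("d2bb"))
print(euc_hex_to_utf8_hex("d2bb"))
```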
ltchinese.conversion.gb2312_to_euc(gb2312hex)
Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”)
ltchinese.conversion.kuten_to_gb2312(kuten)
Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex (internal representation)
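The relationship between the three GB2312 forms is simple byte arithmetic: adding 0x20 to the zone and point gives the ISO-2022 form, and setting the high bit of each byte gives EUC-CN. The sketch below shows this under the assumption that kuten is a four-digit string such as "5027"; the function names are hypothetical and the library's own implementation may differ.

```python
def kuten_to_iso2022_hex(kuten):
    """Convert a 4-digit kuten / quwei string to ISO-2022 (internal) hex."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    if not (1 <= zone <= 94 and 1 <= point <= 94):
        raise ValueError("zone and point must both be in 1..94")
    # Shift both values into the printable 0x21..0x7E range.
    return "%02X%02X" % (zone + 0x20, point + 0x20)

def iso2022_hex_to_euc_hex(gb_hex):
    """Convert ISO-2022 (internal) hex to EUC-CN hex by setting the high bit."""
    high, low = bytes.fromhex(gb_hex)
    return "%02X%02X" % (high | 0x80, low | 0x80)

internal = kuten_to_iso2022_hex("5027")
print(internal, iso2022_hex_to_euc_hex(internal))
```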
ltchinese.conversion.ncr_to_python(ncr)
Convert a Unicode Numerical Character Reference (e.g. “19968”, “&#19968;”, or “&#x4E00;”) to native Python Unicode (u'\u4e00')
ltchinese.conversion.ncrstring_to_python(ncr_string)
Convert a string of Unicode NCRs (e.g. “&#19968;&#19968;”) to native Python Unicode (u'\u4e00\u4e00')
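For comparison, decoding NCRs can be sketched with the standard library's html.unescape. This assumes Python 3 and is not how ltchinese is known to work internally; ncr_to_text and ncr_string_to_text are hypothetical names.

```python
import html

def ncr_to_text(ncr):
    """Decode one NCR given as bare decimal digits or as an &#...; reference."""
    if ncr.isdigit():
        return chr(int(ncr))        # bare decimal code point, e.g. "19968"
    return html.unescape(ncr)       # "&#19968;" or "&#x4E00;"

def ncr_string_to_text(ncr_string):
    """Decode every decimal or hex NCR embedded in a string."""
    return html.unescape(ncr_string)

print(ncr_to_text("19968"), ncr_string_to_text("&#19968;&#x4E00;"))
```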
ltchinese.conversion.python_to_euc(uni_char)
Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex (‘d2bb’)
ltchinese.conversion.python_to_ncr(uni_char, **options)

Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode NCR (‘&x4E00;’).

Change the output format by passing the following parameters:
  • no parameters - default behavior: hex=True, xml=True
  • decimal=True - output the decimal value instead of hex
  • hex=False - (same as decimal=True)
  • xml=False - just display the decimal or hex value, i.e. strip off the ‘&#’, ‘&x’, and ‘;’
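The interaction of these options can be sketched as follows. This is an illustration only: to_ncr is a hypothetical stand-in, its keyword arguments are modelled loosely on the list above, and it emits the standard HTML hex form "&#x...;", which may not match the library's exact output.

```python
def to_ncr(ch, decimal=False, xml=True):
    """Format one character as a numeric character reference.

    decimal=True selects the decimal value instead of hex;
    xml=False returns the bare value without the reference wrapper.
    """
    code_point = ord(ch)
    if decimal:
        value, prefix = str(code_point), "&#"
    else:
        value, prefix = "%X" % code_point, "&#x"
    return prefix + value + ";" if xml else value

for kwargs in ({}, {"decimal": True}, {"xml": False}, {"decimal": True, "xml": False}):
    print(kwargs, to_ncr("\u4e00", **kwargs))
```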
ltchinese.conversion.python_to_ucn(uni_char)
Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN (‘U+4E00’)
ltchinese.conversion.string_to_ncr(uni_string, **options)
Converts a Python unicode string (e.g. u'p\u012bn y\u012bn') to the corresponding Unicode NCRs. See python_to_ncr for formatting options.
ltchinese.conversion.ucn_to_python(ucn)
Convert a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00')
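The UCN form is just the code point in hexadecimal, so the round trip is short. The sketch below assumes Python 3; ucn_to_text and text_to_ucn are hypothetical names, not ltchinese functions.

```python
def ucn_to_text(ucn):
    """Convert 'U+4E00' or '4E00' to the corresponding character."""
    digits = ucn.upper()
    if digits.startswith("U+"):
        digits = digits[2:]
    return chr(int(digits, 16))

def text_to_ucn(ch):
    """Convert one character to its 'U+XXXX' form (at least four hex digits)."""
    return "U+%04X" % ord(ch)

print(text_to_ucn(ucn_to_text("4E00")))
```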

Annotation

Functions to annotate and romanize (pinyin and zhuyin Unicode) Chinese text (numeric pin1yin1 or hanzi).

ltchinese.annotate.lastvowel(s)
Find the index of the last pinyin vowel (aeiouv) in string s.
ltchinese.annotate.pinyin(pinyin_text)
Convert a numeric pinyin string (e.g. ‘pin1yin1’) to the proper Unicode characters
ltchinese.annotate.syllable_to_pinyin(syl)
Returns proper unicode string for one pinyin syllable + tone number (e.g. ‘yi1’)
ltchinese.annotate.syllable_to_zhuyin(syllable)
Convert a single numeric pinyin syllable (e.g. “yi1”) to a Zhuyin unicode string
ltchinese.annotate.zh_annotate_pinyin(zhtext, encoding='utf-8')
Returns a list of annotated characters (with possible pinyin romanizations) for the input Chinese text
ltchinese.annotate.zhuyin(pinyin_text)
Convert a numeric pinyin string (e.g. ‘pin1yin1’) to a Zhuyin unicode string
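To make the numeric-to-accented conversion concrete, here is a minimal sketch of one common tone-placement convention: mark "a" or "e" if present, the "o" of "ou", otherwise the last vowel. It assumes Python 3, treats "v" as "ü", and uses hypothetical names throughout; the library's own rules (for instance, how it uses lastvowel) may differ.

```python
import re
import unicodedata

# Combining diacritics for tones 1-4; any other tone number is left unmarked.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}

def mark_index(letters):
    """Return the index of the vowel that carries the tone mark, or -1."""
    for vowel in "ae":
        if vowel in letters:
            return letters.index(vowel)
    if "ou" in letters:
        return letters.index("o")
    for i in range(len(letters) - 1, -1, -1):
        if letters[i] in "aeiou\u00fc":
            return i
    return -1

def syllable_sketch(syllable):
    """Convert one numeric syllable such as 'yi1' to its accented form."""
    letters = syllable[:-1].lower().replace("v", "\u00fc")
    tone = int(syllable[-1])
    i = mark_index(letters)
    if tone not in TONE_MARKS or i < 0:
        return letters
    marked = letters[: i + 1] + TONE_MARKS[tone] + letters[i + 1 :]
    # NFC folds the base letter and combining mark into one code point.
    return unicodedata.normalize("NFC", marked)

def pinyin_sketch(text):
    """Convert every letters-plus-tone-digit run in text, e.g. 'pin1yin1'."""
    return re.sub(r"[A-Za-z]+[1-5]", lambda m: syllable_sketch(m.group()), text)

print(pinyin_sketch("pin1yin1"))
```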

Phonetics

class ltchinese.Phonetics

Represents the Phonetics table at http://lost-theory.org/chinese/phonetics

getsound_bpmf(syllable)
Returns the URL of the sound file for syllable (e.g. “yi1” or “yi”) for the zhuyin / BPMF pronunciation
getsound_female(syllable)
Returns the URL of the sound file for syllable (e.g. “yi1”) for the female voice
getsound_male(syllable)
Returns the URL of the sound file for syllable (e.g. “yi1”) for the male voice
getsounds(syllable)
Returns the URLs (male, female, bpmf) for the given syllable (e.g. “yi1”)
valid(syllable)
Returns True if the syllable is recognized in any of the phonetics tables
