ltchinese documentation
ltchinese - A library of utilities for the Chinese language (pinyin, zhuyin,
encodings, phonetics, etc.) from http://lost-theory.org.
Ocrat
Represents the data from the Ocrat Mirror Project (stroke order GIFs, etc.)
at http://lost-theory.org/ocrat.
-
ltchinese.ocrat.get_bigchar(uni_char)
- Returns the URL of the big character example of the Unicode character uni_char
-
ltchinese.ocrat.get_calig(uni_char)
- Returns the URL of the calligraphy example of the Unicode character uni_char
-
ltchinese.ocrat.get_sod(uni_char)
- Returns the URL of the stroke order diagram of the Unicode character uni_char
Conversion
Conversion functions between various Chinese encodings and representations.
The following terms name the encodings and representations used in the
conversion functions (the samples on the right are for the character
U+4E00 (yi1; “one”)):
GB2312 (Kuten/Quwei form):  "5027"              [used in the "GB2312" field of Unihan.txt]
GB2312 (ISO-2022 form):     "523B"              [the "internal representation" of GB code]
EUC-CN:                     "D2BB"              [the "external encoding" of GB2312-ISO2022's
                                                 "internal representation"; also the form
                                                 that Ocrat uses]
UTF-8:                      "E4 B8 80"          [used in the "UTF-8" field in Unihan.txt]
-------------------------------------------------------------------------------
Unihan UCN:                 "U+4E00"            [used by Unicode Inc.]
Unihan NCR (decimal):       "&#19968;"          [Numeric Character Reference ...
Unihan NCR (hex):           "&#x4E00;"           ... used in XML/HTML/SGML]
-------------------------------------------------------------------------------
internal Python unicode:    u"\u4e00"           [this is the most useful form!]
internal Python 'utf8':     "\xe4\xb8\x80"
internal Python 'gb2312':   "\xd2\xbb"
internal Python 'euc-cn':   "\xd2\xbb"
internal Python 'gb18030':  "\xd2\xbb"
- See these resources for more information:
- Wikipedia “Extended_Unix_Code” article
- “EUC-CN is the usual way to use the GB2312 standard for simplified Chinese
characters ... the ISO-2022 form of GB2312 is not normally used”
- Wikipedia “HZ_(encoding)” article (the example conversion)
- Wikipedia “Numeric_character_reference” article
- Unihan (look for “Encoding forms”, “Mappings to Major Standards”)
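The byte-level sample values for U+4E00 can be reproduced with Python's built-in codecs. The following is a minimal sketch using only the standard library, not ltchinese itself; the variable names are illustrative, and it is written for Python 3, where text strings are Unicode by default.

```python
import binascii

char = u"\u4e00"  # the sample character "one"

# Python's 'gb2312' codec produces the EUC-CN byte form of GB2312.
euc_bytes = char.encode("gb2312")
utf8_bytes = char.encode("utf-8")

# Hex strings comparable to the "D2BB" and "E4 B8 80" samples.
euc_hex = binascii.hexlify(euc_bytes).decode("ascii")
utf8_hex = binascii.hexlify(utf8_bytes).decode("ascii")

# Going the other way: EUC-CN hex back to a Unicode string.
decoded = binascii.unhexlify("d2bb").decode("gb2312")

print(euc_hex, utf8_hex, decoded == char)
```

GB18030 is a superset of GB2312, which is why the 'gb18030' sample bytes match the 'gb2312' ones for this character.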
-
ltchinese.conversion.euc_to_python(hexstr)
- Convert an EUC-CN (GB2312) hex string to a Python unicode string
-
ltchinese.conversion.euc_to_utf8(euchex)
- Convert EUC hex (e.g. “d2bb”) to UTF-8 hex (e.g. “e4 b8 80”)
-
ltchinese.conversion.gb2312_to_euc(gb2312hex)
- Convert GB2312-1980 hex (internal representation) to EUC-CN hex (the “external encoding”)
-
ltchinese.conversion.kuten_to_gb2312(kuten)
- Convert GB kuten / quwei form (94 zones * 94 points) to GB2312-1980 / ISO-2022-CN hex (internal representation)
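The relationship between the three GB2312 forms is simple arithmetic: each kuten zone/point value is offset by 0x20 to give the ISO-2022 form, and setting the high bit (0x80) of each byte gives EUC-CN. A minimal sketch under those assumptions follows; the helper names are hypothetical and are not the ltchinese functions, and the kuten input is assumed to be a four-digit string.

```python
def kuten_to_iso2022_hex(kuten):
    """Hypothetical helper: four-digit kuten string -> ISO-2022 form hex."""
    zone, point = int(kuten[:2]), int(kuten[2:])
    if not (1 <= zone <= 94 and 1 <= point <= 94):
        raise ValueError("zone and point must each be in 1..94")
    # Offset both values by 0x20 into the 0x21..0x7E range.
    return "%02X%02X" % (zone + 0x20, point + 0x20)


def iso2022_to_euc_hex(gbhex):
    """Hypothetical helper: ISO-2022 form hex -> EUC-CN hex."""
    high, low = int(gbhex[:2], 16), int(gbhex[2:], 16)
    # Setting the high bit of each byte gives the EUC-CN encoding.
    return "%02X%02X" % (high | 0x80, low | 0x80)


gb_hex = kuten_to_iso2022_hex("5027")
euc_hex = iso2022_to_euc_hex(gb_hex)
print(gb_hex, euc_hex)
```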
-
ltchinese.conversion.ncr_to_python(ncr)
- Convert a Unicode Numeric Character Reference (e.g. “19968”, “&#19968;”, or “&#x4E00;”) to native Python Unicode (u'\u4e00')
-
ltchinese.conversion.ncrstring_to_python(ncr_string)
- Convert a string of Unicode NCRs (e.g. “&#19968;&#x4E00;”) to native Python Unicode (u'\u4e00\u4e00')
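One way to decode a string containing decimal and hexadecimal NCRs is a single regular-expression substitution. This is only a sketch of the idea, not the library's implementation; `decode_ncrs` is a hypothetical name, and the code uses Python 3's `chr` (Python 2 would need `unichr`).

```python
import re

# Matches "&#x4E00;" (hex) or "&#19968;" (decimal).
_NCR = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_ncrs(text):
    """Hypothetical helper: replace every NCR in text with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return _NCR.sub(replace, text)


print(decode_ncrs("&#19968;&#x4E00;") == u"\u4e00\u4e00")
```

Text that is not an NCR passes through unchanged, so the same helper works on mixed strings.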
-
ltchinese.conversion.python_to_euc(uni_char)
- Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding EUC hex (‘d2bb’)
-
ltchinese.conversion.python_to_ncr(uni_char, **options)
Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode NCR (‘&#x4E00;’).
- Change the output format by passing the following parameters:
- no parameters - default behavior: hex=True, xml=True
- decimal=True - output the decimal value instead of hex
- hex=False - (same as decimal=True)
- xml=False - just display the decimal or hex value, i.e. strip off the ‘&#’, ‘&#x’, and ‘;’
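The option combinations above can be illustrated with a small stand-in function. This is a sketch, not the ltchinese code: `to_ncr` is a hypothetical name, the keyword names mirror the documented ones, and the uppercase hex digits are an assumption based on the ‘&#x4E00;’ sample.

```python
def to_ncr(uni_char, hex=True, decimal=False, xml=True):
    """Hypothetical stand-in mirroring the documented formatting options."""
    code = ord(uni_char)
    # hex=False is treated the same as decimal=True.
    use_decimal = decimal or not hex
    digits = str(code) if use_decimal else "%X" % code
    if not xml:
        return digits  # bare value, no '&#', '&#x', or ';'
    prefix = "&#" if use_decimal else "&#x"
    return prefix + digits + ";"


for options in ({}, {"decimal": True}, {"xml": False},
                {"decimal": True, "xml": False}):
    print(options, to_ncr(u"\u4e00", **options))
```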
-
ltchinese.conversion.python_to_ucn(uni_char)
- Converts a one-character Python unicode string (e.g. u'\u4e00') to the corresponding Unicode UCN (‘U+4E00’)
-
ltchinese.conversion.string_to_ncr(uni_string, **options)
- Converts a Python unicode string (e.g. u'p\u012bn y\u012bn') to the corresponding Unicode NCRs.
See python_to_ncr for formatting options.
-
ltchinese.conversion.ucn_to_python(ucn)
- Convert a Unicode Universal Character Number (e.g. “U+4E00” or “4E00”) to Python unicode (u'\u4e00')
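The UCN form is just the code point written as at least four uppercase hex digits after "U+", so the round trip can be sketched in a few lines. The helper names below are hypothetical and only illustrate the conversion; they are not the library's functions.

```python
def to_ucn(uni_char):
    """Hypothetical helper: one character -> 'U+XXXX'."""
    return "U+%04X" % ord(uni_char)


def from_ucn(ucn):
    """Hypothetical helper: 'U+4E00' or bare '4E00' -> one character."""
    text = ucn.strip()
    if text.upper().startswith("U+"):
        text = text[2:]
    return chr(int(text, 16))


print(to_ucn(u"\u4e00"), from_ucn("U+4E00") == from_ucn("4E00"))
```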
Annotation
Functions to annotate and romanize Chinese text (numeric pinyin such as pin1yin1, or hanzi) as Unicode pinyin and zhuyin.
-
ltchinese.annotate.lastvowel(s)
- Find the index of the last pinyin vowel (aeiouv) in string s.
-
ltchinese.annotate.pinyin(pinyin_text)
- Convert a numeric pinyin string (e.g. ‘pin1yin1’) to the proper Unicode characters
-
ltchinese.annotate.syllable_to_pinyin(syl)
- Returns the proper Unicode string for one pinyin syllable + tone number (e.g. ‘yi1’)
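Converting a numeric syllable to tone-marked pinyin mostly comes down to choosing which vowel carries the mark. The sketch below uses the common placement rule ('a' or 'e' takes the mark; in 'ou' the 'o' does; otherwise the last vowel) and treats 'v' as 'ü'. It is an illustration under those assumptions, handles lowercase input only, and does not claim to match how ltchinese implements it; `mark_syllable` and `mark_text` are hypothetical names.

```python
import re

# Tone-marked forms of each vowel, indexed by tone number minus one.
TONE_MARKS = {
    "a": u"āáǎà", "e": u"ēéěè", "i": u"īíǐì",
    "o": u"ōóǒò", "u": u"ūúǔù", u"ü": u"ǖǘǚǜ",
}


def mark_syllable(syl):
    """Hypothetical helper: 'yi1' -> 'yī', 'lv4' -> 'lǜ'."""
    if not syl or not syl[-1].isdigit():
        return syl.replace("v", u"ü")
    base, tone = syl[:-1].replace("v", u"ü"), int(syl[-1])
    if tone not in (1, 2, 3, 4):
        return base  # neutral tone: no mark
    if "a" in base:
        idx = base.index("a")
    elif "e" in base:
        idx = base.index("e")
    elif "ou" in base:
        idx = base.index("ou")
    else:
        # Fall back to the last vowel in the syllable.
        idx = max(base.rfind(v) for v in TONE_MARKS)
    if idx < 0:
        return base
    return base[:idx] + TONE_MARKS[base[idx]][tone - 1] + base[idx + 1:]


def mark_text(text):
    """Hypothetical helper: convert every numeric syllable in text."""
    return re.sub(u"[a-zü]+[0-5]", lambda m: mark_syllable(m.group(0)), text)


print(mark_text("pin1yin1"))
```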
-
ltchinese.annotate.syllable_to_zhuyin(syllable)
- Convert a single numeric pinyin syllable (e.g. “yi1”) to a Zhuyin unicode string
-
ltchinese.annotate.zh_annotate_pinyin(zhtext, encoding='utf-8')
- Returns a list of annotated characters (with possible pinyin romanizations) for the input Chinese text
-
ltchinese.annotate.zhuyin(pinyin_text)
- Convert a numeric pinyin string (e.g. ‘pin1yin1’) to a Zhuyin unicode string
Phonetics
-
class ltchinese.Phonetics
Represents the Phonetics table at http://lost-theory.org/chinese/phonetics
-
getsound_bpmf(syllable)
- Returns the URL of the sound file for syllable (e.g. “yi1” or “yi”) for the zhuyin / BPMF pronunciation
-
getsound_female(syllable)
- Returns the URL of the sound file for syllable (e.g. “yi1”) for the female voice
-
getsound_male(syllable)
- Returns the URL of the sound file for syllable (e.g. “yi1”) for the male voice
-
getsounds(syllable)
- Returns the URLs (male, female, bpmf) for the given syllable (e.g. “yi1”)
-
valid(syllable)
- Returns True if the syllable is recognized in any of the phonetics tables