Class CharacterLookup

CharacterLookup provides access to lookup methods related to Han characters.

The core of CharacterLookup is the underlying database, where all relevant data is stored. Nearly all methods of this class therefore need database access, so a connection is established when the object is initialised. The connection logic is provided by the DatabaseConnector.

See the DatabaseConnector for supported database systems.

CharacterLookup will try to read the config file from either /etc or the user's home folder. If none is present, it will try to open a SQLite database stored as cjklib.db in the same folder. You can override this behaviour by specifying additional parameters when creating the object.
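The lookup order described above can be sketched roughly as follows. This is only an illustration and not cjklib's actual code: the function name `resolve_database_source`, its parameters and the returned tuples are hypothetical, and only the file names cjklib.conf and cjklib.db are taken from this page.

```python
import os


def resolve_database_source(config_dirs, package_dir):
    """Return ('config', path) for the first cjklib.conf found in
    config_dirs, else ('sqlite', path) for cjklib.db in package_dir.

    Hypothetical helper: it mirrors the order described in the text and
    is not part of cjklib.
    """
    for directory in config_dirs:
        candidate = os.path.join(directory, 'cjklib.conf')
        if os.path.isfile(candidate):
            return ('config', candidate)
    # No config file found: fall back to the SQLite file next to the package.
    return ('sqlite', os.path.join(package_dir, 'cjklib.db'))
```

A caller might pass `['/etc', os.path.expanduser('~')]` as `config_dirs`. The exact file locations that cjklib checks are not spelled out on this page, so treat the directories as an assumption.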

Examples

The following examples should give a quick view into how to use this package.

Create the CharacterLookup object with default settings (read from cjklib.conf, or by default from 'cjklib.db' in the same directory):
```
>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup()
```

Get a list of characters that are pronounced "국" in Korean:
```
>>> cjk.getCharactersForReading(u'국', 'Hangul')
[u'匊', u'國', u'局', u'掬', u'菊', u'跼', u'鞠', u'鞫', u'麯', u'麴']
```

Check if a character is included in another character as a component:
```
>>> cjk.isComponentInCharacter(u'女', u'好')
True
```
Get all Kangxi radical variants for Radical 184 (⾷) under the traditional locale:
```
>>> cjk.getKangxiRadicalVariantForms(184, 'T')
[u'⻞', u'⻟']
```

Character locale

As characters developed in the different cultures, their appearance changed to such an extent that the handling of radicals, character components and strokes needs to be distinguished depending on the locale.

To deal with this circumstance CharacterLookup works with a character locale. Most of the methods of this class ask for a locale to be specified. In these cases the output of the method depends on the specified locale.

For example, in the traditional locale 这 has 8 strokes, but in simplified Chinese it has only 7, because the radical ⻌ has a different stroke count depending on the locale.

Z-variants

One feature of Chinese characters is the glyph form, which describes the visual representation. This form need not be unique, so many characters can be found in different written variants, e.g. the character 福 (English: luck), which has numerous forms.

The Unicode Consortium does not encode the same character in different actual shapes (called Z-variants) in the Unicode standard, except for a few "double" entries that are included to maintain backward compatibility. In fact, a code point represents an abstract character and does not define any visual representation. A distinct appearance description, including strokes and stroke order, therefore cannot simply be assigned to a code point. Instead, one needs to deal with the notion of Z-variants, which represent distinct glyphs to which a visual description can be applied.

The name Z-variant is derived from the three-dimensional model that represents the space of characters relative to three axes: the X axis for the semantic space, the Y axis for the abstract shape space, and the Z axis for typeface differences (see "Principles of Han Unification" in: The Unicode Standard 5.0, chapter 12). Character presentations that differ only in the Z dimension are generally unified.

cjklib tries to offer a simple approach to handling different Z-variants. As character components, strokes and the stroke order depend on the variant, methods dealing with these properties ask for a Z-variant value to be specified. In these cases the output of the method depends on the specified variant.

Z-variants and character locales

Deviating stroke counts, stroke orders or decompositions into character components for different character locales are implemented using different Z-variants. In the example given above, the form of 这 with 8 strokes is given as one Z-variant and the form with 7 strokes as another.

In most cases one might only be interested in a single visual appearance, the "standard" one. This visual appearance would be the one generally used in the specific locale.

Instead of specifying a particular Z-variant, most functions allow a character locale to be passed. The locale selects the default Z-variant from the mapping defined in the database; this default can be obtained by calling getLocaleDefaultZVariant().
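The relation between locale, default Z-variant and stroke count might be pictured as two small lookup tables. The sketch below is self-contained and does not query cjklib: the Z-variant indices 0 and 1, the locale code 'C' for simplified Chinese and the function name are assumptions made for illustration, while the stroke counts for 这 and the locale code 'T' are taken from this page.

```python
# Illustrative data only: the Z-variant indices and the locale code 'C'
# are assumptions; the stroke counts for 这 come from the text above.
STROKE_COUNT = {
    ('这', 0): 8,  # form used in the traditional locale
    ('这', 1): 7,  # form used in simplified Chinese
}
LOCALE_DEFAULT_Z_VARIANT = {
    ('这', 'T'): 0,
    ('这', 'C'): 1,
}


def get_stroke_count(char, locale=None, z_variant=0):
    """Look up a stroke count, letting a locale choose the Z-variant."""
    if locale is not None:
        # A given locale overrides the explicitly passed Z-variant.
        z_variant = LOCALE_DEFAULT_Z_VARIANT[(char, locale)]
    return STROKE_COUNT[(char, z_variant)]
```

Letting the locale take precedence over `z_variant` is a design choice of this sketch; the page does not say how the real methods resolve a call that passes both.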

More complex relations, such as which of several Z-variants of a given character are used in a given locale, are not covered.

Kangxi radical functions

Using the Unihan database, queries about the Kangxi radical of characters can be made. It is possible to get the Kangxi radical for a character or to look up all characters for a given radical.

Unicode has extra code points for radical forms (e.g. ⾔), here called Unicode radical forms, and radical variant forms (e.g. ⻈), here called Unicode radical variants. These characters should be used when explicitly referring to their function as radicals. For most of the radicals and variants there exist complementary character forms which have the same appearance (e.g. 言 and 讠) and which shall be called equivalent characters here.

Mapping from one side to the other is not trivial, as some forms exist only as radical forms and some only as character forms that, by their meaning, are used in the radical context (called isolated radical characters here, e.g. 訁 for Kangxi radical 149).

Additionally, a one-to-one mapping can't be guaranteed, as some forms have two or more equivalent forms in the other domain, and the mapping is highly dependent on the locale.

CharacterLookup provides methods for dealing with these different kinds of characters and the mapping between them.
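The distinction between radical forms and equivalent characters can be explored with nothing but Python's standard library. The sketch below relies on the layout of the Unicode blocks (Kangxi Radicals at U+2F00..U+2FD5, CJK Radicals Supplement at U+2E80..U+2EF3) and on Unicode's compatibility mappings, not on cjklib's database. The function names are hypothetical, and the locale dependence described above is ignored entirely.

```python
import unicodedata


def kangxi_radical_form(radical_index):
    """Return the code point from the Kangxi Radicals block (U+2F00..U+2FD5)
    for a radical index between 1 and 214."""
    if not 1 <= radical_index <= 214:
        raise ValueError('Kangxi radical index must be in 1..214')
    return chr(0x2F00 + radical_index - 1)


def is_radical_char(char):
    """True for code points in the Kangxi Radicals or CJK Radicals
    Supplement blocks."""
    code_point = ord(char)
    return 0x2F00 <= code_point <= 0x2FD5 or 0x2E80 <= code_point <= 0x2EF3


def compatibility_equivalent(radical_form):
    """Return the character that Unicode's compatibility mapping (NFKC)
    gives for a radical form; forms without a mapping come back unchanged."""
    return unicodedata.normalize('NFKC', radical_form)
```

For radical 149 this yields U+2F94 (⾔), which NFKC maps to 言. Most radical variants in the supplement block, such as ⻈, carry no compatibility mapping, which is one reason a dedicated, locale-aware mapping table is needed.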

Character decomposition

Many characters can be decomposed into two or more components that are themselves Chinese characters. This fact can be used in many ways, including character lookup, finding patterns for font design or studying characters. Even the stroke order and stroke count can be deduced from the stroke information of the character's components.
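As a small illustration of deducing a stroke count from components, the following sketch sums the counts of the parts. The table of stroke counts is hand-written sample data and the function name is hypothetical; cjklib itself reads such information from its database.

```python
# Illustrative stroke counts for a few components (not read from cjklib).
COMPONENT_STROKES = {'女': 3, '子': 3, '口': 3, '十': 2, '月': 4}


def stroke_count_from_components(components):
    """Sum the stroke counts of the components; None if any is unknown."""
    total = 0
    for component in components:
        if component not in COMPONENT_STROKES:
            return None
        total += COMPONENT_STROKES[component]
    return total
```

For 好, decomposed into 女 and 子, the sum is 6. Returning None for unknown components keeps a partial sum from being mistaken for a real stroke count.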

Character decomposition is highly dependent on the appearance of the character, so both Z-variant and character locale need to be clear when looking at a decomposition into components.

Several points make this task more complex: the decomposition into a set of components is not unique, as some characters can be broken down into different sets. Furthermore, sometimes one component can be given while the other component is not encoded as a character in its own right.

These components again might be characters that contain further components (again not distinct ones), thus a complex decomposition in several steps is possible.
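A multi-step decomposition with alternative component sets can be modelled as a mapping from a character to a list of alternatives. The following sketch searches such a table recursively. The table contents and the function name are illustrative assumptions; Z-variants and locales are left out to keep the example short.

```python
# Hypothetical decomposition table: each character maps to a list of
# alternative decompositions, each of which is a list of components.
DECOMPOSITIONS = {
    '湖': [['氵', '胡'], ['氵', '古', '月']],
    '胡': [['古', '月']],
    '古': [['十', '口']],
    '好': [['女', '子']],
}


def is_component_in_character(component, char):
    """Return True if component occurs anywhere below char in the
    decomposition table, following every alternative."""
    for alternative in DECOMPOSITIONS.get(char, []):
        for part in alternative:
            if part == component or is_component_in_character(component, part):
                return True
    return False
```

Here 口 is found in 湖 only after descending through 胡 or 古, which shows why a single lookup of the direct components is not enough.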

The basis for the character decomposition lies in the database, where all decompositions are stored using Ideographic Description Sequences (IDS). These sequences consist of Unicode IDS operators and characters that describe the structure of the character. There are binary IDS operators, which describe a decomposition into two components (e.g. ⿰ for one component left, one right as in 好: ⿰女子), and trinary IDS operators for a decomposition into three components (e.g. ⿲ for three components from left to right as in 辨: ⿲⾟刂⾟). IDS operators make it possible to give basic structural information, which in many cases is enough, for example, to derive an overall stroke order from two separate stroke orders. Furthermore, they make it possible to look for redundant information in different entries and thus help to keep the definition data clean.
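Because each operator has a fixed number of operands, an IDS can be parsed with a short recursive routine. The sketch below is an assumption about how such a parser could look and is not cjklib's implementation. The operator lists are built from the Unicode Ideographic Description Characters block (U+2FF0..U+2FFB) rather than copied from the class variables, and `parse_ids` is a hypothetical name.

```python
# Binary operators: U+2FF0, U+2FF1 and U+2FF4..U+2FFB; trinary: U+2FF2, U+2FF3.
IDS_BINARY = [chr(cp) for cp in (0x2FF0, 0x2FF1) + tuple(range(0x2FF4, 0x2FFC))]
IDS_TRINARY = [chr(0x2FF2), chr(0x2FF3)]


def parse_ids(sequence):
    """Parse an IDS string into nested (operator, [children]) tuples.

    Plain characters are returned as one-character strings.
    """
    def parse(position):
        if position >= len(sequence):
            raise ValueError('IDS ends before all operands are given')
        char = sequence[position]
        if char in IDS_BINARY:
            arity = 2
        elif char in IDS_TRINARY:
            arity = 3
        else:
            # Not an operator: a leaf component.
            return char, position + 1
        children = []
        position += 1
        for _ in range(arity):
            child, position = parse(position)
            children.append(child)
        return (char, children), position

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError('trailing characters after a complete IDS')
    return tree
```

Parsing the sequence for 好 gives the operator ⿰ with the two children 女 and 子. Nested sequences produce nested tuples, which is the kind of tree structure the decomposition methods of this class are described as returning.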

This class provides methods for retrieving the basic partition entries, lookup of characters by components and decomposing as a tree from the character as a root down to the minimal components as leaf nodes.

TODO: Policy about what to classify as partition.

Strokes

Chinese characters consist of different strokes as basic parts. These strokes are written in a mostly distinct order called the stroke order and have a distinct stroke count.

The stroke order in the writing of Chinese characters is important, e.g. for calligraphy or for students learning new characters, and is normally fixed, as there is only one possible stroke order for each character. Furthermore, there is a fixed set of possible strokes, and these strokes carry names.

As with character decomposition, the stroke order and stroke count are highly dependent on the appearance of the character, so both Z-variant and character locale need to be known.

Furthermore, the order of strokes can be useful for looking up characters, so CharacterLookup provides methods for getting the stroke count and stroke order, for looking up stroke names, and for looking up characters by stroke types and stroke order.

Most methods work with an abbreviation of stroke names using the first letters of each syllable of the Chinese name in Pinyin.
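The abbreviation scheme can be illustrated in a few lines. The helper below is hypothetical: it assumes the Pinyin name is given without tone marks, with syllables separated by spaces, and that abbreviations are written in upper case. None of these conventions is confirmed by this page.

```python
def abbreviate_stroke_name(pinyin_name):
    """Build an abbreviation from the first letter of each Pinyin syllable.

    Expects syllables separated by spaces, e.g. 'heng zhe gou'.
    """
    return ''.join(syllable[0].upper() for syllable in pinyin_name.split())
```

Under these assumptions the stroke named "heng zhe gou" is abbreviated as HZG and the stroke "shu" as S.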

The stroke order is not always clear, and even academics argue about which order should be considered the correct one, a discussion that shouldn't be taken lightly. This should be kept in mind when working with stroke orders.

TODO: About plans of cjklib how to support different views on the stroke order

TODO: About the different classifications of strokes

Readings

See module reading for a detailed description.

See Also:

Radicals: http://en.wikipedia.org/wiki/Radical_(Chinese_character)
Z-variants: http://www.unicode.org/reports/tr38/tr38-5.html#N10211

To Do (Lang): Add option to component decomposition methods to stop on Kangxi radical forms without breaking further down beyond those.

To Do (Fix):

Incorporate stroke lookup (bigram) techniques
How to handle character forms (either decomposition or stroke order) that can only be found as components of other characters? They are already marked with an 'S' flag.

To Do (Impl):

Think about applying locale at object creation time and not passing it on every method call. Would make the class easier to use.
Create a method for specifying which character range is of interest for the return values of methods. Narrowing the return results is a further way to get locale-dependent responses. E.g. cjknife could take this into account by only displaying characters that can be displayed with the current locale (BIG5, GBK...).

Instance Methods

__init__(self, databaseUrl=None, dbConnectInst=None)
Initialises the CharacterLookup.

instance

_getReadingFactory(self)
Gets the ReadingFactory instance.

str

_locale(self, locale)
Gets the locale search value for a database lookup on databases with character locale dependent content.

Character reading lookup

list of str

getCharactersForReading(self, readingString, readingN, **options)
Gets all known characters for the given reading.

str

getReadingForCharacter(self, char, readingN, **options)
Gets all known readings for the character in the given target reading.

str

_getCompatibleCharacterReading(self, readingN, toCharReading=True)
Gets a reading for which a mapping from or to Chinese characters is supported and that is compatible with the given reading (i.e. a conversion is supported).

Character variant lookup

list of str

getCharacterVariants(self, char, variantType)
Gets the variant forms of the given type for the character.

list of tuple

getAllCharacterVariants(self, char)
Gets all variant forms regardless of the type for the character.

int

getLocaleDefaultZVariant(self, char, locale)
Gets the default Z-variant for the given character under the given locale.

list of int

getCharacterZVariants(self, char)
Gets a list of character Z-variant indices (glyphs) supported by the database.

Character stroke functions

int

getStrokeCount(self, char, locale=None, zVariant=0)
Gets the stroke count for the given character.

dict

getStrokeCountDict(self)
Gets the full stroke count table from the database.

str

getStrokeForAbbrev(self, abbrev)
Gets the stroke form for the given abbreviated name.

str

getStrokeForName(self, name)
Gets the stroke form for the given name.

str

getStrokeOrder(self, char, locale=None, zVariant=0)
Gets the stroke order sequence for the given character.

Character radical functions

int

getCharacterKangxiRadicalIndex(self, char)
Gets the Kangxi radical index for the given character as defined by the Unihan database.

list of tuple

getCharacterKangxiRadicalResidualStrokeCount(self, char, locale=None, zVariant=0)
Gets the Kangxi radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

list of tuple

getCharacterRadicalResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)
Gets the radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

dict

getCharacterRadicalResidualStrokeCountDict(self)
Gets the full table of radical forms (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components from the database.

int

getCharacterKangxiResidualStrokeCount(self, char, locale=None, zVariant=0)
Gets the stroke count of the residual character components when leaving aside the radical form.

int

getCharacterResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)
Gets the stroke count of the residual character components when leaving aside the radical form.

dict

getCharacterResidualStrokeCountDict(self)
Gets the full table of stroke counts of the residual character components from the database.

list of str

getCharactersForKangxiRadicalIndex(self, radicalIndex)
Gets all characters for the given Kangxi radical index.

list of str

getCharactersForRadicalIndex(self, radicalIndex)
Gets all characters for the given radical index.

list of tuple

getResidualStrokeCountForKangxiRadicalIndex(self, radicalIndex)
Gets all characters and residual stroke count for the given Kangxi radical index.

list of tuple

getResidualStrokeCountForRadicalIndex(self, radicalIndex)
Gets all characters and residual stroke count for the given radical index.

Radical form functions

str

getKangxiRadicalForm(self, radicalIdx, locale)
Gets a Unicode radical form for the given Kangxi radical index.

list of str

getKangxiRadicalVariantForms(self, radicalIdx, locale)
Gets a list of Unicode radical variants for the given Kangxi radical index.

int

getKangxiRadicalIndex(self, radicalForm, locale=None)
Gets the Kangxi radical index for the given form.

list of str

getKangxiRadicalRepresentativeCharacters(self, radicalIdx, locale)
Gets a list of characters that represent the radical for the given Kangxi radical index.

bool

isKangxiRadicalFormOrEquivalent(self, form, locale=None)
Checks if the given form is a Kangxi radical form or a radical equivalent.

bool

isRadicalChar(self, char)
Checks if the given character is a Unicode radical form or Unicode radical variant.

str

getRadicalFormEquivalentCharacter(self, radicalForm, locale)
Gets the equivalent character of the given Unicode radical form or Unicode radical variant.

list of str

getCharacterEquivalentRadicalForms(self, equivalentForm, locale)
Gets Unicode radical forms or Unicode radical variants for the given equivalent character.

Character component functions

list of tuple

getCharactersForComponents(self, componentList, locale, includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False)
Gets all characters that contain the given components.

list of tuple

getCharactersForEquivalentComponents(self, componentConstruct, locale=None, resultIncludeRadicalForms=False)
Gets all characters that contain at least one component per list entry, sorted by stroke count if available.

list

getDecompositionEntries(self, char, locale=None, zVariant=0)
Gets the decomposition of the given character into components from the database.

dict

getDecompositionEntriesDict(self)
Gets the full decomposition table from the database.

list

_getDecompositionFromString(self, decomposition)
Gets a tuple representation with character/Z-variant of the given character's decomposition into components.

list

getDecompositionTreeList(self, char, locale=None, zVariant=0)
Gets the decomposition of the given character into components as a list of decomposition trees.

bool

isComponentInCharacter(self, component, char, locale=None, zVariant=0, componentZVariant=None)
Checks if the given component is contained in the given character.

Class Methods

Character component functions

bool

isBinaryIDSOperator(cls, char)
Checks if given character is a binary IDS operator.

bool

isTrinaryIDSOperator(cls, char)
Checks if given character is a trinary IDS operator.

bool

isIDSOperator(cls, char)
Checks if given character is an IDS operator.

Class Variables

CHARARACTER_READING_MAPPING = {'Hangul': ('CharacterHangul', {...
A list of readings for which a character mapping exists including the database's table name and the reading dialect parameters.

Character stroke functions

_strokeLookup = None
A dictionary containing stroke forms for stroke abbreviations.

Character component functions

IDS_BINARY = [u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', ...
A list of binary IDS operators used to describe character decompositions.

IDS_TRINARY = [u'⿲', u'⿳']
A list of trinary IDS operators used to describe character decompositions.
