Package cjklib :: Module characterlookup :: Class CharacterLookup

Class CharacterLookup

CharacterLookup provides access to lookup methods related to Han characters.

The core of CharacterLookup lies in the underlying database, where all relevant data is stored, so nearly all methods of this class need access to a database. On initialisation of the object a connection to a database is therefore established; the logic for this is provided by the DatabaseConnector.

See the DatabaseConnector for supported database systems.

CharacterLookup will try to read the config file from either /etc or the user's home folder. If none is present, it will by default try to open a SQLite database stored as db in the same folder. You can override this behaviour by specifying additional parameters on creation of the object.

Examples

The following examples should give a quick view into how to use this package.

Character locale

As characters developed in the different cultures, their appearance changed over time to such an extent that the handling of radicals, character components and strokes needs to be distinguished depending on the locale.

To deal with this circumstance CharacterLookup works with a character locale. Most of the methods of this class ask for a locale to be specified. In these cases the output of the method depends on the specified locale.

For example in the traditional locale 这 has 8 strokes, but in simplified Chinese it has only 7, as the radical ⻌ has different stroke counts, depending on the locale.

Z-variants

One feature of Chinese characters is the glyph form describing the visual representation. This form need not be unique, so many characters are found in different written variants, e.g. the character 福 (English: luck), which has numerous forms.

The Unicode Consortium does not separately encode identical characters of different actual shape (called Z-variants) in the Unicode standard, except for a few "double" entries which are included to maintain backward compatibility. In fact a code point represents an abstract character and does not define any visual representation. Thus a distinct description of appearance, including strokes and stroke order, cannot simply be assigned to a code point; instead one needs to deal with the notion of Z-variants representing distinct glyphs to which a visual description can be applied.

The name Z-variant is derived from the three-dimensional model representing the space of characters relative to three axes: the X axis representing the semantic space, the Y axis representing the abstract shape space and finally the Z axis for typeface differences (see "Principles of Han Unification" in: The Unicode Standard 5.0, chapter 12). Character presentations differing only in the Z dimension are generally unified.

cjklib tries to offer a simple approach to handling different Z-variants. As character components, strokes and the stroke order depend on this variant, methods dealing with these properties will ask for a Z-variant value to be specified. In these cases the output of the method depends on the specified variant.

Z-variants and character locales

Differing stroke count, stroke order or decomposition into character components for different character locales is implemented using different Z-variants. For the example given above, the entry 这 with 8 strokes is given as one Z-variant and the form with 7 strokes is given as another Z-variant.

In most cases one might only be interested in a single visual appearance, the "standard" one. This visual appearance would be the one generally used in the specific locale.

Instead of specifying a certain Z-variant, most functions allow a character locale to be passed. Giving the locale applies the default Z-variant given by the mapping defined in the database, which can be obtained by calling getLocaleDefaultZVariant().

More complex relations, such as which of several Z-variants of a given character are used in a given locale, are not covered.
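
The interplay of locale, default Z-variant and stroke count can be pictured with a small self-contained sketch. The tables, the Z-variant indices and the function name below are invented for illustration and are not cjklib's data or API; only the two stroke counts for 这 come from the example above, and the locale codes 'T' and 'C' are assumed to stand for traditional and simplified Chinese.

```python
# Toy data: which Z-variant index belongs to which glyph is an assumption.
STROKE_COUNT = {
    (u'这', 0): 8,   # glyph assumed to be the default in the traditional locale
    (u'这', 1): 7,   # glyph assumed to be the default in simplified Chinese
}
LOCALE_DEFAULT_Z_VARIANT = {
    (u'这', 'T'): 0,
    (u'这', 'C'): 1,
}
VALID_LOCALES = 'TCJKV'


def stroke_count(char, locale=None, z_variant=0):
    """Look up a stroke count; a given locale overrides z_variant."""
    if locale is not None:
        if len(locale) != 1 or locale not in VALID_LOCALES:
            raise ValueError("invalid character locale %r" % locale)
        # The locale selects its default Z-variant; z_variant is ignored.
        z_variant = LOCALE_DEFAULT_Z_VARIANT[(char, locale)]
    return STROKE_COUNT[(char, z_variant)]


print(stroke_count(u'这', locale='T'))
print(stroke_count(u'这', locale='C'))
```

Passing a locale here deliberately ignores any explicit Z-variant, mirroring the behaviour described for the locale parameter of the stroke methods.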

Kangxi radical functions

Using the Unihan database, queries about the Kangxi radical of characters can be made. It is possible to get the Kangxi radical for a character or to look up all characters for a given radical.

Unicode has extra code points for radical forms (e.g. ⾔), here called Unicode radical forms, and radical variant forms (e.g. ⻈), here called Unicode radical variants. These characters should be used when explicitly referring to their function as radicals. For most of the radicals and variants there exist complementary character forms which have the same appearance (e.g. 言 and 讠) and which shall be called equivalent characters here.

Mapping from one side to the other is not trivially possible, as some forms exist only as radical forms and some only as character forms, the latter taking their meaning from their use in the radical context (called isolated radical characters here, e.g. 訁 for Kangxi radical 149).

Additionally, a one-to-one mapping can't be guaranteed, as some forms have two or more equivalent forms in the other domain, and the mapping is highly dependent on the locale.

CharacterLookup provides methods for dealing with these different kinds of characters and the mapping between them.
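
The distinction between radical code points and their equivalent characters can be sketched with nothing but Unicode block ranges. This is an illustration under stated assumptions, not the class's implementation: the function names and the tiny mapping table are hypothetical, and the real mapping is stored in the database and depends on the locale.

```python
# Assigned code points of the two Unicode radical blocks.
KANGXI_RADICALS = range(0x2F00, 0x2FD6)          # U+2F00..U+2FD5
CJK_RADICALS_SUPPLEMENT = range(0x2E80, 0x2EF4)  # U+2E80..U+2EF3 (U+2E9A is unassigned)

# Illustrative excerpt only, using the two pairs named in the text above.
EQUIVALENT_CHARACTER = {
    u'\u2f94': u'\u8a00',  # Unicode radical form ⾔ -> character 言
    u'\u2ec8': u'\u8ba0',  # Unicode radical variant ⻈ -> character 讠
}


def is_radical_char(char):
    """True if char lies in one of the two Unicode radical blocks."""
    code = ord(char)
    return code in KANGXI_RADICALS or code in CJK_RADICALS_SUPPLEMENT


def kangxi_index_of_radical_form(char):
    """Kangxi radical index (1 to 214) of a code point in the Kangxi Radicals block."""
    if ord(char) not in KANGXI_RADICALS:
        raise ValueError("not a Unicode radical form")
    return ord(char) - 0x2F00 + 1


print(is_radical_char(u'\u2f94'), is_radical_char(u'\u8a00'))
print(kangxi_index_of_radical_form(u'\u2f94'))
```

The index computation relies on the Kangxi Radicals block listing the 214 radicals in order; radical variants in the supplement block carry no such positional information, which is one reason a lookup table is needed.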

Character decomposition

Many characters can be decomposed into two or more components that are themselves Chinese characters. This fact can be used in many ways, including character lookup, finding patterns for font design or studying characters. Even the stroke order and stroke count can be deduced from the stroke information of the character's components.

Character decomposition is highly dependent on the appearance of the character, so both Z-variant and character locale need to be clear when looking at a decomposition into components.

Further points make this task more complex: the decomposition into a set of components is not unique, as some characters can be broken down into different sets. Furthermore, sometimes one component can be given while the other is not encoded as a character in its own right.

These components might in turn be characters that contain further components (again not uniquely), so a complex decomposition in several steps is possible.

The basis for the character decomposition lies in the database, where all decompositions are stored using Ideographic Description Sequences (IDS). These sequences consist of Unicode IDS operators and characters to describe the structure of the character. There are binary IDS operators to describe decomposition into two components (e.g. ⿰ for one component left, one right as in 好: ⿰女子) and trinary IDS operators for decomposition into three components (e.g. ⿲ for three components from left to right as in 辨: ⿲⾟刂⾟). Using IDS operators it is possible to give basic structural information, which in many cases is enough, for example, to derive an overall stroke order from two single sets of stroke orders. Furthermore, it is possible to look for redundant information in different entries, which helps to keep the definition data clean.

This class provides methods for retrieving the basic partition entries, lookup of characters by components and decomposing as a tree from the character as a root down to the minimal components as leaf nodes.
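
How an IDS string unfolds into a tree can be shown with a short recursive parser. This is a minimal sketch and not the class's own decomposition code: the function name and the tuple layout of the tree are assumptions, and Z-variant annotations are left out entirely.

```python
# Trinary operators take three operands, all other IDS operators take two.
IDS_TRINARY = {u'\u2ff2', u'\u2ff3'}                                # ⿲ ⿳
IDS_BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - IDS_TRINARY  # U+2FF0..U+2FFB


def parse_ids(sequence):
    """Parse an IDS string into a nested (operator, [children]) tree."""
    def parse(pos):
        char = sequence[pos]
        if char in IDS_BINARY:
            arity = 2
        elif char in IDS_TRINARY:
            arity = 3
        else:
            return char, pos + 1      # leaf: an ordinary component
        children = []
        pos += 1
        for _ in range(arity):
            child, pos = parse(pos)   # operands may themselves be IDS
            children.append(child)
        return (char, children), pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after complete IDS")
    return tree


print(parse_ids(u'\u2ff0\u5973\u5b50'))              # 好: ⿰女子
print(parse_ids(u'\u2ff1\u6728\u2ff0\u6728\u6728'))  # nested: ⿱木⿰木木
```

A truncated sequence raises IndexError in this sketch; a real implementation would likely report malformed entries more gracefully.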

TODO: Policy about what to classify as partition.

Strokes

Chinese characters consist of different strokes as basic parts. These strokes are written in a mostly distinct order called the stroke order and have a distinct stroke count.

The stroke order in the writing of Chinese characters is important, e.g. for calligraphy or for students learning new characters, and is normally fixed, as there is only one possible stroke order for each character. Furthermore, there is a fixed set of possible strokes, and these strokes carry names.

As with character decomposition, the stroke order and stroke count are highly dependent on the appearance of the character, so both Z-variant and character locale need to be known.

Furthermore, the order of strokes can be useful for lookup of characters, so CharacterLookup provides different methods for getting the stroke count and stroke order, for lookup of stroke names and for lookup of characters by stroke types and stroke order.

Most methods work with an abbreviation of stroke names using the first letters of each syllable of the Chinese name in Pinyin.
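
The abbreviation rule can be illustrated in a few lines. The helper below is hypothetical and takes toneless Pinyin syllables as input; it only demonstrates the naming scheme and does not consult the stroke table in the database.

```python
def abbreviate_stroke_name(pinyin_syllables):
    """Build an abbreviation from the first letter of each Pinyin syllable."""
    return ''.join(syllable[0].upper() for syllable in pinyin_syllables)


# 横折 is read "héng zhé", which gives the abbreviation used as an example
# for getStrokeForAbbrev().
print(abbreviate_stroke_name(['heng', 'zhe']))
```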

The stroke order is not always clear, and even academics disagree about which order should be considered the correct one, a discussion that shouldn't be taken lightly. This circumstance should be considered when working with stroke orders.

TODO: About plans of cjklib how to support different views on the stroke order

TODO: About the different classifications of strokes

Readings

See module reading for a detailed description.


To Do (Lang): Add option to component decomposition methods to stop on Kangxi radical forms without breaking further down beyond those.

Instance Methods [hide private]
 
__init__(self, databaseUrl=None, dbConnectInst=None)
Initialises the CharacterLookup.
instance
_getReadingFactory(self)
Gets the ReadingFactory instance.
str
_locale(self, locale)
Gets the locale search value for a database lookup on databases with character locale dependent content.
    Character reading lookup
list of str
getCharactersForReading(self, readingString, readingN, **options)
Gets all known characters for the given reading.
str
getReadingForCharacter(self, char, readingN, **options)
Gets all known readings for the character in the given target reading.
str
_getCompatibleCharacterReading(self, readingN, toCharReading=True)
Gets a reading for which a mapping to or from Chinese characters is supported and that is compatible with the given reading (i.e. a conversion is supported).
    Character variant lookup
list of str
getCharacterVariants(self, char, variantType)
Gets the variant forms of the given type for the character.
list of tuple
getAllCharacterVariants(self, char)
Gets all variant forms regardless of the type for the character.
int
getLocaleDefaultZVariant(self, char, locale)
Gets the default Z-variant for the given character under the given locale.
list of int
getCharacterZVariants(self, char)
Gets a list of character Z-variant indices (glyphs) supported by the database.
    Character stroke functions
int
getStrokeCount(self, char, locale=None, zVariant=0)
Gets the stroke count for the given character.
dict
getStrokeCountDict(self)
Gets the full stroke count table from the database.
str
getStrokeForAbbrev(self, abbrev)
Gets the stroke form for the given abbreviated name (e.g. 'HZ').
str
getStrokeForName(self, name)
Gets the stroke form for the given name (e.g. '横折').
str
getStrokeOrder(self, char, locale=None, zVariant=0)
Gets the stroke order sequence for the given character.
    Character radical functions
int
getCharacterKangxiRadicalIndex(self, char)
Gets the Kangxi radical index for the given character as defined by the Unihan database.
list of tuple
getCharacterKangxiRadicalResidualStrokeCount(self, char, locale=None, zVariant=0)
Gets the Kangxi radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.
list of tuple
getCharacterRadicalResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)
Gets the radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.
dict
getCharacterRadicalResidualStrokeCountDict(self)
Gets the full table of radical forms (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components from the database.
int
getCharacterKangxiResidualStrokeCount(self, char, locale=None, zVariant=0)
Gets the stroke count of the residual character components when leaving aside the radical form.
int
getCharacterResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)
Gets the stroke count of the residual character components when leaving aside the radical form.
dict
getCharacterResidualStrokeCountDict(self)
Gets the full table of stroke counts of the residual character components from the database.
list of str
getCharactersForKangxiRadicalIndex(self, radicalIndex)
Gets all characters for the given Kangxi radical index.
list of str
getCharactersForRadicalIndex(self, radicalIndex)
Gets all characters for the given radical index.
list of tuple
getResidualStrokeCountForKangxiRadicalIndex(self, radicalIndex)
Gets all characters and residual stroke count for the given Kangxi radical index.
list of tuple
getResidualStrokeCountForRadicalIndex(self, radicalIndex)
Gets all characters and residual stroke count for the given radical index.
    Radical form functions
str
getKangxiRadicalForm(self, radicalIdx, locale)
Gets a Unicode radical form for the given Kangxi radical index.
list of str
getKangxiRadicalVariantForms(self, radicalIdx, locale)
Gets a list of Unicode radical variants for the given Kangxi radical index.
int
getKangxiRadicalIndex(self, radicalForm, locale=None)
Gets the Kangxi radical index for the given form.
list of str
getKangxiRadicalRepresentativeCharacters(self, radicalIdx, locale)
Gets a list of characters that represent the radical for the given Kangxi radical index.
bool
isKangxiRadicalFormOrEquivalent(self, form, locale=None)
Checks if the given form is a Kangxi radical form or a radical equivalent.
bool
isRadicalChar(self, char)
Checks if the given character is a Unicode radical form or Unicode radical variant.
str
getRadicalFormEquivalentCharacter(self, radicalForm, locale)
Gets the equivalent character of the given Unicode radical form or Unicode radical variant.
list of str
getCharacterEquivalentRadicalForms(self, equivalentForm, locale)
Gets Unicode radical forms or Unicode radical variants for the given equivalent character.
    Character component functions
list of tuple
getCharactersForComponents(self, componentList, locale, includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False)
Gets all characters that contain the given components.
list of tuple
getCharactersForEquivalentComponents(self, componentConstruct, locale=None, resultIncludeRadicalForms=False)
Gets all characters that contain at least one component per list entry, sorted by stroke count if available.
list
getDecompositionEntries(self, char, locale=None, zVariant=0)
Gets the decomposition of the given character into components from the database.
dict
getDecompositionEntriesDict(self)
Gets the full decomposition table from the database.
list
_getDecompositionFromString(self, decomposition)
Gets a tuple representation with character/Z-variant of the given character's decomposition into components.
list
getDecompositionTreeList(self, char, locale=None, zVariant=0)
Gets the decomposition of the given character into components as a list of decomposition trees.
bool
isComponentInCharacter(self, component, char, locale=None, zVariant=0, componentZVariant=None)
Checks if the given character contains the second character as a component.
Class Methods [hide private]
    Character component functions
bool
isBinaryIDSOperator(cls, char)
Checks if given character is a binary IDS operator.
bool
isTrinaryIDSOperator(cls, char)
Checks if given character is a trinary IDS operator.
bool
isIDSOperator(cls, char)
Checks if given character is an IDS operator.
Class Variables [hide private]
  CHARARACTER_READING_MAPPING = {'Hangul': ('CharacterHangul', {...
A list of readings for which a character mapping exists including the database's table name and the reading dialect parameters.
    Character stroke functions
  _strokeLookup = None
A dictionary containing stroke forms for stroke abbreviations.
    Character component functions
  IDS_BINARY = [u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', ...
A list of binary IDS operators used to describe character decompositions.
  IDS_TRINARY = [u'⿲', u'⿳']
A list of trinary IDS operators used to describe character decompositions.
Method Details [hide private]

__init__(self, databaseUrl=None, dbConnectInst=None)
(Constructor)

Initialises the CharacterLookup.

If no parameters are given, default values are assumed for the connection to the database. The database connection parameters can be given in databaseUrl, or an instance of DatabaseConnector can be passed in dbConnectInst; the latter takes precedence if both are specified.

Parameters:
  • databaseUrl (str) - database connection setting in the format driver://user:pass@host/database.
  • dbConnectInst (instance) - instance of a DatabaseConnector
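
The shape of the databaseUrl string and the stated precedence between the two parameters can be sketched with the standard library alone. The function names below are hypothetical and the code does not open any connection; it only shows how a string in the documented driver://user:pass@host/database format breaks down and which argument would win.

```python
from urllib.parse import urlsplit


def describe_database_url(database_url):
    """Split a driver://user:pass@host/database URL into its parts."""
    parts = urlsplit(database_url)
    return {
        'driver': parts.scheme,
        'user': parts.username,
        'password': parts.password,
        'host': parts.hostname,
        'database': parts.path.lstrip('/'),
    }


def choose_connection(database_url=None, db_connect_inst=None):
    """Mirror the documented precedence: an instance wins over a URL."""
    if db_connect_inst is not None:
        return ('instance', db_connect_inst)
    if database_url is not None:
        return ('url', describe_database_url(database_url))
    return ('default', None)   # fall back to config file or default database


print(describe_database_url('driver://user:pass@host/database'))
```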

_getReadingFactory(self)

Gets the ReadingFactory instance.

Returns: instance
a ReadingFactory instance.

getCharactersForReading(self, readingString, readingN, **options)

Gets all known characters for the given reading.

Parameters:
  • readingString (str) - reading string for lookup
  • readingN (str) - name of reading
  • options - additional options for handling the reading input
Returns: list of str
list of characters for the given reading
Raises:
  • UnsupportedError - if no mapping between characters and target reading exists.
  • ConversionError - if conversion from the internal source reading to the given target reading fails.

getReadingForCharacter(self, char, readingN, **options)

Gets all known readings for the character in the given target reading.

Parameters:
  • char (str) - Chinese character for lookup
  • readingN (str) - name of target reading
  • options - additional options for handling the reading output
Returns: str
list of readings for the given character
Raises:
  • UnsupportedError - if no mapping between characters and target reading exists.
  • ConversionError - if conversion from the internal source reading to the given target reading fails.

_getCompatibleCharacterReading(self, readingN, toCharReading=True)

Gets a reading for which a mapping to or from Chinese characters is supported and that is compatible with the given reading (i.e. a conversion is supported).

Parameters:
  • readingN (str) - name of reading
  • toCharReading (bool) - True if conversion is done in direction to the given reading, False otherwise
Returns: str
a reading that is compatible to the given one and where character lookup is supported
Raises:
  • UnsupportedError - if no mapping between characters and target reading exists.

_locale(self, locale)

Gets the locale search value for a database lookup on databases with character locale dependent content.

Parameters:
  • locale (str) - character locale (one out of TCJKV)
Returns: str
search locale used for SQL select
Raises:
  • ValueError - if invalid character locale specified
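
One way such a search value could be built is shown below. This is a guess at the mechanism, not the actual implementation: it assumes that a table column lists all locales a row applies to as a string such as 'TCJ' and that the returned value is used in an SQL LIKE comparison, which would also be consistent with the full table scan mentioned in the note below. The function name is hypothetical.

```python
def locale_search_value(locale):
    """Validate a locale and build a pattern for an SQL LIKE comparison."""
    if len(locale) != 1 or locale not in 'TCJKV':
        raise ValueError("'%s' is not a valid character locale" % locale)
    # Assumed storage scheme: a column holding every applicable locale letter.
    return '%' + locale + '%'


print(locale_search_value('T'))
```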

To Do (Fix): This probably requires a full table scan

getCharacterVariants(self, char, variantType)

Gets the variant forms of the given type for the character.

The type can be one out of:

  • C, compatible character form (if character was added to Unicode to maintain compatibility and round-trip convertibility)
  • M, semantic variant forms, which are often used interchangeably instead of the character.
  • P, specialised semantic variant forms, which are often used interchangeably instead of the character but limited to certain contexts.
  • Z, Z-variant forms, which only differ in typeface (and would have been unified were it not for the need to maintain round-trip convertibility)
  • S, simplified Chinese character forms, originating from the character simplification process of the PR China.
  • T, traditional character forms for a simplified Chinese character.

Variants depend on the locale, which is not taken into account here. Thus some of the returned characters might only be variants under some locales.

Parameters:
  • char (str) - Chinese character
  • variantType (str) - type of variant(s) to be returned
Returns: list of str
list of character variant(s) of given type

To Do (Lang): What is the difference between Z-variants and compatible variants? Some links between two characters are bidirectional, some not. Is there any rule?

To Do (Impl): Give a source on variant information as information can contradict itself (http://www.unicode.org/reports/tr38/tr38-5.html#N10211). See 呆 (U+5446) which has one form each for semantic and specialised semantic, each derived from a different source. Change also in getAllCharacterVariants().

To Do (Docu): Write about different kinds of variants

getAllCharacterVariants(self, char)

Gets all variant forms regardless of the type for the character.

A list of tuples is returned, including the character and its variant type. See getCharacterVariants() for variant types.

Variants depend on the locale, which is not taken into account here. Thus some of the returned characters might only be variants under some locales.

Parameters:
  • char (str) - Chinese character
Returns: list of tuple
list of character variant(s) with their type
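
A list of (character, variant type) tuples is easy to regroup by type. The sketch below assumes exactly that tuple shape; the sample list is made up for illustration and was not taken from the database, and the helper name is hypothetical.

```python
from collections import defaultdict


def group_by_type(variant_list):
    """Turn [(char, type), ...] into {type: [char, ...]}."""
    grouped = defaultdict(list)
    for char, variant_type in variant_list:
        grouped[variant_type].append(char)
    return dict(grouped)


# Made-up sample in the shape described above.
sample = [(u'國', 'T'), (u'囯', 'Z'), (u'圀', 'Z')]
print(group_by_type(sample))
```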

getLocaleDefaultZVariant(self, char, locale)

Gets the default Z-variant for the given character under the given locale.

The Z-variant returned is an index to the internal database of different character glyphs and represents the most common glyph used under the given locale.

Parameters:
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV)
Returns: int
Z-variant
Raises:
  • NoInformationError - if no Z-variant information is available
  • ValueError - if invalid character locale specified

getCharacterZVariants(self, char)

Gets a list of character Z-variant indices (glyphs) supported by the database.

A Z-variant index specifies a particular character glyph, which several glyph-dependent methods need instead of the abstract character defined by Unicode.

Parameters:
  • char (str) - Chinese character
Returns: list of int
list of supported Z-variants

getStrokeCount(self, char, locale=None, zVariant=0)

Gets the stroke count for the given character.

Parameters:
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: int
stroke count of given character
Raises:
  • NoInformationError - if no stroke count information available
  • ValueError - if an invalid character locale is specified

Attention: The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getStrokeCountDict(self)

Gets the full stroke count table from the database.

Returns: dict
dictionary with the pair (character, Z-variant) as key and the stroke count as value

Attention: The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getStrokeForAbbrev(self, abbrev)

Gets the stroke form for the given abbreviated name (e.g. 'HZ').

Parameters:
  • abbrev (str) - abbreviated stroke name
Returns: str
Unicode stroke character
Raises:
  • ValueError - if invalid stroke abbreviation is specified

getStrokeForName(self, name)

Gets the stroke form for the given name (e.g. '横折').

Parameters:
  • name (str) - Chinese name of stroke
Returns: str
Unicode stroke char
Raises:
  • ValueError - if invalid stroke name is specified

getStrokeOrder(self, char, locale=None, zVariant=0)

Gets the stroke order sequence for the given character.

The stroke order is constructed using the character's decomposition into components. As the stroke order information for some components might not be obtainable, the returned stroke order might be partial.

Parameters:
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: str
string of stroke abbreviations separated by spaces and hyphens.
Raises:
  • ValueError - if an invalid character locale is specified
  • NoInformationError - if no stroke order information available
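
A returned stroke order string can be taken apart with plain string operations. The sketch assumes that spaces separate components and hyphens separate the strokes within a component; the description above only says that both separators occur, so this reading is an assumption. The sample string and the function names are made up.

```python
def split_stroke_order(stroke_order):
    """Split a stroke order string into per-component lists of abbreviations."""
    return [component.split('-') for component in stroke_order.split(' ')]


def count_strokes(stroke_order):
    """Total number of stroke abbreviations in the string."""
    return sum(len(component) for component in split_stroke_order(stroke_order))


sample = 'H-S P-N'   # made-up sample, not a database entry
print(split_stroke_order(sample))
print(count_strokes(sample))
```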

To Do (Lang): Add stroke order source to stroke order data so that in general different and contradicting stroke order information can be given. The user then could prefer several sources that in the order given would be queried.

getCharacterKangxiRadicalIndex(self, char)

Gets the Kangxi radical index for the given character as defined by the Unihan database.

Parameters:
  • char (str) - Chinese character
Returns: int
Kangxi radical index

getCharacterKangxiRadicalResidualStrokeCount(self, char, locale=None, zVariant=0)

Gets the Kangxi radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

The representation of the included radical or radical variant form depends on the respective character variant, and thus the form's Z-variant is returned. Some characters include the given radical more than once, and in some cases the representation differs between those occurrences. In the general case several matches can therefore be returned, each entry with a different radical form Z-variant. In these cases the entries are sorted by their Z-variant.

There are characters which include both the radical form and a variant form of the radical (e.g. 伦: 人 and 亻). In these cases both are returned.

This method will return radical forms regardless of the selected locale, e.g. radical ⻔ is returned for character 间, though this variant form is not recognised under a traditional locale (like the character itself).

Parameters:
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: list of tuple
list of radical/variant form, its Z-variant, the main layout of the character (using an IDS operator), the position of the radical with respect to the layout (0, 1 or 2) and the residual stroke count.
Raises:
  • NoInformationError - if no stroke count information available
  • ValueError - if an invalid character locale is specified

getCharacterRadicalResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)

Gets the radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

This is a more general version of getCharacterKangxiRadicalResidualStrokeCount() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan.

Parameters:
  • char (str) - Chinese character
  • radicalIndex (int) - radical index
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: list of tuple
list of radical/variant form, its Z-variant, the main layout of the character (using an IDS operator), the position of the radical with respect to the layout (0, 1 or 2) and the residual stroke count.
Raises:
  • NoInformationError - if no stroke count information available
  • ValueError - if an invalid character locale is specified
To Do (Lang):
  • Clarify on characters classified under a given radical but without any proper radical glyph found as component.
  • Clarify on different radical zVariants for the same radical form. At best this method should return one and only one radical form (glyph).

To Do (Impl): Give the Unicode radical form and not the equivalent character form in the relevant table as to always return the pure radical form (also avoids duplicates). Then state:

If the included component has an appropriate Unicode radical form or Unicode radical variant, then this form is returned. In either case the radical form can be an ordinary character.

getCharacterRadicalResidualStrokeCountDict(self)

Gets the full table of radical forms (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components from the database.

A typical entry looks like (u'众', 0): {9: [(u'人', 0, u'⿱', 0, 4), (u'人', 0, u'⿻', 0, 4)]}, and can be accessed as radicalDict[(u'众', 0)][9] with the Chinese character, its Z-variant and Kangxi radical index. The values are given in the order radical form, radical Z-variant, character layout, relative position of the radical and finally the residual stroke count.

Returns: dict
dictionary of radical/residual stroke count entries.
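
Reading such an entry is easier with named fields. The sketch below rebuilds only the sample entry quoted above as a stand-in for the real table; the namedtuple and its field names are not part of the library and are chosen here for illustration.

```python
from collections import namedtuple

# Field order follows the description above; the names are illustrative.
RadicalEntry = namedtuple('RadicalEntry', [
    'radical_form', 'radical_z_variant', 'layout', 'position',
    'residual_stroke_count'])

# The sample entry from the text, standing in for the full table.
radicalDict = {
    (u'众', 0): {9: [(u'人', 0, u'⿱', 0, 4), (u'人', 0, u'⿻', 0, 4)]},
}

entries = [RadicalEntry(*row) for row in radicalDict[(u'众', 0)][9]]
for entry in entries:
    print(entry.radical_form, entry.layout, entry.residual_stroke_count)
```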

getCharacterKangxiResidualStrokeCount(self, char, locale=None, zVariant=0)

Gets the stroke count of the residual character components when leaving aside the radical form.

This method returns a subset of the data returned by getCharacterKangxiRadicalResidualStrokeCount(). It may nevertheless offer more entries, as there might exist information only about the residual stroke count, but not about the concrete radical form.

Parameters:
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: int
residual stroke count
Raises:
  • NoInformationError - if no stroke count information available
  • ValueError - if an invalid character locale is specified

Attention: The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getCharacterResidualStrokeCount(self, char, radicalIndex, locale=None, zVariant=0)

Gets the stroke count of the residual character components when leaving aside the radical form.

This is a more general version of getCharacterKangxiResidualStrokeCount() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan.

Parameters:
  • char (str) - Chinese character
  • radicalIndex (int) - radical index
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: int
residual stroke count
Raises:
  • NoInformationError - if no stroke count information available
  • ValueError - if an invalid character locale is specified

Attention: The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getCharacterResidualStrokeCountDict(self)

Gets the full table of stroke counts of the residual character components from the database.

A typical entry looks like (u'众', 0): {9: [4]}, and can be accessed as residualCountDict[(u'众', 0)][9] with the Chinese character, its Z-variant and Kangxi radical index which then gives the residual stroke count.

Returns: dict
dictionary of radical/residual stroke count entries.
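
The nested lookup described above can be sketched without a database. The dictionary below is a hand-built stand-in that contains only the entry quoted in the text, and residual_stroke_counts() is a hypothetical helper, not part of cjklib.

```python
# Hand-built stand-in for the table returned by
# getCharacterResidualStrokeCountDict(); only the quoted entry is included.
residualCountDict = {
    (u'众', 0): {9: [4]},
}


def residual_stroke_counts(table, char, radicalIndex, zVariant=0):
    """Hypothetical helper: return the list of residual stroke counts,
    or an empty list if the table has no matching entry."""
    return table.get((char, zVariant), {}).get(radicalIndex, [])


print(residual_stroke_counts(residualCountDict, u'众', 9))   # [4]
print(residual_stroke_counts(residualCountDict, u'众', 10))  # []
```

Using dict.get() at both levels avoids a KeyError for characters or radical indices that have no entry.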

getCharactersForKangxiRadicalIndex(self, radicalIndex)

Gets all characters for the given Kangxi radical index.

Parameters:
  • radicalIndex (int) - Kangxi radical index
Returns: list of str
list of matching Chinese characters

To Do (Lang): 6954 characters have no Kangxi radical. Provide integration for these (SELECT COUNT(*) FROM Unihan WHERE kRSUnicode IS NOT NULL AND kRSKangxi IS NULL;).

To Do (Docu): Write about how Unihan maps characters to a Kangxi radical. Especially Chinese simplified characters.

getCharactersForRadicalIndex(self, radicalIndex)

Gets all characters for the given radical index.

This is a more general version of getCharactersForKangxiRadicalIndex() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan; one character can show up under several different radical indices.

Parameters:
  • radicalIndex (int) - Kangxi radical index
Returns: list of str
list of matching Chinese characters

getResidualStrokeCountForKangxiRadicalIndex(self, radicalIndex)

Gets all characters and residual stroke count for the given Kangxi radical index.

This brings together methods getCharactersForKangxiRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all characters including the given Kangxi radical, additionally supplying the residual stroke count.

Parameters:
  • radicalIndex (int) - Kangxi radical index
Returns: list of tuple
list of matching Chinese characters with residual stroke count
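
How the two underlying results could be combined is sketched below. This is an illustration only: the (character, residual stroke count) tuple shape, the restriction to Z-variant 0, the ordering by stroke count and the helper name residual_counts_for_radical() are all assumptions, and the data is a hand-built stand-in for the database.

```python
def residual_counts_for_radical(characters, residualCountDict, radicalIndex):
    """Hypothetical helper: pair each character with its residual stroke
    count for radicalIndex, skipping characters without data."""
    result = []
    for char in characters:
        # Only Z-variant 0 is considered in this sketch.
        counts = residualCountDict.get((char, 0), {}).get(radicalIndex)
        if counts:
            result.append((char, counts[0]))
    # Order by residual stroke count, as in a printed radical index.
    return sorted(result, key=lambda pair: pair[1])


# Stand-in for getCharactersForKangxiRadicalIndex(9)
characters = [u'众', u'你', u'他']
# Stand-in for getCharacterResidualStrokeCountDict()
residualCountDict = {
    (u'众', 0): {9: [4]},
    (u'你', 0): {9: [5]},  # illustrative entry
    (u'他', 0): {9: [3]},  # illustrative entry
}

print(residual_counts_for_radical(characters, residualCountDict, 9))
```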

getResidualStrokeCountForRadicalIndex(self, radicalIndex)

Gets all characters and residual stroke count for the given radical index.

This brings together methods getCharactersForRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all characters including the given radical without being limited to the mapping of characters to a Kangxi radical as done by Unihan, additionally supplying the residual stroke count.

Parameters:
  • radicalIndex (int) - Kangxi radical index
Returns: list of tuple
list of matching Chinese characters with residual stroke count

getKangxiRadicalForm(self, radicalIdx, locale)

Gets a Unicode radical form for the given Kangxi radical index.

This method will always return a single non-null value, even if there are several radical forms for one index.

Parameters:
  • radicalIdx (int) - Kangxi radical index
  • locale (str) - character locale (one out of TCJKV)
Returns: str
Unicode radical form
Raises:
  • ValueError - if an invalid character locale or radical index is specified

To Do (Lang): Check if radicals for which multiple radical forms exist include a simplified form or other variation (e.g. ⻆, ⻝, ⺐). There are radicals for which a Chinese simplified character equivalent exists and that is mapped to a different radical under Unicode.

getKangxiRadicalVariantForms(self, radicalIdx, locale)

Gets a list of Unicode radical variants for the given Kangxi radical index.

This method can return an empty list if there are no Unicode radical variant forms. There might, however, be non-Unicode radical variants for this radical in the form of ordinary characters.

Parameters:
  • radicalIdx (int) - Kangxi radical index
  • locale (str) - character locale (one out of TCJKV)
Returns: list of str
list of Unicode radical variants
Raises:
  • ValueError - if an invalid character locale is specified

To Do (Lang): Narrow locales, not all variant forms are valid under all locales.

getKangxiRadicalIndex(self, radicalForm, locale=None)

Gets the Kangxi radical index for the given form.

The given form might either be a Unicode radical form or an equivalent character.

If there is an entry for the given radical form it still might not be a radical under the given character locale. So specifying a locale allows strict radical handling.

Parameters:
  • radicalForm (str) - radical form
  • locale (str) - optional character locale (one out of TCJKV)
Returns: int
Kangxi radical index
Raises:
  • ValueError - if invalid character locale or radical form is specified

getKangxiRadicalRepresentativeCharacters(self, radicalIdx, locale)

Gets a list of characters that represent the radical for the given Kangxi radical index.

This includes the radical form(s), character equivalents and variant forms and equivalents.

E.g. for the radical meaning to speak/to say/talk/word (Pinyin yán): ⾔ (0x2f94), 言 (0x8a00), ⻈ (0x2ec8), 讠 (0x8ba0), 訁 (0x8a01)

Parameters:
  • radicalIdx (int) - Kangxi radical index
  • locale (str) - character locale (one out of TCJKV)
Returns: list of str
list of Chinese characters representing the radical for the given index, including Unicode radical and variant forms and their equivalent real character forms
Raises:
  • ValueError - if invalid character locale specified

isKangxiRadicalFormOrEquivalent(self, form, locale=None)

Checks if the given form is a Kangxi radical form or a radical equivalent. This includes Unicode radical forms, Unicode radical variants, equivalent characters and isolated radical characters.

If there is an entry for the given radical form it still might not be a radical under the given character locale. So specifying a locale allows strict radical handling.

Parameters:
  • form (str) - Chinese character
  • locale (str) - optional character locale (one out of TCJKV)
Returns: bool
True if given form is a radical or equivalent character, False otherwise
Raises:
  • ValueError - if an invalid character locale is specified

isRadicalChar(self, char)

Checks if the given character is a Unicode radical form or Unicode radical variant.

This method only performs a quick check of the Unicode code point, so there is no guarantee that this form actually has a radical entry in the database.

Parameters:
  • char (str) - Chinese character
Returns: bool
True if given form is a radical form, False otherwise
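
A code point check of this kind can be sketched as follows. The sketch assumes that the test is made against the two Unicode blocks "CJK Radicals Supplement" (U+2E80 to U+2EFF) and "Kangxi Radicals" (U+2F00 to U+2FDF); the bounds actually used by cjklib may differ, and is_radical_char() is a hypothetical stand-alone function.

```python
def is_radical_char(char):
    """Return True if char lies in one of the two Unicode radical blocks.

    Block ranges include a few unassigned code points, which is acceptable
    for a quick check that does not consult the database.
    """
    code = ord(char)
    return (0x2E80 <= code <= 0x2EFF       # CJK Radicals Supplement
            or 0x2F00 <= code <= 0x2FDF)   # Kangxi Radicals


print(is_radical_char(u'⾔'))  # Unicode radical form U+2F94
print(is_radical_char(u'言'))  # equivalent character U+8A00
```

The equivalent character 言 lies in the CJK Unified Ideographs block, so a check of this kind reports it as not being a radical form.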

getRadicalFormEquivalentCharacter(self, radicalForm, locale)

Gets the equivalent character of the given Unicode radical form or Unicode radical variant.

The mapping mostly follows the Han Radical folding specified in the Draft Unicode Technical Report #30 Character Foldings under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding. All radical forms except U+2E80 (⺀) have an equivalent character. These equivalent characters are not necessarily visually identical and can be subject to major variation.

This method may raise an UnsupportedError if there is no supported equivalent character form.

Parameters:
  • radicalForm (str) - Unicode radical form
  • locale (str) - character locale (one out of TCJKV)
Returns: str
equivalent character form
Raises:
  • UnsupportedError - if there is no supported equivalent character form
  • ValueError - if invalid character locale or radical form is specified

getCharacterEquivalentRadicalForms(self, equivalentForm, locale)

Gets Unicode radical forms or Unicode radical variants for the given equivalent character.

The mapping mostly follows the Han Radical folding specified in the Draft Unicode Technical Report #30 Character Foldings under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding. Several radical forms can be mapped to the same equivalent character and thus this method in general returns several values.

Parameters:
  • equivalentForm (str) - Equivalent character of Unicode radical form or Unicode radical variant
  • locale (str) - character locale (one out of TCJKV)
Returns: list of str
Unicode radical forms or Unicode radical variants of the given equivalent character
Raises:
  • ValueError - if invalid character locale or equivalent character is specified

isBinaryIDSOperator(cls, char)
Class Method

Checks if given character is a binary IDS operator.

Parameters:
  • char (str) - Chinese character
Returns: bool
True if binary IDS operator, False otherwise

isTrinaryIDSOperator(cls, char)
Class Method

Checks if given character is a trinary IDS operator.

Parameters:
  • char (str) - Chinese character
Returns: bool
True if trinary IDS operator, False otherwise

isIDSOperator(cls, char)
Class Method

Checks if given character is an IDS operator.

Parameters:
  • char (str) - Chinese character
Returns: bool
True if IDS operator, False otherwise
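
The three checks above can be sketched with plain code point sets. The sketch is limited to the Ideographic Description Characters U+2FF0 to U+2FFB, of which U+2FF2 (⿲) and U+2FF3 (⿳) take three operands and the rest take two. The function names are hypothetical stand-alone equivalents, not the cjklib class methods.

```python
# Operators taking three operands: U+2FF2 and U+2FF3.
IDS_TRINARY_CODES = frozenset([0x2FF2, 0x2FF3])
# All remaining operators in U+2FF0..U+2FFB take two operands.
IDS_BINARY_CODES = frozenset(range(0x2FF0, 0x2FFC)) - IDS_TRINARY_CODES


def is_binary_ids_operator(char):
    return ord(char) in IDS_BINARY_CODES


def is_trinary_ids_operator(char):
    return ord(char) in IDS_TRINARY_CODES


def is_ids_operator(char):
    return is_binary_ids_operator(char) or is_trinary_ids_operator(char)


print(is_binary_ids_operator(u'⿱'))   # U+2FF1, two operands
print(is_trinary_ids_operator(u'⿲'))  # U+2FF2, three operands
print(is_ids_operator(u'言'))          # ordinary character
```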

getCharactersForComponents(self, componentList, locale, includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False)

Gets all characters that contain the given components.

If option includeEquivalentRadicalForms is set, all equivalent forms will be searched for when a Kangxi radical is given.

Parameters:
  • componentList (list of str) - list of character components
  • locale (str) - character locale (one out of TCJKV)
  • includeEquivalentRadicalForms (bool) - if True then characters in the given component list are interpreted as representatives for their radical and all radical forms are included in the search. E.g. 肉 will include ⺼ as a possible component.
  • resultIncludeRadicalForms (bool) - if True the result will include Unicode radical forms and Unicode radical variants
Returns: list of tuple
list of pairs of matching characters and their Z-variants
Raises:
  • ValueError - if an invalid character locale is specified

To Do (Lang): By default we get the equivalent character for a radical form. In some cases these equivalent characters will be only abstractly related to the given radical form (e.g. being the main radical form), so that the result set will be too big and will not reflect the original query. Set up a table including only strict visual relations between radical forms and equivalent characters. Alternatively restrict decomposition data to only include radical forms if appropriate, so there would be no need for conversion.

To Do (Data):
  • Adopt locale-dependent Z-variants for parent characters (e.g. 鬼 in 隗 愧 嵬).
  • Use radical forms and radical variant forms instead of equivalent characters in decomposition data. Mapping loses information.

To Do (Impl): Table of same character glyphs, including special radical forms (e.g. 言 and 訁).

getCharactersForEquivalentComponents(self, componentConstruct, locale=None, resultIncludeRadicalForms=False)

Gets all characters that contain at least one component per list entry, sorted by stroke count if available.

This is the general form of getCharactersForComponents() and allows a set of alternative characters per list entry, of which at least one must be a component of the matching character.

If a character locale is specified, only those characters are returned for which the decomposition of the locale's default Z-variant matches the given components. Otherwise all Z-variants are considered.

Parameters:
  • componentConstruct (list of list of str) - list of character components given as single characters or, for alternative characters, given as a list
  • locale (str) - optional character locale (one out of TCJKV)
  • resultIncludeRadicalForms (bool) - if True the result will include Unicode radical forms and Unicode radical variants
Returns: list of tuple
list of pairs of matching characters and their Z-variants
Raises:
  • ValueError - if an invalid character locale is specified
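
The matching rule "at least one component per list entry" can be sketched as below. The component sets are hand-written illustrations rather than database content, and matches_component_construct() is a hypothetical helper that ignores locales, Z-variants and sorting.

```python
def matches_component_construct(componentConstruct, charComponents):
    """Return True if every entry of componentConstruct contributes at
    least one character that is found among charComponents."""
    for entry in componentConstruct:
        # A plain string is a single component, a list holds alternatives.
        alternatives = entry if isinstance(entry, (list, tuple)) else [entry]
        if not any(alt in charComponents for alt in alternatives):
            return False
    return True


# Illustrative component sets for two characters
components = {
    u'讲': set([u'讠', u'井']),
    u'講': set([u'訁', u'冓']),
}

# First entry: any form of the "speech" radical; second entry: 井
query = [[u'言', u'訁', u'讠'], u'井']
matches = [char for char in sorted(components)
           if matches_component_construct(query, components[char])]
print(matches)
```

With the second entry left out, both characters match, since each contains one of the alternative forms listed in the first entry.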

getDecompositionEntries(self, char, locale=None, zVariant=0)

Gets the decomposition of the given character into components from the database. The resulting decomposition is only the first layer in a tree of possible paths along the decomposition as the components can be further subdivided.

There can be several decompositions for one character so a list of decompositions is returned.

Each entry in the result list consists of a list of characters (with its Z-variant) and IDS operators.

Parameters:
  • char (str) - Chinese character that is to be decomposed into components
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: list
list of first layer decompositions
Raises:
  • ValueError - if an invalid character locale is specified

getDecompositionEntriesDict(self)

Gets the full decomposition table from the database.

Returns: dict
dictionary with key pair character, Z-variant and the first layer decomposition as value

_getDecompositionFromString(self, decomposition)

Gets a tuple representation with character/Z-variant of the given character's decomposition into components.

Example: Entry ⿱尚[1]儿 will be returned as [u'⿱', (u'尚', 1), (u'儿', 0)].

Parameters:
  • decomposition (str) - character decomposition with IDS operator, components and optional Z-variant index
Returns: list
decomposition with character/Z-variant tuples
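
The conversion in the example above can be sketched with a small regular expression. This is not the cjklib implementation: decomposition_from_string() is a hypothetical stand-alone function that assumes IDS operators lie in U+2FF0 to U+2FFB and that a missing index means Z-variant 0.

```python
import re

# One token: a single character, optionally followed by "[n]".
_TOKEN = re.compile(r'(.)(?:\[(\d+)\])?', re.UNICODE | re.DOTALL)


def is_ids_operator(char):
    return 0x2FF0 <= ord(char) <= 0x2FFB


def decomposition_from_string(decomposition):
    """Turn a string such as u'⿱尚[1]儿' into a list of IDS operators
    and (character, Z-variant) tuples."""
    result = []
    for match in _TOKEN.finditer(decomposition):
        char, index = match.group(1), match.group(2)
        if is_ids_operator(char):
            result.append(char)
        else:
            # A missing "[n]" is read as Z-variant 0.
            result.append((char, int(index) if index else 0))
    return result


print(decomposition_from_string(u'⿱尚[1]儿'))
```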

getDecompositionTreeList(self, char, locale=None, zVariant=0)

Gets the decomposition of the given character into components as a list of decomposition trees.

There can be several decompositions for one character so one tree per decomposition is returned.

Each entry in the result list consists of a list of characters (each with its Z-variant and a list of further decompositions) and IDS operators. If a character can be further subdivided, its contained list is non-empty and includes yet another list of trees for the decomposition of the component.

Parameters:
  • char (str) - Chinese character that is to be decomposed into components
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the first character
Returns: list
list of decomposition trees
Raises:
  • ValueError - if an invalid character locale is specified
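
Walking such a tree can be sketched as follows. The sketch assumes that each tree is a list whose items are either an IDS operator string or a tuple (character, Z-variant, list of sub-trees); the sample tree is hand-built for illustration and iter_components() is a hypothetical helper.

```python
def iter_components(tree):
    """Yield (character, Z-variant) for every component at any depth."""
    for node in tree:
        if not isinstance(node, tuple):
            continue  # IDS operator, carries no component
        char, zVariant, subtrees = node
        yield (char, zVariant)
        for subtree in subtrees:
            for component in iter_components(subtree):
                yield component


# Hand-built example: 想 = ⿱ 相 心, where 相 = ⿰ 木 目
treeList = [
    [u'⿱',
     (u'相', 0, [[u'⿰', (u'木', 0, []), (u'目', 0, [])]]),
     (u'心', 0, [])],
]

for tree in treeList:
    print(list(iter_components(tree)))
```

Components are reported depth first, so 木 and 目 appear directly after their parent 相 and before 心.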

isComponentInCharacter(self, component, char, locale=None, zVariant=0, componentZVariant=None)

Checks if the given character contains the given component.

Parameters:
  • component (str) - character to be tested as a component
  • char (str) - Chinese character
  • locale (str) - character locale (one out of TCJKV). Giving the locale will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant supplied with option zVariant will be ignored.
  • zVariant (int) - Z-variant of the character char
  • componentZVariant (int) - Z-variant of the component; if left out every Z-variant matches for that character.
Returns: bool
True if component is a component of the given character, False otherwise
Raises:
  • ValueError - if an invalid character locale is specified
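
A recursive check of this kind can be sketched over a decomposition table keyed by (character, Z-variant), in the shape described for getDecompositionEntriesDict(). The table below is hand-built for illustration, locales are ignored, and is_component_in_character() is a hypothetical stand-alone function.

```python
def is_component_in_character(table, component, char, zVariant=0,
                              componentZVariant=None):
    """Return True if component occurs at any depth below char."""
    for decomposition in table.get((char, zVariant), []):
        for entry in decomposition:
            if not isinstance(entry, tuple):
                continue  # IDS operator
            entryChar, entryZVariant = entry
            # componentZVariant=None matches every Z-variant.
            if (entryChar == component
                    and componentZVariant in (None, entryZVariant)):
                return True
            if is_component_in_character(table, component, entryChar,
                                         entryZVariant, componentZVariant):
                return True
    return False


# Hand-built first layer decompositions
table = {
    (u'想', 0): [[u'⿱', (u'相', 0), (u'心', 0)]],
    (u'相', 0): [[u'⿰', (u'木', 0), (u'目', 0)]],
}

print(is_component_in_character(table, u'木', u'想'))
print(is_component_in_character(table, u'水', u'想'))
```

A False result here only means that the component is absent from the table, which may equally be due to incomplete data.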

To Do (Impl): Implement means to check if the component is really not found, or if our data is just insufficient.


Class Variable Details

CHARARACTER_READING_MAPPING

A list of readings for which a character mapping exists including the database's table name and the reading dialect parameters.

On conversion the first matching reading will be selected, so supplying several equivalent readings has limited use.

Value:
{'Hangul': ('CharacterHangul', {}),
 'Jyutping': ('CharacterJyutping', {'case': 'lower'}),
 'Pinyin': ('CharacterPinyin',
            {'case': 'lower', 'toneMarkType': 'Numbers'})}

IDS_BINARY

A list of binary IDS operators used to describe character decompositions.

Value:
[u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', u'⿺', u'⿻']