Class CharacterLookup
CharacterLookup provides access to lookup methods related to Han
characters.
The core of CharacterLookup is the underlying database, in which all
relevant data is stored, so nearly all methods of this class need
database access. On initialisation of the object a connection to a
database is therefore established; the logic for this is provided by the DatabaseConnector.
See the DatabaseConnector for supported database systems.
CharacterLookup will try to read the config file from either /etc or
the user's home folder. If none is present it will try to open a SQLite
database stored as db in the same folder by default. You can
override this behaviour by specifying additional parameters on creation
of the object.
Examples
The following examples should give a quick view into how to use this
package.
-
Create the CharacterLookup object with default settings (read from
cjklib.conf or 'cjklib.db' in same directory as default):
>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup()
-
Get a list of characters that are pronounced "국" in
Korean:
>>> cjk.getCharactersForReading(u'국', 'Hangul')
[u'匊', u'國', u'局', u'掬', u'菊', u'跼', u'鞠', u'鞫', u'麯', u'麴']
-
Check if a character is included in another character as a
component:
>>> cjk.isComponentInCharacter(u'女', u'好')
True
-
Get all Kangxi radical variants for Radical 184 (⾷) under the
traditional locale:
>>> cjk.getKangxiRadicalVariantForms(184, 'T')
[u'⻞', u'⻟']
Character locale
During the development of characters in the different cultures,
character appearances changed over time to such an extent that the
handling of radicals, character components and strokes needs to be
distinguished depending on the locale.
To deal with this circumstance CharacterLookup works with a
character locale. Most of the methods of this class ask for a locale to
be specified. In these cases the output of the method depends on the
specified locale.
For example in the traditional locale 这 has 8 strokes, but in
simplified Chinese it has only 7, as the radical ⻌ has different stroke
counts, depending on the locale.
Z-variants
One feature of Chinese characters is the glyph form describing the
visual representation. This feature doesn't need to be unique and so
many characters can be found in different writing variants e.g.
character 福 (English: luck) which has numerous forms.
The Unicode Consortium does not separately encode identical characters
that differ only in actual shape (called Z-variants), except for
a few "double" entries which are included to maintain
backward compatibility. In fact a code point represents an abstract
character and does not define any visual representation. Thus a distinct
appearance description including strokes and stroke order cannot
simply be assigned to a code point; instead one needs to deal with the notion
of Z-variants representing distinct glyphs to which a visual
description can be applied.
The name Z-variant is derived from the three-dimensional model
representing the space of characters relative to three axes: the
X axis representing the semantic space, the Y axis representing the
abstract shape space and finally the Z axis for typeface differences
(see "Principles of Han Unification" in: The Unicode Standard
5.0, chapter 12). Character presentations only differing in the Z
dimension are generally unified.
cjklib tries to offer a simple approach to handling different
Z-variants. As character components, strokes and the stroke order
depend on this variant, methods dealing with such data will ask for a
Z-variant value to be specified. In these cases the output of
the method depends on the specified variant.
Z-variants and character locales
Differences in stroke count, stroke order or decomposition into character
components between character locales are implemented
using different Z-variants. For the example given above the
entry 这 with 8 strokes is given as one Z-variant and the form with 7
strokes is given as another Z-variant.
In most cases one might only be interested in a single visual
appearance, the "standard" one. This visual appearance
would be the one generally used in the specific locale.
Instead of specifying a certain Z-variant most functions will
allow for passing of a character locale. Giving the locale will apply
the default Z-variant given by the mapping defined in the database
which can be obtained by calling getLocaleDefaultZVariant().
More complex relations as which of several Z-variants for a given
character are used in a given locale are not covered.
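The relation between character locale, default Z-variant and glyph-dependent data can be pictured with two small lookup tables. The following is a minimal sketch and not cjklib's implementation: the table contents, the Z-variant indices 0 and 1, the locale code 'C' for simplified Chinese (assumed from "TCJKV") and all helper names are assumptions made for illustration. Only the stroke counts 8 and 7 for 这 are taken from the example above.

```python
# Hypothetical stand-ins for the database tables described above.
VALID_LOCALES = frozenset('TCJKV')

# Glyph-dependent data is keyed by (character, Z-variant index).
STROKE_COUNT = {
    ('这', 0): 8,  # form used under the traditional locale (index assumed)
    ('这', 1): 7,  # form used under simplified Chinese (index assumed)
}

# Default Z-variant per (character, locale).
LOCALE_DEFAULT_Z_VARIANT = {
    ('这', 'T'): 0,
    ('这', 'C'): 1,
}


def locale_default_z_variant(char, locale):
    """Return the default Z-variant index of char under the given locale."""
    if locale not in VALID_LOCALES:
        raise ValueError("invalid character locale %r" % (locale,))
    return LOCALE_DEFAULT_Z_VARIANT[(char, locale)]


def stroke_count(char, locale=None, z_variant=0):
    """Look up a stroke count; a given locale overrides z_variant."""
    if locale is not None:
        z_variant = locale_default_z_variant(char, locale)
    return STROKE_COUNT[(char, z_variant)]


print(stroke_count('这', locale='T'))  # 8
print(stroke_count('这', locale='C'))  # 7
```

Passing a locale replaces any explicitly given Z-variant in this sketch, which mirrors the parameter description used by the glyph-dependent methods of this class.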
Kangxi radical functions
Using the Unihan database, queries about the Kangxi radical of
characters can be made. It is possible to get the Kangxi radical for a
character or to look up all characters for a given radical.
Unicode has extra code points for radical forms (e.g. ⾔), here
called Unicode radical forms, and radical variant forms
(e.g. ⻈), here called Unicode radical variants. These characters should
be used when explicitly referring to their function as radicals. For
most of the radicals and variants there exist complementary character
forms which have the same appearance (e.g. 言 and 讠) and which shall be
called equivalent characters here.
Mapping from one side to the other is not trivially possible, as some
forms only exist as radical forms, while others only exist as character forms
but are used in the radical context because of their meaning (called isolated radical characters here, e.g. 訁 for
Kangxi radical 149).
Additionally a one-to-one mapping can't be guaranteed, as some forms
have two or more equivalent forms in the other domain, and the mapping is
highly dependent on the locale.
CharacterLookup provides methods for dealing with these different
kinds of characters and the mapping between them.
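Part of the difficulty described above can be observed with Python's standard library alone. The sketch below uses Unicode's compatibility normalisation (NFKC) as a stand-in: it is not the locale-dependent mapping stored in the cjklib database, and the helper name is hypothetical. The Kangxi radical form ⾔ (U+2F94) carries a compatibility mapping to 言 (U+8A00), while the radical variant ⻈ (U+2EC8) carries none, so a lookup table is needed for its equivalent character 讠.

```python
import unicodedata


def compatibility_equivalent(radical_form):
    """Return Unicode's compatibility equivalent of a form, or None.

    This relies only on the Unicode Character Database and knows
    nothing about character locales.
    """
    normalized = unicodedata.normalize('NFKC', radical_form)
    return normalized if normalized != radical_form else None


print(compatibility_equivalent('\u2f94') == '\u8a00')  # True
print(compatibility_equivalent('\u2ec8'))              # None
```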
Character decomposition
Many characters can be decomposed into two or more components that
are themselves Chinese characters. This fact can be used in many ways,
including character lookup, finding patterns for font design or
studying characters. Even the stroke order and stroke count can be
deduced from the stroke information of the character's components.
Character decomposition is highly dependent on the appearance of the
character, so both Z-variant and character locale need to
be clear when looking at a decomposition into components.
Further points make this task more complex: the decomposition into a
set of components is not unique, as some characters can be broken down
into different sets. Furthermore, sometimes one component can be given
but the other component is not encoded as a character in its own
right.
These components might in turn be characters that contain further
components (again not uniquely), thus a complex decomposition in
several steps is possible.
The basis for the character decomposition lies in the database,
where all decompositions are stored, using Ideographic Description Sequences (IDS).
These sequences consist of Unicode IDS operators and characters to describe the
structure of the character. There are binary IDS
operators to describe decomposition into two components (e.g. ⿰ for
one component left, one right as in 好: ⿰女子) or trinary IDS
operators for decomposition into three components (e.g. ⿲ for three
components from left to right as in 辨: ⿲⾟刂⾟). Using IDS
operators it is possible to give basic structural information,
which in many cases is enough, for example, to derive an overall stroke
order from two single sets of stroke orders. Furthermore, it makes it
possible to look for redundant information in different entries and
thus helps to keep the definition data clean.
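The prefix structure of an IDS can be made concrete with a small recursive parser. This is a minimal sketch, not the parser used by cjklib: the function names are hypothetical, and the operator sets are built from the Unicode block of Ideographic Description Characters (U+2FF0 to U+2FFB) rather than taken from this class.

```python
# Binary operators take two operands, trinary operators take three.
IDS_BINARY = {chr(c) for c in (0x2FF0, 0x2FF1) + tuple(range(0x2FF4, 0x2FFC))}
IDS_TRINARY = {chr(0x2FF2), chr(0x2FF3)}


def _parse(sequence, pos):
    """Parse one component starting at pos; return (tree, next position)."""
    if pos >= len(sequence):
        raise ValueError("incomplete IDS: %r" % (sequence,))
    char = sequence[pos]
    if char in IDS_BINARY:
        arity = 2
    elif char in IDS_TRINARY:
        arity = 3
    else:
        return char, pos + 1  # a plain character is a leaf node
    children = []
    pos += 1
    for _ in range(arity):
        child, pos = _parse(sequence, pos)
        children.append(child)
    return (char, children), pos


def parse_ids(sequence):
    """Turn an IDS string into a nested (operator, [children]) tree."""
    tree, pos = _parse(sequence, 0)
    if pos != len(sequence):
        raise ValueError("trailing characters in IDS: %r" % (sequence,))
    return tree


print(parse_ids('⿰女子'))
print(parse_ids('⿲⾟刂⾟'))
```

Because every operator announces how many operands follow, nested sequences need no brackets: an operand may itself start with an operator.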
This class provides methods for retrieving the basic partition
entries, lookup of characters by components and decomposing as a tree
from the character as a root down to the minimal
components as leaf nodes.
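Lookup of characters by components can be sketched with a toy decomposition table. The table layout and the function names below are assumptions for illustration and do not reflect the database schema or the methods of this class. The two entries are the decompositions of 好 and 辨 quoted above.

```python
# Toy table: character -> list of its direct components.
DECOMPOSITION = {
    '好': ['女', '子'],
    '辨': ['⾟', '刂', '⾟'],
}


def all_components(char, table=DECOMPOSITION):
    """Return every component of char, following nested entries."""
    found = set()
    for component in table.get(char, []):
        found.add(component)
        found |= all_components(component, table)
    return found


def characters_for_components(components, table=DECOMPOSITION):
    """Return the characters that contain all of the given components."""
    wanted = set(components)
    return sorted(c for c in table if wanted <= all_components(c, table))


print(characters_for_components(['女']))
print(characters_for_components(['⾟', '刂']))
```

The recursion in all_components() walks the decomposition from the character as root down to components without an entry of their own, which corresponds to the leaf nodes of the tree.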
TODO: Policy about what to classify as partition.
Strokes
Chinese characters consist of different strokes as basic parts.
These strokes are written in a mostly distinct order called the stroke order and
have a distinct stroke count.
The stroke order in the writing of Chinese characters is
important e.g. for calligraphy or students learning new characters and
is normally fixed as there is only one possible stroke order for each
character. Furthermore there is a fixed set of possible strokes and
these strokes carry names.
As with character decomposition the stroke order and
stroke count are highly dependent on the appearance of the
character, so both Z-variant and character locale need to
be known.
Furthermore the order of strokes can be useful for lookup of
characters, and so CharacterLookup provides different methods for
getting the stroke count, stroke order, lookup of stroke names and
lookup of characters by stroke types and stroke order.
Most methods work with an abbreviation of stroke names using the
first letters of each syllable of the Chinese name in Pinyin.
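The abbreviation scheme can be illustrated in a few lines. This is only a sketch of the naming rule with a hypothetical helper; the authoritative set of abbreviations lives in the database and may deviate from the plain rule. The stroke name 横折 is read héng zhé in Pinyin, and pairing it with the abbreviation 'HZ' is an assumption based on that reading.

```python
def stroke_abbreviation(pinyin_syllables):
    """Join the upper-cased first letter of each Pinyin syllable."""
    return ''.join(s[0].upper() for s in pinyin_syllables if s)


print(stroke_abbreviation(['héng', 'zhé']))  # HZ
```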
The stroke order is not always clear, and even academics
argue about which order should be considered the correct one, a
discussion that shouldn't be taken lightly. This circumstance should
be considered when working with stroke orders.
TODO: About plans of cjklib how to support different views on the
stroke order
TODO: About the different classifications of strokes
Readings
See module reading for a detailed description.
To Do (Lang):
Add option to component decomposition methods to stop on Kangxi radical
forms without breaking further down beyond those.
To Do (Fix):
-
Incorporate stroke lookup (bigram) techniques
-
How to handle character forms (either decomposition or stroke order),
that can only be found as a component in other characters? We already
mark them by flagging it with an 'S'.
To Do (Impl):
-
Think about applying locale at object creation time and not passing it
on every method call. Would make the class easier to use.
-
Create a method for specifying which character range is of interest for
the return values of methods. Narrowing the returned results is a further
step towards locale-dependent responses. E.g. cjknife could take this into
account by only displaying characters that can be displayed with the
current locale (BIG5, GBK...).
|
__init__(self,
databaseUrl=None,
dbConnectInst=None)
Initialises the CharacterLookup.
|
_locale(self,
locale)
Gets the locale search value for a database lookup on databases with
character locale dependent content.
- Returns: str
|
_getCompatibleCharacterReading(self,
readingN,
toCharReading=True)
Gets a reading for which a mapping to or from Chinese characters is
supported and that is compatible (a conversion is supported) with the
given reading.
- Returns: str
|
getCharacterKangxiRadicalResidualStrokeCount(self,
char,
locale=None,
zVariant=0)
Gets the Kangxi radical form (either a Unicode radical form or
a Unicode radical variant) found as a component in the
character and the stroke count of the residual character components.
- Returns: list of tuple
|
getCharacterRadicalResidualStrokeCount(self,
char,
radicalIndex,
locale=None,
zVariant=0)
Gets the radical form (either a Unicode radical form or a
Unicode radical variant) found as a component in the character
and the stroke count of the residual character components.
- Returns: list of tuple
|
getCharacterRadicalResidualStrokeCountDict(self)
Gets the full table of radical forms (either a Unicode radical
form or a Unicode radical variant) found as a component in
the character and the stroke count of the residual character
components from the database.
- Returns: dict
|
isRadicalChar(self,
char)
Checks if the given character is a Unicode radical form or
Unicode radical variant.
- Returns: bool
|
getCharactersForComponents(self,
componentList,
locale,
includeEquivalentRadicalForms=True,
resultIncludeRadicalForms=False)
Gets all characters that contain the given components.
- Returns: list of tuple
|
isComponentInCharacter(self,
component,
char,
locale=None,
zVariant=0,
componentZVariant=None)
Checks if the given character contains the given component.
- Returns: bool
|
CHARARACTER_READING_MAPPING = {'Hangul': ('CharacterHangul', {...
A list of readings for which a character mapping exists including the
database's table name and the reading dialect parameters.
|
|
_strokeLookup = None
A dictionary containing stroke forms for stroke abbreviations.
|
|
IDS_BINARY = [u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', ...
A list of binary IDS operators used to describe character
decompositions.
|
|
IDS_TRINARY = [u'⿲', u'⿳']
A list of trinary IDS operators used to describe character
decompositions.
|
__init__(self,
databaseUrl=None,
dbConnectInst=None)
(Constructor)
Initialises the CharacterLookup.
If no parameters are given default values are assumed for the
connection to the database. The database connection parameters can be
given in databaseUrl, or an instance of DatabaseConnector can be passed in dbConnectInst, the
latter one being preferred if both are specified.
- Parameters:
databaseUrl (str) - database connection setting in the format
driver://user:pass@host/database .
dbConnectInst (instance) - instance of a DatabaseConnector
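The layout of the databaseUrl format can be illustrated by splitting such a string into its parts. This is a sketch using Python 3's standard library (the module is named urlparse under Python 2) and does not show how DatabaseConnector itself processes the setting. The helper name and the example URL with the driver name mysql are hypothetical.

```python
from urllib.parse import urlsplit


def split_database_url(database_url):
    """Split driver://user:pass@host/database into named parts."""
    parts = urlsplit(database_url)
    return {
        'driver': parts.scheme,
        'user': parts.username,
        'password': parts.password,
        'host': parts.hostname,
        'database': parts.path.lstrip('/'),
    }


# Example URL made up for illustration only.
settings = split_database_url('mysql://user:pass@localhost/cjklib')
print(settings['driver'], settings['host'], settings['database'])
```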
|
getCharactersForReading(self,
readingString,
readingN,
**options)
Gets all known characters for the given reading.
- Parameters:
readingString (str) - reading string for lookup
readingN (str) - name of reading
options - additional options for handling the reading input
- Returns: list of str
- list of characters for the given reading
- Raises:
UnsupportedError - if no mapping between characters and target reading exists.
ConversionError - if conversion from the internal source reading to the given target
reading fails.
|
getReadingForCharacter(self,
char,
readingN,
**options)
Gets all known readings for the character in the given target
reading.
- Parameters:
char (str) - Chinese character for lookup
readingN (str) - name of target reading
options - additional options for handling the reading output
- Returns: str
- list of readings for the given character
- Raises:
UnsupportedError - if no mapping between characters and target reading exists.
ConversionError - if conversion from the internal source reading to the given target
reading fails.
|
_getCompatibleCharacterReading(self,
readingN,
toCharReading=True)
Gets a reading for which a mapping to or from Chinese characters is supported
and that is compatible (a conversion is supported) with the given
reading.
- Parameters:
readingN (str) - name of reading
toCharReading (bool) - True if conversion is done in direction to the given
reading, False otherwise
- Returns: str
- a reading that is compatible to the given one and where character
lookup is supported
- Raises:
|
_locale(self,
locale)
Gets the locale search value for a database lookup on databases with
character locale dependent content.
- Parameters:
locale (str) - character locale (one out of TCJKV)
- Returns: str
- search locale used for SQL select
- Raises:
ValueError - if invalid character locale specified
To Do (Fix):
This probably requires a full table scan
|
getCharacterVariants(self,
char,
variantType)
Gets the variant forms of the given type for the character.
The type can be one out of:
-
C, compatible character form (if character was added to
Unicode to maintain compatibility and round-trip convertibility)
-
M, semantic variant forms, which are often used
interchangeably instead of the character.
-
P, specialised semantic variant forms, which are often used
interchangeably instead of the character but limited to certain
contexts.
-
Z, Z-variant forms, which only differ in typeface (and would
have been unified if not to maintain round trip convertibility)
-
S, simplified Chinese character forms, originating from the
character simplification process of the PR China.
-
T, traditional character forms for a simplified Chinese
character.
Variants depend on the locale, which is not taken into account here.
Thus some of the returned characters might only be variants under some
locales.
- Parameters:
char (str) - Chinese character
variantType (str) - type of variant(s) to be returned
- Returns: list of str
- list of character variant(s) of given type
To Do (Lang):
What is the difference between Z-variants and compatible variants? Some
links between two characters are bidirectional, some are not. Is there any
rule?
To Do (Impl):
Give a source on variant information as information can contradict
itself (http://www.unicode.org/reports/tr38/tr38-5.html#N10211).
See 呆 (U+5446) which has one form each for semantic and specialised
semantic, each derived from a different source. Change also in getAllCharacterVariants().
To Do (Docu):
Write about different kinds of variants
|
getAllCharacterVariants(self,
char)
Gets all variant forms regardless of the type for the character.
A list of tuples is returned, including the character and its variant
type. See getCharacterVariants() for variant types.
Variants depend on the locale, which is not taken into account here.
Thus some of the returned characters might only be variants under some
locales.
- Parameters:
char (str) - Chinese character
- Returns: list of tuple
- list of character variant(s) with their type
|
getLocaleDefaultZVariant(self,
char,
locale)
Gets the default Z-variant for the given character under the given
locale.
The Z-variant returned is an index to the internal database of
different character glyphs and represents the most common glyph used
under the given locale.
- Parameters:
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV)
- Returns: int
- Z-variant
- Raises:
NoInformationError - if no Z-variant information is available
ValueError - if invalid character locale specified
|
Gets a list of character Z-variant indices (glyphs) supported by the
database.
A Z-variant index specifies a particular character glyph, which
several glyph-dependent methods need instead of the abstract
character defined by Unicode.
- Parameters:
char (str) - Chinese character
- Returns: list of int
- list of supported Z-variants
- Raises:
|
getStrokeCount(self,
char,
locale=None,
zVariant=0)
Gets the stroke count for the given character.
- Parameters:
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: int
- stroke count of given character
- Raises:
NoInformationError - if no stroke count information available
ValueError - if an invalid character locale is specified
Attention:
The quality of the returned data depends on the sources used when
compiling the database. Unihan itself only gives very general
stroke order information without being bound to a specific glyph.
|
Gets the full stroke count table from the database.
- Returns: dict
- dictionary of key pair character, Z-variant and value stroke
count
Attention:
The quality of the returned data depends on the sources used when
compiling the database. Unihan itself only gives very general
stroke order information without being bound to a specific glyph.
|
Gets the stroke form for the given abbreviated name (e.g. 'HZ').
- Parameters:
abbrev (str) - abbreviated stroke name
- Returns: str
- Unicode stroke character
- Raises:
ValueError - if invalid stroke abbreviation is specified
|
Gets the stroke form for the given name (e.g. '横折').
- Parameters:
name (str) - Chinese name of stroke
- Returns: str
- Unicode stroke char
- Raises:
ValueError - if invalid stroke name is specified
|
getStrokeOrder(self,
char,
locale=None,
zVariant=0)
Gets the stroke order sequence for the given character.
The stroke order is constructed using the character decomposition into
components. As the stroke order information for some components might
not be obtainable, the returned stroke order might be partial.
- Parameters:
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: str
- string of stroke abbreviations separated by spaces and hyphens.
- Raises:
ValueError - if an invalid character locale is specified
NoInformationError - if no stroke order information available
To Do (Lang):
Add stroke order source to stroke order data so that in general
different and contradicting stroke order information can be given. The
user then could prefer several sources that in the order given would be
queried.
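A stroke order string of the shape described under Returns can be split into its stroke abbreviations without knowing what the two separators signify. The sketch below makes no assumption about the meaning of spaces versus hyphens. The helper name is hypothetical and the sample string is made up for illustration, so it is not the stroke order of any real character.

```python
import re


def split_stroke_order(stroke_order):
    """Split a stroke order string into a flat list of abbreviations."""
    return [s for s in re.split(r'[ -]+', stroke_order.strip()) if s]


sample = 'H-S HZ-H'  # made-up sample, not real stroke order data
strokes = split_stroke_order(sample)
print(strokes)       # ['H', 'S', 'HZ', 'H']
print(len(strokes))  # 4
```

Counting the resulting list gives a stroke count only if the returned stroke order is complete, which the description above does not guarantee.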
|
Gets the Kangxi radical index for the given character as defined by
the Unihan database.
- Parameters:
char (str) - Chinese character
- Returns: int
- Kangxi radical index
- Raises:
|
getCharacterKangxiRadicalResidualStrokeCount(self,
char,
locale=None,
zVariant=0)
Gets the Kangxi radical form (either a Unicode radical form or
a Unicode radical variant) found as a component in the character
and the stroke count of the residual character components.
The representation of the included radical or radical variant form
depends on the respective character variant, and thus the form's Z-variant
is returned. Some characters include the given radical more than once, and
in some cases the representation differs between those occurrences.
In the general case several matches can therefore be returned, each entry with
a different radical form Z-variant. In these cases the entries are sorted
by their Z-variant.
There are characters which include both, the radical form and a
variant form of the radical (e.g. 伦: 人 and 亻). In these cases both are
returned.
This method will return radical forms regardless of the selected
locale, e.g. radical ⻔ is returned for character 间, though this variant
form is not recognised under a traditional locale (like the character
itself).
- Parameters:
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: list of tuple
- list of radical/variant form, its Z-variant, the main layout of
the character (using an IDS operator), the position of the
radical with respect to the layout (0, 1 or 2) and the residual stroke count.
- Raises:
NoInformationError - if no stroke count information available
ValueError - if an invalid character locale is specified
|
getCharacterRadicalResidualStrokeCount(self,
char,
radicalIndex,
locale=None,
zVariant=0)
Gets the radical form (either a Unicode radical form or a
Unicode radical variant) found as a component in the character and
the stroke count of the residual character components.
This is a more general version of getCharacterKangxiRadicalResidualStrokeCount() which is
not limited to the mapping of characters to a Kangxi radical as done by
Unihan.
- Parameters:
char (str) - Chinese character
radicalIndex (int) - radical index
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: list of tuple
- list of radical/variant form, its Z-variant, the main layout of
the character (using an IDS operator), the position of the
radical with respect to the layout (0, 1 or 2) and the residual stroke count.
- Raises:
NoInformationError - if no stroke count information available
ValueError - if an invalid character locale is specified
To Do (Lang):
-
Clarify on characters classified under a given radical but without any
proper radical glyph found as component.
-
Clarify on different radical zVariants for the same radical form. At
best this method should return one and only one radical form (glyph).
To Do (Impl):
Give the Unicode radical form and not the equivalent character
form in the relevant table as to always return the pure radical form
(also avoids duplicates). Then state:
If the included component has an appropriate Unicode radical
form or Unicode radical variant, then this form is returned.
In either case the radical form can be an ordinary character.
|
getCharacterRadicalResidualStrokeCountDict(self)
Gets the full table of radical forms (either a Unicode radical
form or a Unicode radical variant) found as a component in the
character and the stroke count of the residual character components from
the database.
A typical entry looks like (u'众', 0): {9: [(u'人', 0, u'⿱', 0,
4), (u'人', 0, u'⿻', 0, 4)]} , and can be accessed as
radicalDict[(u'众', 0)][9] with the Chinese character, its
Z-variant and Kangxi radical index. The values are given in the order
radical form, radical Z-variant, character layout,
relative position of the radical and finally the residual
stroke count.
- Returns: dict
- dictionary of radical/residual stroke count entries.
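The nested structure of the returned dictionary can be unpacked into named fields for readability. The following sketch works on a hand-built dictionary that contains only the typical entry quoted above. The RadicalEntry type, its field names and the helper function are hypothetical and not part of cjklib.

```python
from collections import namedtuple

# Field order follows the description above.
RadicalEntry = namedtuple('RadicalEntry', [
    'radical_form', 'radical_z_variant', 'layout', 'position',
    'residual_stroke_count',
])

# Hand-built stand-in for the dictionary returned by the method.
radical_dict = {
    ('众', 0): {9: [('人', 0, '⿱', 0, 4), ('人', 0, '⿻', 0, 4)]},
}


def radical_entries(table, char, z_variant, radical_index):
    """Return the entries for a character glyph and radical index."""
    raw = table.get((char, z_variant), {}).get(radical_index, [])
    return [RadicalEntry(*entry) for entry in raw]


for entry in radical_entries(radical_dict, '众', 0, 9):
    print(entry.radical_form, entry.layout, entry.residual_stroke_count)
```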
|
getCharacterKangxiResidualStrokeCount(self,
char,
locale=None,
zVariant=0)
Gets the stroke count of the residual character components when
leaving aside the radical form.
This method returns a subset of the data returned by getCharacterKangxiRadicalResidualStrokeCount(). It may
nevertheless offer more entries, as there might exist information
only about the residual stroke count, but not about the concrete radical
form.
- Parameters:
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: int
- residual stroke count
- Raises:
NoInformationError - if no stroke count information available
ValueError - if an invalid character locale is specified
Attention:
The quality of the returned data depends on the sources used when
compiling the database. Unihan itself only gives very general
stroke order information without being bound to a specific glyph.
|
getCharacterResidualStrokeCount(self,
char,
radicalIndex,
locale=None,
zVariant=0)
Gets the stroke count of the residual character components when
leaving aside the radical form.
This is a more general version of getCharacterKangxiResidualStrokeCount() which is not
limited to the mapping of characters to a Kangxi radical as done by
Unihan.
- Parameters:
char (str) - Chinese character
radicalIndex (int) - radical index
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: int
- residual stroke count
- Raises:
NoInformationError - if no stroke count information available
ValueError - if an invalid character locale is specified
Attention:
The quality of the returned data depends on the sources used when
compiling the database. Unihan itself only gives very general
stroke order information without being bound to a specific glyph.
|
getCharacterResidualStrokeCountDict(self)
Gets the full table of stroke counts of the residual character
components from the database.
A typical entry looks like (u'众', 0): {9: [4]} , and can
be accessed as residualCountDict[(u'众', 0)][9] with the
Chinese character, its Z-variant and Kangxi radical index which then
gives the residual stroke count.
- Returns: dict
- dictionary of radical/residual stroke count entries.
|
getCharactersForKangxiRadicalIndex(self,
radicalIndex)
Gets all characters for the given Kangxi radical index.
- Parameters:
radicalIndex (int) - Kangxi radical index
- Returns: list of str
- list of matching Chinese characters
To Do (Lang):
6954 characters have no Kangxi radical. Provide integration for these
(SELECT COUNT(*) FROM Unihan WHERE kRSUnicode IS NOT NULL AND kRSKangxi
IS NULL;).
To Do (Docu):
Write about how Unihan maps characters to a Kangxi radical. Especially
Chinese simplified characters.
|
getCharactersForRadicalIndex(self,
radicalIndex)
Gets all characters for the given radical index.
This is a more general version of getCharactersForKangxiRadicalIndex() which is not
limited to the mapping of characters to a Kangxi radical as done by
Unihan and one character can show up under several different radical
indices.
- Parameters:
radicalIndex (int) - Kangxi radical index
- Returns: list of str
- list of matching Chinese characters
|
getResidualStrokeCountForKangxiRadicalIndex(self,
radicalIndex)
Gets all characters and residual stroke count for the given Kangxi
radical index.
This brings together methods getCharactersForKangxiRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all
characters including the given Kangxi radical, additionally supplying the
residual stroke count.
- Parameters:
radicalIndex (int) - Kangxi radical index
- Returns: list of tuple
- list of matching Chinese characters with residual stroke count
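How the two sources of data might be brought together can be sketched as a simple join. This is an illustration under assumptions and not the method's implementation: the function name, the (character, count) tuple shape, the fixed Z-variant and the one-element character list are hypothetical. The residual stroke count entry for 众 follows the typical entry documented for the residual stroke count table.

```python
def residual_counts_for_radical(characters, residual_count_dict,
                                radical_index, z_variant=0):
    """Pair each character with its residual stroke counts."""
    result = []
    for char in characters:
        by_radical = residual_count_dict.get((char, z_variant), {})
        for count in by_radical.get(radical_index, []):
            result.append((char, count))
    return result


# Hand-built stand-ins for the results of the two underlying lookups.
characters_for_radical_9 = ['众']
residual_count_dict = {('众', 0): {9: [4]}}

print(residual_counts_for_radical(
    characters_for_radical_9, residual_count_dict, 9))
```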
|
getResidualStrokeCountForRadicalIndex(self,
radicalIndex)
Gets all characters and residual stroke count for the given radical
index.
This brings together methods getCharactersForRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all
characters including the given radical without being limited to the
mapping of characters to a Kangxi radical as done by Unihan, additionally
supplying the residual stroke count.
- Parameters:
radicalIndex (int) - Kangxi radical index
- Returns: list of tuple
- list of matching Chinese characters with residual stroke count
|
getKangxiRadicalForm(self,
radicalIdx,
locale)
Gets a Unicode radical form for the given Kangxi radical
index.
This method will always return a single non null value, even if there
are several radical forms for one index.
- Parameters:
radicalIdx (int) - Kangxi radical index
locale (str) - character locale (one out of TCJKV)
- Returns: str
- Unicode radical form
- Raises:
ValueError - if an invalid character locale or radical index is specified
To Do (Lang):
Check if radicals for which multiple radical forms exist include a
simplified form or other variation (e.g. ⻆, ⻝, ⺐). There are radicals
for which a Chinese simplified character equivalent exists and that is
mapped to a different radical under Unicode.
|
getKangxiRadicalVariantForms(self,
radicalIdx,
locale)
| source code
|
Gets a list of Unicode radical variants for the given Kangxi
radical index.
This method can return an empty list if there are no Unicode
radical variant forms. There might be non-Unicode radical
variants for this radical as character forms though.
- Parameters:
radicalIdx (int) - Kangxi radical index
locale (str) - character locale (one out of TCJKV)
- Returns: list of str
- list of Unicode radical variants
- Raises:
ValueError - if an invalid character locale is specified
To Do (Lang):
Narrow locales, not all variant forms are valid under all locales.
|
getKangxiRadicalIndex(self,
radicalForm,
locale=None)
Gets the Kangxi radical index for the given form.
The given form might either be a Unicode radical form or an
equivalent character.
If there is an entry for the given radical form it still might not be
a radical under the given character locale. So specifying a locale allows
strict radical handling.
- Parameters:
radicalForm (str) - radical form
locale (str) - optional character locale (one out of TCJKV)
- Returns: int
- Kangxi radical index
- Raises:
ValueError - if invalid character locale or radical form is specified
|
getKangxiRadicalRepresentativeCharacters(self,
radicalIdx,
locale)
| source code
|
Gets a list of characters that represent the radical for the given
Kangxi radical index.
This includes the radical form(s), character equivalents and variant
forms and equivalents.
E.g. character for to speak/to say/talk/word (Pinyin
yán): ⾔ (0x2f94), 言 (0x8a00), ⻈ (0x2ec8), 讠 (0x8ba0), 訁
(0x8a01)
- Parameters:
radicalIdx (int) - Kangxi radical index
locale (str) - character locale (one out of TCJKV)
- Returns: list of str
- list of Chinese characters representing the radical for the given
index, including Unicode radical and variant forms and their
equivalent real character forms
- Raises:
ValueError - if invalid character locale specified
|
isKangxiRadicalFormOrEquivalent(self,
form,
locale=None)
| source code
|
Checks if the given form is a Kangxi radical form or a radical
equivalent. This includes Unicode radical forms, Unicode
radical variants, equivalent character and isolated radical
characters.
If there is an entry for the given radical form it still might not be
a radical under the given character locale. So specifying a locale allows
strict radical handling.
- Parameters:
form (str) - Chinese character
locale (str) - optional character locale (one out of TCJKV)
- Returns: bool
True if given form is a radical or equivalent
character, False otherwise
- Raises:
ValueError - if an invalid character locale is specified
|
Checks if the given character is a Unicode radical form or
Unicode radical variant.
This method only does a quick check of the Unicode code point. So there is
no guarantee this form actually has a radical entry in the database.
- Parameters:
char (str) - Chinese character
- Returns: bool
True if given form is a radical form,
False otherwise
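A code point check of this kind can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name is_radical_char and the ranges are assumptions taken from the Unicode block charts (Kangxi Radicals and CJK Radicals Supplement).

```python
# Assumed code point ranges, taken from the Unicode block charts rather
# than from the library itself.
KANGXI_RADICALS = (0x2F00, 0x2FD5)          # Kangxi Radicals block
CJK_RADICALS_SUPPLEMENT = (0x2E80, 0x2EF3)  # CJK Radicals Supplement
UNASSIGNED = {0x2E9A}                       # gap inside the supplement


def is_radical_char(char):
    """Return True if char lies in one of the two Unicode radical blocks."""
    code = ord(char)
    if code in UNASSIGNED:
        return False
    return (KANGXI_RADICALS[0] <= code <= KANGXI_RADICALS[1]
            or CJK_RADICALS_SUPPLEMENT[0] <= code <= CJK_RADICALS_SUPPLEMENT[1])


print(is_radical_char('\u2f94'))  # ⾔, Unicode radical form
print(is_radical_char('\u2ec8'))  # ⻈, Unicode radical variant
print(is_radical_char('\u8a00'))  # 言, ordinary character
```

As with the method described above, a positive result only says that the code point lies in a radical block; it says nothing about the database content.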
|
getRadicalFormEquivalentCharacter(self,
radicalForm,
locale)
| source code
|
Gets the equivalent character of the given Unicode radical
form or Unicode radical variant.
The mapping mostly follows the Han Radical
folding specified in the Draft Unicode Technical Report #30 Character
Foldings under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding.
All radical forms except U+2E80 (⺀) have an equivalent character. These
equivalent characters are not necessarily visually identical and can be
subject to major variation.
This method may raise an UnsupportedError if there is no supported
equivalent character form.
- Parameters:
radicalForm (str) - Unicode radical form
locale (str) - character locale (one out of TCJKV)
- Returns: str
- equivalent character form
- Raises:
UnsupportedError - if there is no supported equivalent character form
ValueError - if invalid character locale or radical form is specified
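For comparison, part of this mapping can be reproduced with Unicode compatibility normalization from the Python standard library. This is a different technique from the database lookup described above and is shown only as a sketch: it ignores the locale, and while it covers the Kangxi Radicals block, most forms of the CJK Radicals Supplement have no compatibility mapping. The helper name compat_equivalent is hypothetical.

```python
import unicodedata


def compat_equivalent(radical_form):
    """Return the NFKC compatibility mapping of a radical form, or None.

    None means Unicode defines no compatibility mapping for the form, so
    a dedicated mapping table would be needed instead.
    """
    folded = unicodedata.normalize('NFKC', radical_form)
    return folded if folded != radical_form else None


print(compat_equivalent('\u2f94'))  # ⾔ maps to 言 (U+8A00)
print(compat_equivalent('\u2ec8'))  # ⻈ has no compatibility mapping
```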
|
getCharacterEquivalentRadicalForms(self,
equivalentForm,
locale)
| source code
|
Gets Unicode radical forms or Unicode radical variants
for the given equivalent character.
The mapping mostly follows the Han Radical folding specified in
the Draft Unicode Technical Report #30 Character Foldings
under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding.
Several radical forms can be mapped to the same equivalent character and
thus this method in general returns several values.
- Parameters:
equivalentForm (str) - Equivalent character of Unicode radical form or Unicode
radical variant
locale (str) - character locale (one out of TCJKV)
- Returns: list of str
- list of matching radical forms and radical variants
- Raises:
ValueError - if invalid character locale or equivalent character is
specified
|
isBinaryIDSOperator(cls,
char)
Class Method
| source code
|
Checks if given character is a binary IDS operator.
- Parameters:
char (str) - Chinese character
- Returns: bool
True if binary IDS operator,
False otherwise
|
isTrinaryIDSOperator(cls,
char)
Class Method
| source code
|
Checks if given character is a trinary IDS operator.
- Parameters:
char (str) - Chinese character
- Returns: bool
True if trinary IDS operator,
False otherwise
|
Checks if given character is an IDS operator.
- Parameters:
char (str) - Chinese character
- Returns: bool
True if IDS operator, False
otherwise
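The three operator checks can be sketched as plain membership tests. The binary operators below are the ones listed under IDS_BINARY; the two trinary operators (U+2FF2, U+2FF3) are taken from the Unicode Ideographic Description Characters block, and the function names are illustrative only.

```python
# Binary operators take two operands, trinary operators take three.
IDS_BINARY = ['⿰', '⿱', '⿴', '⿵', '⿶', '⿷', '⿸', '⿹', '⿺', '⿻']
IDS_TRINARY = ['⿲', '⿳']  # assumption: U+2FF2 and U+2FF3


def is_binary_ids_operator(char):
    return char in IDS_BINARY


def is_trinary_ids_operator(char):
    return char in IDS_TRINARY


def is_ids_operator(char):
    # An IDS operator is either binary or trinary.
    return is_binary_ids_operator(char) or is_trinary_ids_operator(char)


print(is_binary_ids_operator('⿱'))
print(is_trinary_ids_operator('⿲'))
print(is_ids_operator('好'))
```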
|
getCharactersForComponents(self,
componentList,
locale,
includeEquivalentRadicalForms=True,
resultIncludeRadicalForms=False)
| source code
|
Gets all characters that contain the given components.
If option includeEquivalentRadicalForms is set, all
equivalent forms will be searched for when a Kangxi radical is given.
- Parameters:
componentList (list of str) - list of character components
locale (str) - character locale (one out of TCJKV)
includeEquivalentRadicalForms (bool) - if True then characters in the given component list
are interpreted as representatives for their radical and all
radical forms are included in the search. E.g. 肉 will include ⺼
as a possible component.
resultIncludeRadicalForms (bool) - if True the result will include Unicode radical
forms and Unicode radical variants
- Returns: list of tuple
- list of pairs of matching characters and their Z-variants
- Raises:
ValueError - if an invalid character locale is specified
To Do (Lang):
By default we get the equivalent character for a radical form. In some
cases these equivalent characters will be only abstractly related to
the given radical form (e.g. being the main radical form), so that the
result set will be too big and doesn't reflect the original query. Set
up a table including only strict visual relations between radical forms
and equivalent characters. Alternatively restrict decomposition data to
only include radical forms if appropriate, so there would be no need
for conversion.
To Do (Data):
-
Adopt locale-dependent Z-variants for parent characters (e.g. 鬼 in 隗 愧
嵬).
-
Use radical forms and radical variant forms instead of equivalent
characters in decomposition data. Mapping loses information.
To Do (Impl):
Table of same character glyphs, including special radical forms (e.g. 言
and 訁).
|
getCharactersForEquivalentComponents(self,
componentConstruct,
locale=None,
resultIncludeRadicalForms=False)
| source code
|
Gets all characters that contain at least one component per list
entry, sorted by stroke count if available.
This is the general form of getCharactersForComponents(): each list entry
may hold a set of alternative characters, of which at least one must be a
component of the matching character.
If a character locale is specified only characters will be
returned for which the locale's default Z-variant's decomposition
will apply to the given components. Otherwise all Z-variants will be
considered.
- Parameters:
componentConstruct (list of list of str) - list of character components given as single characters or, for
alternative characters, given as a list
resultIncludeRadicalForms (bool) - if True the result will include Unicode radical
forms and Unicode radical variants
locale (str) - character locale (one out of TCJKV)
- Returns: list of tuple
- list of pairs of matching characters and their Z-variants
- Raises:
ValueError - if an invalid character locale is specified
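The "at least one component per list entry" rule can be illustrated on a small in-memory table. The sketch below is not the database query used by the class: the COMPONENTS table is hand-written toy data, and Z-variants, locale handling and stroke count ordering are left out.

```python
# Hand-written toy data: character -> set of all its components.
COMPONENTS = {
    '好': {'女', '子'},
    '她': {'女', '也'},
    '孙': {'子', '小'},
}


def characters_for_equivalent_components(component_construct):
    """Return characters with at least one component from every entry.

    Each entry is either a single character or a list of alternatives.
    """
    matches = []
    for char, components in COMPONENTS.items():
        if all(any(alternative in components for alternative in entry)
               for entry in component_construct):
            matches.append(char)
    return matches


# 女 is required; the second component may be either 子 or 也.
print(characters_for_equivalent_components(['女', ['子', '也']]))
```

With only one alternative per entry this reduces to the behaviour of getCharactersForComponents().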
|
getDecompositionEntries(self,
char,
locale=None,
zVariant=0)
| source code
|
Gets the decomposition of the given character into components from the
database. The resulting decomposition is only the first layer in a tree
of possible paths along the decomposition as the components can be
further subdivided.
There can be several decompositions for one character so a list of
decomposition is returned.
Each entry in the result list consists of a list of characters (each with
its Z-variant) and IDS operators.
- Parameters:
char (str) - Chinese character that is to be decomposed into components
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: list
- list of first layer decompositions
- Raises:
ValueError - if an invalid character locale is specified
|
Gets the full decomposition table from the database.
- Returns: dict
- dictionary with key pair character, Z-variant and the first layer
decomposition as value
|
_getDecompositionFromString(self,
decomposition)
| source code
|
Gets a tuple representation with character/Z-variant of the given
character's decomposition into components.
Example: Entry ⿱尚[1]儿 will be returned as [u'⿱',
(u'尚', 1), (u'儿', 0)] .
- Parameters:
decomposition (str) - character decomposition with IDS operator, components and
optional Z-variant index
- Returns: list
- decomposition with character/Z-variant tuples
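The conversion in the example above can be sketched with a small tokenizer. This is an illustration under assumptions, not the private method itself: it presumes that a Z-variant index is written as digits in square brackets directly after its character and that a missing index means Z-variant 0. The name parse_decomposition is hypothetical.

```python
import re

# One token is either "character[index]" or a single character.
TOKEN = re.compile(r'(.)\[(\d+)\]|(.)', re.DOTALL)
IDS_OPERATORS = set('⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻')


def parse_decomposition(decomposition):
    """Split a decomposition string into operators and (char, Z-variant)."""
    result = []
    for match in TOKEN.finditer(decomposition):
        char, index, plain = match.groups()
        if char is not None:
            # character with an explicit Z-variant index
            result.append((char, int(index)))
        elif plain in IDS_OPERATORS:
            result.append(plain)
        else:
            # plain character, default Z-variant 0
            result.append((plain, 0))
    return result


print(parse_decomposition('⿱尚[1]儿'))
# ['⿱', ('尚', 1), ('儿', 0)]
```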
|
getDecompositionTreeList(self,
char,
locale=None,
zVariant=0)
| source code
|
Gets the decomposition of the given character into components as a
list of decomposition trees.
There can be several decompositions for one character so one tree per
decomposition is returned.
Each entry in the result list consists of a list of characters (each with
its Z-variant and a list of further decompositions) and IDS operators. If a
character can be further subdivided, its contained list is non-empty and
includes yet another list of trees for the decomposition of the
component.
- Parameters:
char (str) - Chinese character that is to be decomposed into components
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
- Returns: list
- list of decomposition trees
- Raises:
ValueError - if an invalid character locale is specified
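A tree of this shape can be walked recursively. The sketch below assumes, based on the description above, that each character node is a tuple of character, Z-variant and list of sub-trees, and that operators are plain strings. The sample tree is hand-written for illustration and not read from the database.

```python
def leaf_components(tree):
    """Collect the characters of a tree that are not subdivided further."""
    leaves = []
    for node in tree:
        if isinstance(node, str):
            continue  # IDS operator, carries no component
        char, z_variant, subtrees = node
        if subtrees:
            # follow only the first alternative decomposition
            leaves.extend(leaf_components(subtrees[0]))
        else:
            leaves.append(char)
    return leaves


# Hand-written sample: 想 as ⿱相心, with 相 further split into ⿰木目.
tree = ['⿱',
        ('相', 0, [['⿰', ('木', 0, []), ('目', 0, [])]]),
        ('心', 0, [])]

print(leaf_components(tree))
# ['木', '目', '心']
```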
|
isComponentInCharacter(self,
component,
char,
locale=None,
zVariant=0,
componentZVariant=None)
| source code
|
Checks if the given character contains the given component.
- Parameters:
component (str) - character questioned to be a component
char (str) - Chinese character
locale (str) - character locale (one out of TCJKV). Giving the locale
will apply the default Z-variant defined by getLocaleDefaultZVariant(). The Z-variant
supplied with option zVariant will be ignored.
zVariant (int) - Z-variant of the first character
componentZVariant (int) - Z-variant of the component; if left out every Z-variant matches
for that character.
- Returns: bool
True if component is a component of the
given character, False otherwise
- Raises:
ValueError - if an invalid character locale is specified
To Do (Impl):
Implement means to check if the component is really not found, or if
our data is just insufficient.
|
CHARARACTER_READING_MAPPING
A list of readings for which a character mapping exists including the
database's table name and the reading dialect parameters.
On conversion the first matching reading will be selected, so
supplying several equivalent readings has limited use.
- Value:
{'Hangul': ('CharacterHangul', {}),
 'Jyutping': ('CharacterJyutping', {'case': 'lower'}),
 'Pinyin': ('CharacterPinyin',
            {'case': 'lower', 'toneMarkType': 'Numbers'})}
|
|
IDS_BINARY
A list of binary IDS operators used to describe character
decompositions.
- Value:
[u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', u'⿺', u'⿻']
|
|