cjklib.characterlookup

#!/usr/bin/python
# -*- coding: utf-8 -*-
# This file is part of cjklib.
#
# cjklib is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# cjklib is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with cjklib. If not, see <http://www.gnu.org/licenses/>.

"""
Provides the central Chinese character based functions.
"""

# import math
from sqlalchemy import select, union
from sqlalchemy.sql import and_, or_, not_

from cjklib import reading
from cjklib import exception
from cjklib import dbconnector

class CharacterLookup:

    u"""
    CharacterLookup provides access to lookup methods related to Han characters.

    The real system of CharacterLookup lies in the database beneath where all
    relevant data is stored. So for nearly all methods this class needs access
    to a database. Thus on initialisation of the object a connection to a
    database is established, the logic for this provided by the
    L{DatabaseConnector}.

    See the L{DatabaseConnector} for supported database systems.

    CharacterLookup will try to read the config file from either /etc or the
    user's home folder. If none is present it will try to open a SQLite database
    stored as C{db} in the same folder by default. You can override this
    behaviour by specifying additional parameters on creation of the object.

    Examples
    ========
    The following examples should give a quick view into how to use this
    package.
        - Create the CharacterLookup object with default settings
            (read from cjklib.conf or 'cjklib.db' in same directory as default):

            >>> from cjklib import characterlookup
            >>> cjk = characterlookup.CharacterLookup()

        - Get a list of characters that are pronounced "국" in Korean:

            >>> cjk.getCharactersForReading(u'국', 'Hangul')
            [u'匊', u'國', u'局', u'掬', u'菊', u'跼', u'鞠', u'鞫', u'麯', u'麴']

        - Check if a character is included in another character as a component:

            >>> cjk.isComponentInCharacter(u'女', u'好')
            True

        - Get all Kangxi radical variants for Radical 184 (⾷) under the
            traditional locale:

            >>> cjk.getKangxiRadicalVariantForms(184, 'T')
            [u'\u2ede', u'\u2edf']

    X{Character locale}
    ===================
    During the development of characters in the different cultures, character
    appearances changed over time to such an extent that the handling of
    radicals, character components and strokes needs to be distinguished,
    depending on the locale.

    To deal with this circumstance I{CharacterLookup} works with a character
    locale.
    Most of the methods of this class ask for a locale to be specified.
    In these cases the output of the method depends on the specified locale.

    For example in the traditional locale 这 has 8 strokes, but in
    simplified Chinese it has only 7, as the radical ⻌ has different stroke
    counts, depending on the locale.

    X{Z-variant}s
    =============
    One feature of Chinese characters is the glyph form describing the visual
    representation. This feature doesn't need to be unique and so many
    characters can be found in different writing variants e.g. character 福
    (English: luck) which has numerous forms.

    The Unicode Consortium does not include the same characters of different
    actual shape in the Unicode standard (called I{Z-variant}s), except a few
    "double" entries which are included to maintain backward compatibility.
    In fact a code point represents an abstract character not defining any
    visual representation. Thus a distinct appearance description including
    strokes and stroke order cannot be simply assigned to a code point but one
    needs to deal with the notion of I{Z-variants} representing distinct glyphs
    to which a visual description can be applied.

    The name Z-variant is derived from the three-dimensional model representing
    the space of characters relative to three axes, being the X axis
    representing the semantic space, the Y axis representing the abstract shape
    space and finally the Z axis for typeface differences (see "Principles of
    Han Unification" in: The Unicode Standard 5.0, chapter 12). Character
    presentations only differing in the Z dimension are generally unified.

    cjklib tries to offer a simple approach to handle different Z-variants. As
    character components, strokes and the stroke order depend on this variant,
    methods dealing with this kind will ask for a I{Z-variant} value to be
    specified.
    In these cases the output of the method depends on the specified
    variant.

    Z-variants and character locales
    --------------------------------
    Deviant stroke count, stroke order or decomposition into character
    components for different I{character locales} is implemented using different
    I{Z-variant}s. For the example given above the entry 这 with 8 strokes is
    given as one Z-variant and the form with 7 strokes is given as another
    Z-variant.

    In most cases one might only be interested in a single visual appearance,
    the "standard" one. This visual appearance would be the one generally used
    in the specific locale.

    Instead of specifying a certain Z-variant most functions will allow for
    passing of a character locale. Giving the locale will apply the default
    Z-variant given by the mapping defined in the database which can be obtained
    by calling L{getLocaleDefaultZVariant()}.

    More complex relations, such as which of several Z-variants for a given
    character are used in a given locale, are not covered.

    Kangxi radical functions
    ========================
    Using the Unihan database queries about the Kangxi radical of characters can
    be made.
    It is possible to get a Kangxi radical for a character or lookup all
    characters for a given radical.

    Unicode has extra code points for radical forms (e.g. ⾔), here called
    X{Unicode radical form}s, and radical variant forms (e.g. ⻈), here called
    X{Unicode radical variant}s. These characters should be used when explicitly
    referring to their function as radicals.
    For most of the radicals and variants there exist complementary character
    forms which have the same appearance (e.g. 言 and 讠) and which shall be
    called X{equivalent character}s here.

    Mapping from one to another side is not trivially possible, as some forms
    only exist as radical forms, some only as character forms, but from their
    meaning used in the radical context (called X{isolated radical character}s
    here, e.g. 訁 for Kangxi radical 149).

    Additionally a one to one mapping can't be guaranteed, as some forms have
    two or more equivalent forms in another domain, and mapping is highly
    dependent on the locale.

    CharacterLookup provides methods for dealing with these different kinds of
    characters and the mapping between them.

    X{Character decomposition}
    ==========================
    Many characters can be decomposed into two or more components, that again
    are Chinese characters. This fact can be used in many ways, including
    character lookup, finding patterns for font design or studying characters.
    Even the stroke order and stroke count can be deduced from the stroke
    information of the character's components.

    Character decomposition is highly dependent on the appearance of the
    character, so both I{Z-variant} and I{character locale} need to be clear
    when looking at a decomposition into components.

    More points render this task more complex: decomposition into one set of
    components is not distinct, some characters can be broken down into
    different sets. Furthermore sometimes one component can be given, but the
    other component will not be encoded as a character in its own right.

    These components again might be characters that contain further components
    (again not distinct ones), thus a complex decomposition in several steps is
    possible.

    The basis for the character decomposition lies in the database, where all
    decompositions are stored, using X{Ideographic Description Sequence}s
    (I{IDS}). These sequences consist of Unicode X{IDS operator}s and characters
    to describe the structure of the character.
    There are
    X{binary IDS operator}s to describe decomposition into two components (e.g.
    ⿰ for one component left, one right as in 好: ⿰女子) or
    X{trinary IDS operator}s for decomposition into three components (e.g. ⿲
    for three components from left to right as in 辨: ⿲⾟刂⾟). Using
    I{IDS operator}s it is possible to give basic structural information, which
    in many cases is enough for example to derive an overall stroke order from
    two single sets of stroke orders. Furthermore it is possible to look for
    redundant information in different entries, which helps to keep the
    definition data clean.

    This class provides methods for retrieving the basic partition entries,
    lookup of characters by components and decomposing as a tree from the
    character as a root down to the X{minimal components} as leaf nodes.

    TODO: Policy about what to classify as partition.

    Strokes
    =======
    Chinese characters consist of different strokes as basic parts. These
    strokes are written in a mostly distinct order called the X{stroke order}
    and have a distinct X{stroke count}.

    The I{stroke order} in the writing of Chinese characters is important e.g.
    for calligraphy or students learning new characters and is normally fixed as
    there is only one possible stroke order for each character. Furthermore
    there is a fixed set of possible strokes and these strokes carry names.

    As with character decomposition the I{stroke order} and I{stroke count} are
    highly dependent on the appearance of the character, so both I{Z-variant}
    and I{character locale} need to be known.

    Furthermore the order of strokes can be useful for lookup of characters,
    and so CharacterLookup provides different methods for getting the stroke
    count, stroke order, lookup of stroke names and lookup of characters by
    stroke types and stroke order.

    Most methods work with an abbreviation of stroke names using the first
    letters of each syllable of the Chinese name in Pinyin.

    The I{stroke order} is not always quite clear and even academics fight about
    which order should be considered the correct one, a discussion that
    shouldn't be taken lightly. This circumstance should be considered
    when working with I{stroke order}s.

    TODO: About plans of cjklib how to support different views on the stroke
    order

    TODO: About the different classifications of strokes

    Readings
    ========
    See module L{reading} for a detailed description.

    @see:
        - Radicals:
            U{http://en.wikipedia.org/wiki/Radical_(Chinese_character)}
        - Z-variants:
            U{http://www.unicode.org/reports/tr38/tr38-5.html#N10211}

    @todo Fix: Incorporate stroke lookup (bigram) techniques
    @todo Fix: How to handle character forms (either decomposition or stroke
        order), that can only be found as a component in other characters? We
        already mark them by flagging it with an 'S'.
    @todo Impl: Think about applying locale at object creation time and not
        passing it on every method call. Would make the class easier to use.
    @todo Impl: Create a method for specifying which character range is of
        interest for the return values of methods. Narrowing the return results
        is a further way to locale dependent responses. E.g. cjknife could take
        this into account when only displaying characters that can be displayed
        with the current locale (BIG5, GBK...).
    @todo Lang: Add option to component decomposition methods to stop on Kangxi
        radical forms without breaking further down beyond those.
    """

    CHARARACTER_READING_MAPPING = {'Hangul': ('CharacterHangul', {}),
        'Jyutping': ('CharacterJyutping', {'case': 'lower'}),
        'Pinyin': ('CharacterPinyin', {'toneMarkType': 'Numbers',
            'case': 'lower'})
        }
    """
    A list of readings for which a character mapping exists including the
    database's table name and the reading dialect parameters.

    On conversion the first matching reading will be selected, so supplying
    several equivalent readings has limited use.
    """
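The class docstring describes Ideographic Description Sequences built from binary and trinary IDS operators. The following is a minimal sketch of how such a sequence could be read into a tree. It is not cjklib code: `parse_ids` and the two operator sets are hypothetical names, and the sketch targets a current Python 3 interpreter rather than the Python 2 this module was written for. It relies only on the Unicode fact that the operators occupy U+2FF0 to U+2FFB, with U+2FF2 and U+2FF3 taking three operands.

```python
# Hypothetical helper, not part of cjklib: parse an Ideographic Description
# Sequence (IDS) into a nested tuple tree.

# IDS operators U+2FF0..U+2FFB; U+2FF2 and U+2FF3 take three operands,
# all others in this range take two.
TRINARY_OPERATORS = set(u'\u2ff2\u2ff3')
BINARY_OPERATORS = (set(chr(cp) for cp in range(0x2FF0, 0x2FFC))
                    - TRINARY_OPERATORS)


def parse_ids(sequence):
    """Return a plain character or a tuple (operator, child, child[, child])."""
    def parse(index):
        char = sequence[index]
        if char in BINARY_OPERATORS:
            arity = 2
        elif char in TRINARY_OPERATORS:
            arity = 3
        else:
            # a component character is a leaf of the tree
            return char, index + 1
        children = []
        index += 1
        for _ in range(arity):
            child, index = parse(index)
            children.append(child)
        return (char,) + tuple(children), index

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError("trailing characters after a complete IDS")
    return tree
```

Calling `parse_ids(u'⿰女子')` yields the operator followed by its two components, and nested sequences produce nested tuples, which is one way to picture the step-by-step decomposition the docstring mentions.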

    def __init__(self, databaseUrl=None, dbConnectInst=None):
        """
        Initialises the CharacterLookup.

        If no parameters are given default values are assumed for the connection
        to the database. The database connection parameters can be given in
        databaseUrl, or an instance of L{DatabaseConnector} can be passed in
        dbConnectInst, the latter one being preferred if both are specified.

        @type databaseUrl: str
        @param databaseUrl: database connection setting in the format
            C{driver://user:pass@host/database}.
        @type dbConnectInst: instance
        @param dbConnectInst: instance of a L{DatabaseConnector}
        """
        # get connector to database
        if dbConnectInst:
            self.db = dbConnectInst
        else:
            self.db = dbconnector.DatabaseConnector.getDBConnector(databaseUrl)

        self.readingFactory = None

        # test for existing tables that can be used to speed up look up
        self.hasComponentLookup = self.db.engine.has_table('ComponentLookup')
        self.hasStrokeCount = self.db.engine.has_table('StrokeCount')

    def _getReadingFactory(self):
        """
        Gets the L{ReadingFactory} instance.

        @rtype: instance
        @return: a L{ReadingFactory} instance.
        """
        # get reading factory
        if not self.readingFactory:
            self.readingFactory = reading.ReadingFactory(dbConnectInst=self.db)
        return self.readingFactory

    #{ Character reading lookup

    def getCharactersForReading(self, readingString, readingN, **options):
        """
        Gets all known characters for the given reading.

        @type readingString: str
        @param readingString: reading string for lookup
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the reading input
        @rtype: list of str
        @return: list of characters for the given reading
        @raise UnsupportedError: if no mapping between characters and target
            reading exists.
        @raise ConversionError: if conversion from the internal source reading
            to the given target reading fails.
        """
        # check for available mapping from Chinese characters to a compatible
        # reading
        compatReading = self._getCompatibleCharacterReading(readingN)
        tableName, compatOptions \
            = self.CHARARACTER_READING_MAPPING[compatReading]

        # translate reading form to target reading, for readingN=compatReading
        # get standard form if supported
        readingFactory = self._getReadingFactory()
        if readingN != compatReading \
            or readingFactory.isReadingConversionSupported(readingN, readingN):
            readingString = readingFactory.convert(readingString, readingN,
                compatReading, sourceOptions=options,
                targetOptions=compatOptions)

        # lookup characters
        table = self.db.tables[tableName]
        return self.db.selectScalars(select([table.c.ChineseCharacter],
            table.c.Reading==readingString).order_by(table.c.ChineseCharacter))
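The final lookup step of getCharactersForReading() is a plain select on the reading table, ordered by character. As a rough illustration, the sketch below reproduces that query with the standard library's `sqlite3` module instead of SQLAlchemy. The table layout mirrors the column names used above, but the rows are toy data and `characters_for_reading` is a hypothetical helper; the reading conversion step is left out.

```python
import sqlite3

# Toy stand-in for the 'CharacterHangul' table; the rows are illustrative only.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE CharacterHangul (ChineseCharacter TEXT, Reading TEXT)')
connection.executemany(
    'INSERT INTO CharacterHangul VALUES (?, ?)',
    [(u'菊', u'국'), (u'國', u'국'), (u'局', u'국'), (u'女', u'녀')])


def characters_for_reading(reading_string):
    """Return all characters stored for the reading, ordered by character."""
    cursor = connection.execute(
        'SELECT ChineseCharacter FROM CharacterHangul '
        'WHERE Reading = ? ORDER BY ChineseCharacter', (reading_string,))
    return [row[0] for row in cursor]
```

Ordering by the character column keeps the result deterministic, which is presumably why the method above adds `order_by()` as well.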


    def getReadingForCharacter(self, char, readingN, **options):
        """
        Gets all known readings for the character in the given target reading.

        @type char: str
        @param char: Chinese character for lookup
        @type readingN: str
        @param readingN: name of target reading
        @param options: additional options for handling the reading output
        @rtype: list of str
        @return: list of readings for the given character
        @raise UnsupportedError: if no mapping between characters and target
            reading exists.
        @raise ConversionError: if conversion from the internal source reading
            to the given target reading fails.
        """
        # check for available mapping from Chinese characters to a compatible
        # reading
        compatReading = self._getCompatibleCharacterReading(readingN, False)
        tableName, compatOptions \
            = self.CHARARACTER_READING_MAPPING[compatReading]
        readingFactory = self._getReadingFactory()

        # lookup readings
        table = self.db.tables[tableName]
        readings = self.db.selectScalars(select([table.c.Reading],
            table.c.ChineseCharacter==char).order_by(table.c.Reading))

        # check if we need to convert reading
        if compatReading != readingN \
            or readingFactory.isReadingConversionSupported(readingN, readingN):
            # translate reading forms to target reading, for
            # readingN=characterReading get standard form if supported
            transReadings = []
            for readingString in readings:
                readingString = readingFactory.convert(readingString,
                    compatReading, readingN, sourceOptions=compatOptions,
                    targetOptions=options)
                if readingString not in transReadings:
                    transReadings.append(readingString)
            return transReadings
        else:
            return readings
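The conversion loop in getReadingForCharacter() drops duplicates that appear only after conversion, while keeping the order of first occurrence. A minimal sketch of that pattern follows; `convert_unique` is a hypothetical name and the `convert` argument stands in for the reading factory's conversion call.

```python
def convert_unique(readings, convert):
    """Convert each reading and drop duplicates, keeping first-seen order."""
    converted = []
    for reading_string in readings:
        target = convert(reading_string)
        # two source readings may collapse into the same target form
        if target not in converted:
            converted.append(target)
    return converted
```

With a lower-casing function as the stand-in conversion, `convert_unique(['Ni3', 'ni3', 'hao3'], str.lower)` keeps one `'ni3'` followed by `'hao3'`. The list membership test is quadratic in the number of readings, which seems acceptable here since a single character rarely has more than a handful.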


    def _getCompatibleCharacterReading(self, readingN, toCharReading=True):
        """
        Gets a reading where a mapping from or to Chinese characters is
        supported and that is compatible (a conversion is supported) to the
        given reading.

        @type readingN: str
        @param readingN: name of reading
        @type toCharReading: bool
        @param toCharReading: C{True} if the conversion goes from the given
            reading to the character reading, C{False} if it goes from the
            character reading to the given reading
        @rtype: str
        @return: a reading that is compatible to the given one and where
            character lookup is supported
        @raise UnsupportedError: if no mapping between characters and target
            reading exists.
        """
        # iterate all available char-reading mappings to find a compatible
        # reading
        for characterReading in self.CHARARACTER_READING_MAPPING.keys():
            if readingN == characterReading:
                return characterReading
            elif toCharReading:
                if self._getReadingFactory().isReadingConversionSupported(
                    readingN, characterReading):
                    return characterReading
            elif not toCharReading:
                if self._getReadingFactory().isReadingConversionSupported(
                    characterReading, readingN):
                    return characterReading
        raise exception.UnsupportedError("reading '" + readingN \
            + "' not supported for character lookup")
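The direction flag in _getCompatibleCharacterReading() decides which way the conversion check runs. The sketch below isolates that selection logic. It is only an illustration: the set of supported conversions is invented for the example, `compatible_character_reading` is a hypothetical name, and `ValueError` stands in for cjklib's `UnsupportedError`.

```python
# Readings with a character table (taken from CHARARACTER_READING_MAPPING).
CHARACTER_READINGS = ['Hangul', 'Jyutping', 'Pinyin']

# Hypothetical (source, target) conversion pairs, for illustration only.
SUPPORTED_CONVERSIONS = set([
    ('WadeGiles', 'Pinyin'),
    ('Pinyin', 'WadeGiles'),
    ('CantoneseYale', 'Jyutping'),
])


def compatible_character_reading(reading_name, to_char_reading=True):
    """Return the first character reading compatible with reading_name."""
    for character_reading in CHARACTER_READINGS:
        if reading_name == character_reading:
            return character_reading
        if to_char_reading:
            # user input is converted towards the character reading
            pair = (reading_name, character_reading)
        else:
            # database output is converted towards the requested reading
            pair = (character_reading, reading_name)
        if pair in SUPPORTED_CONVERSIONS:
            return character_reading
    raise ValueError("reading '" + reading_name
                     + "' not supported for character lookup")
```

In this toy setup `'CantoneseYale'` resolves to `'Jyutping'` for input, but asking for output in `'CantoneseYale'` fails because only one direction is listed. That asymmetry is the reason the method takes a direction flag at all.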

    #}

    def _locale(self, locale):
        """
        Gets the locale search value for a database lookup on databases with
        I{character locale} dependent content.

        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: str
        @return: search locale used for SQL select
        @raise ValueError: if invalid I{character locale} specified
        @todo Fix: This probably requires a full table scan
        """
        locale = locale.upper()
        if locale not in set('TCJKV'):
            raise ValueError("'" + locale + "' is not a valid character locale")
        return '%' + locale + '%'
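The pattern returned by _locale() only makes sense if the locale column holds several locale letters per row and is matched with SQL LIKE. The following sketch shows that idea with `sqlite3`. The schema follows the column names used in this module, but the rows, the Z-variant indices and the helper names are assumptions made for the example.

```python
import sqlite3


def locale_pattern(locale):
    """Validate a one-letter character locale and wrap it for SQL LIKE."""
    locale = locale.upper()
    if locale not in set('TCJKV'):
        raise ValueError("'" + locale + "' is not a valid character locale")
    return '%' + locale + '%'


# Hypothetical rows: one glyph shared by three locales, another used by one.
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE LocaleCharacterVariant '
    '(ChineseCharacter TEXT, Locale TEXT, ZVariant INTEGER)')
connection.executemany(
    'INSERT INTO LocaleCharacterVariant VALUES (?, ?, ?)',
    [(u'这', 'TJK', 0), (u'这', 'C', 1)])


def z_variants_for_locale(char, locale):
    """Return the Z-variants whose locale column contains the given locale."""
    cursor = connection.execute(
        'SELECT ZVariant FROM LocaleCharacterVariant '
        'WHERE ChineseCharacter = ? AND Locale LIKE ? ORDER BY ZVariant',
        (char, locale_pattern(locale)))
    return [row[0] for row in cursor]
```

Because the pattern starts with a wildcard, an ordinary index on the locale column cannot be used for the match, which is consistent with the full table scan concern noted in the docstring's todo.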

    #{ Character variant lookup

    def getCharacterVariants(self, char, variantType):
        """
        Gets the variant forms of the given type for the character.

        The type can be one out of:
            - C, I{compatible character} form (if character was added to Unicode
                to maintain compatibility and round-trip convertibility)
            - M, I{semantic variant} forms, which are often used interchangeably
                instead of the character.
            - P, I{specialised semantic variant} forms, which are often used
                interchangeably instead of the character but limited to certain
                contexts.
            - Z, I{Z-variant} forms, which only differ in typeface (and would
                have been unified if not to maintain round trip convertibility)
            - S, I{simplified Chinese character} forms, originating from the
                character simplification process of the PR China.
            - T, I{traditional character} forms for a
                I{simplified Chinese character}.

        Variants depend on the locale which is not taken into account here. Thus
        some of the returned characters might only be variants under some
        locales.

        @type char: str
        @param char: Chinese character
        @type variantType: str
        @param variantType: type of variant(s) to be returned
        @rtype: list of str
        @return: list of character variant(s) of given type
        @raise ValueError: if an invalid variant type is specified

        @todo Docu: Write about different kinds of variants
        @todo Impl: Give a source on variant information as information can
            contradict itself
            (U{http://www.unicode.org/reports/tr38/tr38-5.html#N10211}). See
            呆 (U+5446) which has one form each for semantic and specialised
            semantic, each derived from a different source. Change also in
            L{getAllCharacterVariants()}.
        @todo Lang: What is the difference between Z-variants and
            compatible variants? Some links between two characters are
            bidirectional, some not. Is there any rule?
        """
        variantType = variantType.upper()
        if variantType not in set('CMPZST'):
            raise ValueError("'" + variantType \
                + "' is not a valid variant type")

        table = self.db.tables['CharacterVariant']
        return self.db.selectScalars(select([table.c.Variant],
            and_(table.c.ChineseCharacter == char,
                table.c.Type == variantType)).order_by(table.c.Variant))

    def getAllCharacterVariants(self, char):
        """
        Gets all variant forms regardless of the type for the character.

        A list of tuples is returned, including the character and its variant
        type. See L{getCharacterVariants()} for variant types.

        Variants depend on the locale which is not taken into account here. Thus
        some of the returned characters might only be variants under some
        locales.

        @type char: str
        @param char: Chinese character
        @rtype: list of tuple
        @return: list of character variant(s) with their type
        """
        table = self.db.tables['CharacterVariant']
        return self.db.selectRows(select([table.c.Variant, table.c.Type],
            table.c.ChineseCharacter == char).order_by(table.c.Variant))

    def getLocaleDefaultZVariant(self, char, locale):
        """
        Gets the default Z-variant for the given character under the given
        locale.

        The Z-variant returned is an index to the internal database of different
        character glyphs and represents the most common glyph used under the
        given locale.

        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: int
        @return: Z-variant
        @raise NoInformationError: if no Z-variant information is available
        @raise ValueError: if invalid I{character locale} specified
        """
        table = self.db.tables['LocaleCharacterVariant']
        zVariant = self.db.selectScalar(select([table.c.ZVariant],
            and_(table.c.ChineseCharacter == char,
                table.c.Locale.like(self._locale(locale))))\
            .order_by(table.c.ZVariant))

        if zVariant is not None:
            return zVariant
        else:
            # if no entry given, assume default
            return self.getCharacterZVariants(char)[0]
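getLocaleDefaultZVariant() first looks for a locale-specific entry and otherwise falls back to the lowest known Z-variant. The sketch below models that fallback with two dictionaries in place of the database tables. All data and names are hypothetical, and `LookupError` stands in for cjklib's `NoInformationError`.

```python
# Hypothetical stand-ins for the LocaleCharacterVariant and ZVariants tables.
LOCALE_DEFAULTS = {(u'这', 'C'): 1}
Z_VARIANTS = {u'这': [0, 1], u'好': [0]}


def locale_default_z_variant(char, locale):
    """Return the locale's default Z-variant, else the first known one."""
    z_variant = LOCALE_DEFAULTS.get((char, locale.upper()))
    # compare against None explicitly: 0 is a valid Z-variant index
    if z_variant is not None:
        return z_variant
    variants = Z_VARIANTS.get(char)
    if not variants:
        raise LookupError(
            "No Z-variant information available for '" + char + "'")
    return variants[0]
```

The explicit comparison with `None` matters in this kind of lookup, since a plain truth test would treat index 0 as a missing entry.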


    def getCharacterZVariants(self, char):
        """
        Gets a list of character Z-variant indices (glyphs) supported by the
        database.

        A Z-variant index specifies a particular character glyph which is needed
        by several glyph-dependent methods instead of the abstract character
        defined by Unicode.

        @type char: str
        @param char: Chinese character
        @rtype: list of int
        @return: list of supported Z-variants
        @raise NoInformationError: if no Z-variant information is available
        """
        # return all known variant indices, order to be deterministic
        table = self.db.tables['ZVariants']
        result = self.db.selectScalars(select([table.c.ZVariant],
            table.c.ChineseCharacter == char).order_by(table.c.ZVariant))
        if not result:
            raise exception.NoInformationError(
                "No Z-variant information available for '" + char + "'")

        return result

    #}
    #{ Character stroke functions

    def getStrokeCount(self, char, locale=None, zVariant=0):
        """
        Gets the stroke count for the given character.

        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the character
        @rtype: int
        @return: stroke count of given character
        @raise NoInformationError: if no stroke count information available
        @raise ValueError: if an invalid I{character locale} is specified
        @attention: The quality of the returned data depends on the sources used
            when compiling the database. Unihan itself only gives very general
            stroke order information without being bound to a specific glyph.
        """
        if locale is not None:
            zVariant = self.getLocaleDefaultZVariant(char, locale)

        # if table exists use it
        if self.hasStrokeCount:
            table = self.db.tables['StrokeCount']
            result = self.db.selectScalar(select([table.c.StrokeCount],
                and_(table.c.ChineseCharacter == char,
                    table.c.ZVariant == zVariant)))
            if not result:
                raise exception.NoInformationError(
                    "Character has no stroke count information")
            return result
        else:
            # use incomplete way with using the stroke order (there might be
            # less stroke order entries than stroke count entries)
            try:
                so = self.getStrokeOrder(char, zVariant=zVariant)
                strokeList = so.replace(' ', '-').split('-')
                return len(strokeList)
            except exception.NoInformationError:
                raise exception.NoInformationError(
                    "Character has no stroke count information")
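When no StrokeCount table exists, getStrokeCount() derives the count from the stroke order string, in which stroke abbreviations are separated by either a space or a hyphen. A minimal sketch of that fallback, with `stroke_count_from_order` as a hypothetical helper name:

```python
def stroke_count_from_order(stroke_order):
    """Count stroke abbreviations separated by spaces or hyphens."""
    # normalise both separators to a hyphen, then split
    stroke_list = stroke_order.replace(' ', '-').split('-')
    return len(stroke_list)
```

For the stroke order `'S-HZ-S-S-H'` this returns 5. Note that an empty string would count as one stroke with this approach, so the sketch assumes the caller only passes non-empty stroke orders.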


    def getStrokeCountDict(self):
        """
        Gets the full stroke count table from the database.

        @rtype: dict
        @return: dictionary of key pair character, Z-variant and value stroke
            count
        @attention: The quality of the returned data depends on the sources used
            when compiling the database. Unihan itself only gives very general
            stroke order information without being bound to a specific glyph.
        """
        table = self.db.tables['StrokeCount']
        result = self.db.selectRows(select(
            [table.c.ChineseCharacter, table.c.ZVariant, table.c.StrokeCount]))
        return dict([((char, zVariant), strokeCount) \
            for char, zVariant, strokeCount in result])
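The dictionary returned by getStrokeCountDict() is keyed by a (character, Z-variant) pair. The sketch below shows the same construction on a small list of rows. The stroke counts for 这 follow the example in the class docstring, while the Z-variant indices assigned to them and the helper name are assumptions.

```python
def stroke_count_dict(rows):
    """Map (character, Z-variant) pairs to stroke counts."""
    return dict(((char, z_variant), stroke_count)
                for char, z_variant, stroke_count in rows)


# Illustrative rows in the column order ChineseCharacter, ZVariant, StrokeCount.
rows = [(u'这', 0, 8), (u'这', 1, 7), (u'好', 0, 6)]
counts = stroke_count_dict(rows)
```

A caller can then look up `counts[(u'这', 1)]` without another database round trip, which is presumably the point of fetching the whole table at once.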

    #_strokeIndexLookup = {}
    #"""A dictionary containing the stroke indices for a set index length."""
    #def getStrokeIndexLookup(self, indexLength):
        #"""
        #Gets a stroke lookup table for the given index length and assigns each
        #stroke taken into account with a unique index.

        #The first M{indexLength-1} most frequent strokes are taken into account,
        #all other strokes are rejected from the index.

        #@type indexLength: int
        #@param indexLength: length of the index
        #@rtype: dict
        #@return: dictionary for performing stroke lookups
        #"""
        #if not self._strokeIndexLookup.has_key(indexLength):
            #strokeTable = self.db.selectSoleValue('StrokeFrequency',
                #'Stroke', orderBy = ['Frequency'], orderDescending=True,
                #limit = indexLength)
            #counter = 0
            #strokeIndexLookup = {}
            ## put all stroke abbreviations of stroke from strokeTable into dict
            #for stroke in strokeTable:
                #strokeIndexLookup[stroke] = counter
                #counter = counter + 1
            #self._strokeIndexLookup[indexLength] = strokeIndexLookup
        #return self._strokeIndexLookup[indexLength]

    #def _getStrokeBitField(self, strokeSet, bitLength=30):
        #"""
        #Gets the stroke bit field for the given stroke set.

        #The first M{bitLength-1} strokes are assigned to one bit position, all
        #other strokes are assigned to position M{bitLength}. Bits for strokes
        #present are set to 1 all others to 0.

        #@type strokeSet: list of str
        #@param strokeSet: set of stroke types
        #@type bitLength: int
        #@param bitLength: length of the bit field
        #@rtype: int
        #@return: bit field with bits for present strokes set to 1
        #"""
        #strokeIndexLookup = self.getStrokeIndexLookup(bitLength-1)
        ## now build bit field
        #bitField = 0
        #for strokeAbbrev in strokeSet:
            #stroke = self.getStrokeForAbbrev(strokeAbbrev)
            #if strokeIndexLookup.has_key(stroke):
                #index = strokeIndexLookup[stroke]
            #else:
                #index = bitLength
            #bitField = bitField | int(math.pow(2, index))
        #return bitField

    #_bigramIndexLookup = {}
    #"""A dictionary containing the bigram indices for a set bigram index
    #length."""
    #def _getBigramIndexLookup(self, indexLength):
        #"""
        #Gets a bigram lookup table for the given index length and assigns each
        #bigram taken into account with an unique index.

        #The first M{indexLength-1} most frequent bigrams are taken into account,
        #all other bigrams are rejected from the index.

        #@type indexLength: int
        #@param indexLength: length of the index
        #@rtype: dict
        #@return: dictionary for performing bigram lookups
        #"""
        #if not self._bigramIndexLookup.has_key(indexLength):
            #counter = 0
            #bigramIndexLookup = {}
            ## put all stroke abbreviations of stroke from strokeTable into dict
            #bigramTable = self.db.selectSoleValue('StrokeBigramFrequency',
                #'StrokeBigram', orderBy = ['Frequency'], orderDescending = True,
                #limit = indexLength)
            #for bigram in bigramTable:
                #bigramIndexLookup[bigram] = counter
                #counter = counter + 1
            #self._bigramIndexLookup[indexLength] = bigramIndexLookup
        #return self._bigramIndexLookup[indexLength]

    #def _getBigramBitField(self, strokeList, bitLength=30):
        #"""
        #Gets the bigram bit field for the given list of strokes.

        #The first M{bitLength-1} bigrams are assigned to one bit position, all
        #other bigrams are assigned to position M{bitLength}. Bits for bigrams
        #present are set to 1 all others to 0.

        #@type strokeList: list of str
        #@param strokeList: list of stroke
        #@type bitLength: int
        #@param bitLength: length of the bit field
        #@rtype: int
        #@return: bit field with bits for present bigrams set to 1
        #"""
        #bigramIndexLookup = self._getBigramIndexLookup(bitLength-1)
        ## now build bit field
        #bitField = 0
        #lastStroke = self.getStrokeForAbbrev(strokeList[0])
        #for strokeAbbrev in strokeList[1:]:
            #stroke = self.getStrokeForAbbrev(strokeAbbrev)
            #if bigramIndexLookup.has_key(lastStroke+stroke):
                #index = bigramIndexLookup[lastStroke+stroke]
            #else:
                #index = bitLength
            #bitField = bitField | int(math.pow(2, index))
        #return bitField

    #def getStrokeOrderDistance(self, strokeOrderListA, strokeOrderListB,
        #substitutionPenalty=1, insertionPenalty=1.5, deletionPenalty=1.5):
        #"""
        #Calculates the Levenshtein distance for the two given stroke orders.

        #Strokes are given as abbreviated form.

        #@type strokeOrderListA: list of str
        #@param strokeOrderListA: strokes A ordered in list form
        #@type strokeOrderListB: list of str
        #@param strokeOrderListB: strokes B ordered in list form
        #@type substitutionPenalty: float
        #@param substitutionPenalty: penalty for substituting elements
        #@type insertionPenalty: float
        #@param insertionPenalty: penalty for inserting elements
        #@type deletionPenalty: float
        #@param deletionPenalty: penalty for deleting elements
        #@rtype: float
        #@return: Levenshtein distance of both stroke orders
        #"""
        #n = len(strokeOrderListA)
        #m = len(strokeOrderListB)
        #d = [[0 for i in range(0, n+1)]
            #for j in range(0, m+1)]
        #for i in range(0, n+1):
            #d[0][i] = i
        #for j in range(0, m+1):
            #d[j][0] = j
        #for i in range(1, n+1):
            #for j in range(1, m+1):
                #if strokeOrderListA[i-1] == strokeOrderListB[j-1]:
                    #subst = 0
                #else:
                    #subst = substitutionPenalty
                #d[j][i] = min(d[j-1][i-1] + subst, # substitution
                    #d[j][i-1] + insertionPenalty, # insertion
                    #d[j-1][i] + deletionPenalty) # deletion
        #return d[m][n]

    #def getCharactersForStrokes(self, strokeList, locale):
        #"""
        #Gets all characters for the given list of stroke types.

        #Stroke types are given as abbreviated form.
797 798 #@type strokeList: list of str 799 #@param strokeList: list of stroke types 800 #@type locale: str 801 #@param locale: I{character locale} (one out of TCJKV) 802 #@rtype: list of tuple 803 #@return: list of character, Z-variant pairs having the same stroke types 804 #@raise ValueError: if an invalid I{character locale} is specified 805 #""" 806 #return self.db.select('StrokeBitField', 807 #['ChineseCharacter', 'ZVariant'], 808 # {'StrokeField': self._getStrokeBitField(strokeList), 809 #'Locale': self._locale(locale)}, 810 #orderBy = ['ChineseCharacter']) 811 812 #def getCharactersForStrokeOrder(self, strokeOrder, locale): 813 #""" 814 #Gets all characters for the given stroke order. 815 816 #Strokes are given as abbreviated form and can be separated by a 817 #space or a hyphen. 818 819 #@type strokeOrder: str 820 #@param strokeOrder: stroke order consisting of stroke abbreviations 821 #separated by a space or hyphen 822 #@type locale: str 823 #@param locale: I{character locale} (one out of TCJKV) 824 #@rtype: list of tuple 825 #@return: list of character, Z-variant pairs 826 #@raise ValueError: if an invalid I{character locale} is specified 827 #@bug: Table 'strokebitfield' doesn't seem to include entries from 828 #'strokeorder' but only from character decomposition table: 829 830 #>>> print ",".join([a for a,b in cjk.getCharactersForStrokes(['S','H','HZ'], 'C')]) 831 #亘,卓,占,古,叶,吉,吐,吕,咕,咭,哇,哩,唱,啡,坦,坫,垣,埋,旦,旧,早,旰,旱,旺,昌,玷,理,田,畦,眭,罟,罡,罩,罪,量,靼,鞋 832 833 #@bug: Character lookup from stroke order seems to be broken. 
皿 is in 834 #database but wouldn't be found:: 835 #./cjknife -o S-HZ-S-S-H 836 #田旦占 837 #""" 838 #strokeList = strokeOrder.replace(' ', '-').split('-') 839 840 #results = self.db.select(['StrokeBitField s', 'BigramBitField b', 841 #'StrokeCount c'], ['s.ChineseCharacter', 's.ZVariant'], 842 # {'s.Locale': '=b.Locale', 's.Locale': '=c.Locale', 843 #'s.ChineseCharacter': '=b.ChineseCharacter', 844 #'s.ChineseCharacter': '=c.ChineseCharacter', 845 #'s.ZVariant': '=b.ZVariant', 's.ZVariant': '=c.ZVariant', 846 #'s.Locale': self._locale(locale), 847 #'s.StrokeField': self._getStrokeBitField(strokeList), 848 #'b.BigramField': self._getBigramBitField(strokeList), 849 #'c.StrokeCount': len(strokeList)}) 850 #resultList = [] 851 ## check exact match of stroke order for all possible matches 852 #for char, zVariant in results: 853 #so = self.getStrokeOrder(char, locale, zVariant) 854 #soList = so.replace(' ', '-').split('-') 855 #if soList == strokeList: 856 #resultList.append((char, zVariant)) 857 #return resultList 858 859 #def getCharactersForStrokeOrderFuzzy(self, strokeOrder, locale, minEstimate, 860 #strokeCountVariance=2, strokeVariance=2, bigramVariance=3): 861 #""" 862 #Gets all characters for the given stroke order reaching the minimum 863 #estimate using a fuzzy search as to allowing fault-tolerant searches. 864 865 #Strokes are given as abbreviated form and can be separated by a 866 #space or a hyphen. 867 868 #The search is commited by looking for equal stroke count, equal stroke 869 #types and stroke bigrams (following pairs of strokes). Specifying 870 #C{strokeCountVariance} for allowing variance in stroke count, 871 #C{strokeVariance} for variance in stroke occurrences (for frequent ones) 872 #and C{bigramVariance} for variance in frequent stroke bigrams can adapt 873 #query to fit needs of minimum estimate. Allowing less variances will 874 #result in faster queries but lesser results, thus possibly omiting good 875 #matches. 
876 877 #An estimate on the first search results is calculated and only entries 878 #reaching over the specified minimum estimate are included in the output. 879 880 #@type strokeOrder: str 881 #@param strokeOrder: stroke order consisting of stroke abbreviations 882 #separated by a space or hyphen 883 #@type locale: str 884 #@param locale: I{character locale} (one out of TCJKV) 885 #@type minEstimate: int 886 #@param minEstimate: minimum estimate that entries in output have to 887 #reach 888 #@type strokeCountVariance: int 889 #@param strokeCountVariance: variance of stroke count 890 #@type strokeVariance: int 891 #@param strokeVariance: variance of stroke types 892 #@type bigramVariance: int 893 #@param bigramVariance: variance of stroke bigrams 894 #@rtype: list of tuple 895 #@return: list of character, Z-variant pairs 896 #@raise ValueError: if an invalid I{character locale} is specified 897 #""" 898 #strokeList = strokeOrder.replace(' ', '-').split('-') 899 #strokeCount = len(strokeList) 900 #strokeBitField = self._getStrokeBitField(strokeList) 901 #bigramBitField = self._getBigramBitField(strokeList) 902 #results = self.db.select(['StrokeBitField s', 'BigramBitField b', 903 #'StrokeCount c'], ['s.ChineseCharacter', 's.ZVariant'], 904 # {'s.Locale': '=b.Locale', 's.Locale': '=c.Locale', 905 #'s.ChineseCharacter': '=b.ChineseCharacter', 906 #'s.ChineseCharacter': '=c.ChineseCharacter', 907 #'s.ZVariant': '=b.ZVariant', 's.ZVariant': '=c.ZVariant', 908 #'s.Locale': self._locale(locale), 909 #'bit_count(s.StrokeField ^ ' + str(strokeBitField) + ')': 910 #'<=' + str(strokeVariance), 911 #'bit_count(b.BigramField ^ ' + str(bigramBitField) + ')': 912 #'<=' + str(bigramVariance), 913 #'c.StrokeCount': '>=' + str(strokeCount-strokeCountVariance), 914 #'c.StrokeCount': '<=' + str(strokeCount+strokeCountVariance)}, 915 #distinctValues=True) 916 #resultList = [] 917 #for char, zVariant in results: 918 #so = self.getStrokeOrder(char, locale, zVariant) 919 #soList = 
so.replace(' ', '-').split('-') 920 #estimate = 1.0 / \ 921 #(math.sqrt(1.0 + (8*float(self.getStrokeOrderDistance( 922 #strokeList, soList)) / strokeCount))) 923 #if estimate >= minEstimate: 924 #resultList.append((char, zVariant, estimate)) 925 #return resultList 926 927 _strokeLookup = None 928 """A dictionary containing stroke forms for stroke abbreviations."""

    def getStrokeForAbbrev(self, abbrev):

        """
        Gets the stroke form for the given abbreviated name (e.g. 'HZ').

        @type abbrev: str
        @param abbrev: abbreviated stroke name
        @rtype: str
        @return: Unicode stroke character
        @raise ValueError: if invalid stroke abbreviation is specified
        """
        # build stroke lookup table for the first time
        if not self._strokeLookup:
            self._strokeLookup = {}
            table = self.db.tables['Strokes']
            result = self.db.selectRows(select(
                [table.c.Stroke, table.c.StrokeAbbrev]))
            for stroke, strokeAbbrev in result:
                self._strokeLookup[strokeAbbrev] = stroke
        if abbrev in self._strokeLookup:
            return self._strokeLookup[abbrev]
        else:
            raise ValueError(abbrev + " is no valid stroke abbreviation")
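The lookup table above is filled lazily on first use and reused afterwards. A minimal standalone sketch of that pattern follows; the `StrokeLookup` class, the `fetch_rows` callable and the sample rows are hypothetical stand-ins for the database query and are not part of cjklib.

```python
class StrokeLookup(object):
    """Caches an abbreviation -> stroke mapping on first access."""

    def __init__(self, fetch_rows):
        # fetch_rows is a hypothetical stand-in for the 'Strokes' query;
        # it must return (stroke, abbreviation) pairs
        self._fetch_rows = fetch_rows
        self._lookup = None
        self.queries = 0

    def get(self, abbrev):
        if self._lookup is None:
            # build the table only once
            self.queries += 1
            self._lookup = dict(
                (abbr, stroke) for stroke, abbr in self._fetch_rows())
        if abbrev in self._lookup:
            return self._lookup[abbrev]
        raise ValueError(abbrev + " is no valid stroke abbreviation")


# illustrative rows only, not read from the real table
SAMPLE_ROWS = [(u'㇐', 'H'), (u'㇑', 'S'), (u'㇕', 'HZ')]
lookup = StrokeLookup(lambda: SAMPLE_ROWS)
```

Repeated calls to `lookup.get()` reuse the cached dictionary, so the underlying query runs once.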


    def getStrokeForName(self, name):

        u"""
        Gets the stroke form for the given name (e.g. '横折').

        @type name: str
        @param name: Chinese name of stroke
        @rtype: str
        @return: Unicode stroke character
        @raise ValueError: if invalid stroke name is specified
        """
        table = self.db.tables['Strokes']
        stroke = self.db.selectScalar(select([table.c.Stroke],
            table.c.Name == name))
        if stroke:
            return stroke
        else:
            raise ValueError(name + " is no valid stroke name")


    def getStrokeOrder(self, char, locale=None, zVariant=0):

        """
        Gets the stroke order sequence for the given character.

        The stroke order is constructed using the character decomposition into
        components. As the stroke order information for some components might
        not be obtainable, the returned stroke order might be partial.

        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: str
        @return: string of stroke abbreviations separated by spaces and hyphens.
        @raise ValueError: if an invalid I{character locale} is specified
        @raise NoInformationError: if no stroke order information available
        @todo Lang: Add stroke order source to stroke order data so that in
            general different and contradicting stroke order information can be
            given. The user then could prefer several sources that in the order
            given would be queried.
        """
        def getStrokeOrderEntry(char, zVariant):
            """
            Gets the stroke order sequence for the given character from the
            database's stroke order lookup table.

            @type char: str
            @param char: Chinese character
            @type zVariant: int
            @param zVariant: I{Z-variant} of the first character
            @rtype: str
            @return: string of stroke abbreviations separated by spaces and
                hyphens.
            @raise NoInformationError: if no stroke order information available
            """
            table = self.db.tables['StrokeOrder']
            result = self.db.selectScalar(select([table.c.StrokeOrder],
                and_(table.c.ChineseCharacter == char,
                    table.c.ZVariant == zVariant), distinct=True))
            if not result:
                raise exception.NoInformationError(
                    "Character has no stroke order information")
            return result

        def getFromDecomposition(decompositionTreeList):
            """
            Gets stroke order from the tree of a single partition entry.

            @type decompositionTreeList: list
            @param decompositionTreeList: list of decomposition trees to derive
                the stroke order from
            @rtype: str
            @return: string of stroke abbreviations separated by spaces and
                hyphens.
            @raise NoInformationError: if no stroke order information available
            """

            def getFromEntry(subTree, index=0):
                """
                Goes through a single layer of a tree recursively.

                @type subTree: list
                @param subTree: decomposition tree to derive the stroke order
                    from
                @type index: int
                @param index: index of current layer
                @rtype: tuple
                @return: string of stroke abbreviations separated by spaces and
                    hyphens, and the index of the last entry consumed
                @raise NoInformationError: if no stroke order information
                    available
                """
                strokeOrder = []
                if type(subTree[index]) != type(()):
                    # IDS operator
                    character = subTree[index]
                    if self.isBinaryIDSOperator(character):
                        # check for IDS operators we can't make any order
                        # assumption about
                        if character in [u'⿴', u'⿻']:
                            raise exception.NoInformationError(
                                "Character has no stroke order information")
                        else:
                            if character in [u'⿺', u'⿶']:
                                # IDS operators with order right one first
                                subSequence = [1, 0]
                            else:
                                # IDS operators with order left one first
                                subSequence = [0, 1]
                        # Get stroke order for both components
                        subStrokeOrder = []
                        for i in range(0, 2):
                            so, index = getFromEntry(subTree, index+1)
                            subStrokeOrder.append(so)
                        # Append in proper order
                        for seq in subSequence:
                            strokeOrder.append(subStrokeOrder[seq])
                    elif self.isTrinaryIDSOperator(character):
                        # Get stroke order for three components
                        for i in range(0, 3):
                            so, index = getFromEntry(subTree, index+1)
                            strokeOrder.append(so)
                else:
                    # no IDS operator but character
                    char, charZVariant, componentTree = subTree[index]
                    # if the character is unknown or there is none raise
                    if char == u'？':
                        raise exception.NoInformationError(
                            "Character has no stroke order information")
                    else:
                        # check if we have a stroke order entry first;
                        # getStrokeOrderEntry() raises instead of returning an
                        # empty value, so the fallback has to catch that
                        try:
                            so = getStrokeOrderEntry(char, charZVariant)
                        except exception.NoInformationError:
                            # no entry, so get from partition
                            so = getFromDecomposition(componentTree)
                        strokeOrder.append(so)
                return (' '.join(strokeOrder), index)

            # Try to find a partition without unknown components, if more than
            # one partition is given (take the one with maximum entry length).
            # This ensures we will have a full stroke order if at least one
            # partition is complete. This is important as the database will
            # never be complete.
            strokeOrder = ''
            for decomposition in decompositionTreeList:
                try:
                    so, i = getFromEntry(decomposition)
                    if len(so) >= len(strokeOrder):
                        strokeOrder = so
                except exception.NoInformationError:
                    pass
            if not strokeOrder:
                raise exception.NoInformationError(
                    "Character has no stroke order information")
            return strokeOrder

        if locale is not None:
            zVariant = self.getLocaleDefaultZVariant(char, locale)
        # if there is an entry for the whole character return it
        try:
            strokeOrder = getStrokeOrderEntry(char, zVariant)
            return strokeOrder
        except exception.NoInformationError:
            pass
        # try to decompose character into components and build stroke order
        decompositionTreeList = self.getDecompositionTreeList(char,
            zVariant=zVariant)
        strokeOrder = getFromDecomposition(decompositionTreeList)
        return strokeOrder

    #}
    #{ Character radical functions
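The traversal in getStrokeOrder() walks a decomposition given in prefix notation, where an IDS operator is followed by its two or three components. The following standalone sketch shows that idea under simplifying assumptions. Entries are `(character, stroke order or None)` pairs rather than the `(char, zVariant, componentTree)` triples used above, `NoInformationError` is redefined locally, and the stroke abbreviations in the usage note are illustrative only.

```python
TRINARY_OPS = set(u'⿲⿳')
NO_ORDER_OPS = set(u'⿴⿻')       # no order assumption possible
SECOND_FIRST_OPS = set(u'⿺⿶')   # second component is written first


class NoInformationError(Exception):
    """Local stand-in for cjklib's exception of the same name."""


def stroke_order(tree, index=0):
    """Return (stroke order, index of last consumed entry) for a prefix list."""
    item = tree[index]
    if isinstance(item, tuple):
        # simplified leaf: (character, stroke order or None)
        char, strokes = item
        if strokes is None:
            raise NoInformationError(char)
        return strokes, index
    if item in NO_ORDER_OPS:
        raise NoInformationError(item)
    count = 3 if item in TRINARY_OPS else 2
    parts = []
    for _ in range(count):
        part, index = stroke_order(tree, index + 1)
        parts.append(part)
    if item in SECOND_FIRST_OPS:
        parts.reverse()
    return ' '.join(parts), index
```

For example, `stroke_order([u'⿰', (u'女', 'PD-P-H'), (u'子', 'HP-WG-H')])` joins the two component orders left to right, while a tree starting with `⿺` puts the second component's strokes first.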

    def getCharacterKangxiRadicalIndex(self, char):

        """
        Gets the Kangxi radical index for the given character as defined by the
        I{Unihan} database.

        @type char: str
        @param char: Chinese character
        @rtype: int
        @return: Kangxi radical index
        @raise NoInformationError: if no Kangxi radical index information for
            given character
        """
        table = self.db.tables['CharacterKangxiRadical']
        result = self.db.selectScalar(select([table.c.RadicalIndex],
            table.c.ChineseCharacter == char))
        if not result:
            raise exception.NoInformationError(
                "Character has no Kangxi radical information")
        return result


    def getCharacterKangxiRadicalResidualStrokeCount(self, char, locale=None,
        zVariant=0):

        u"""
        Gets the Kangxi radical form (either a I{Unicode radical form} or a
        I{Unicode radical variant}) found as a component in the character and
        the stroke count of the residual character components.

        The representation of the included radical or radical variant form
        depends on the respective character variant, and thus the form's
        Z-variant is returned. Some characters include the given radical more
        than once, and in some cases the representation differs between those
        occurrences. In the general case several matches can therefore be
        returned, each entry with a different radical form Z-variant. In these
        cases the entries are sorted by their Z-variant.

        There are characters which include both the radical form and a variant
        form of the radical (e.g. 伦: 人 and 亻). In these cases both are
        returned.

        This method will return radical forms regardless of the selected locale,
        e.g. radical ⻔ is returned for character 间, though this variant form is
        not recognised under a traditional locale (like the character itself).

        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: list of tuple
        @return: list of radical/variant form, its Z-variant, the main layout of
            the character (using a I{IDS operator}), the position of the radical
            wrt. layout (0, 1 or 2) and the residual stroke count.
        @raise NoInformationError: if no stroke count information available
        @raise ValueError: if an invalid I{character locale} is specified
        """
        radicalIndex = self.getCharacterKangxiRadicalIndex(char)
        entries = self.getCharacterRadicalResidualStrokeCount(char,
            radicalIndex, locale, zVariant)
        if entries:
            return entries
        else:
            raise exception.NoInformationError(
                "Character has no radical form information")


    def getCharacterRadicalResidualStrokeCount(self, char, radicalIndex,
        locale=None, zVariant=0):

        u"""
        Gets the radical form (either a I{Unicode radical form} or a
        I{Unicode radical variant}) found as a component in the character and
        the stroke count of the residual character components.

        This is a more general version of
        L{getCharacterKangxiRadicalResidualStrokeCount()} which is not limited
        to the mapping of characters to a Kangxi radical as done by Unihan.

        @type char: str
        @param char: Chinese character
        @type radicalIndex: int
        @param radicalIndex: radical index
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: list of tuple
        @return: list of radical/variant form, its Z-variant, the main layout of
            the character (using a I{IDS operator}), the position of the radical
            wrt. layout (0, 1 or 2) and the residual stroke count.
        @raise NoInformationError: if no stroke count information available
        @raise ValueError: if an invalid I{character locale} is specified
        @todo Lang: Clarify on characters classified under a given radical
            but without any proper radical glyph found as component.
        @todo Lang: Clarify on different radical zVariants for the same radical
            form. At best this method should return one and only one radical
            form (glyph).
        @todo Impl: Give the I{Unicode radical form} and not the equivalent
            character form in the relevant table as to always return the pure
            radical form (also avoids duplicates). Then state:

            If the included component has an appropriate I{Unicode radical form}
            or I{Unicode radical variant}, then this form is returned. In either
            case the radical form can be an ordinary character.
        """
        if locale is not None:
            zVariant = self.getLocaleDefaultZVariant(char, locale)
        table = self.db.tables['CharacterRadicalResidualStrokeCount']
        # add key columns to sort order to make return value deterministic
        entries = self.db.selectRows(select([table.c.RadicalForm,
            table.c.RadicalZVariant, table.c.MainCharacterLayout,
            table.c.RadicalRelativePosition, table.c.ResidualStrokeCount],
            and_(table.c.ChineseCharacter == char, table.c.ZVariant == zVariant,
                table.c.RadicalIndex == radicalIndex)).order_by(
                    table.c.ResidualStrokeCount, table.c.RadicalZVariant,
                    table.c.RadicalForm, table.c.MainCharacterLayout,
                    table.c.RadicalRelativePosition))
        if entries:
            return entries
        else:
            raise exception.NoInformationError(
                "Character has no radical form information")


    def getCharacterRadicalResidualStrokeCountDict(self):

        """
        Gets the full table of radical forms (either a I{Unicode radical form}
        or a I{Unicode radical variant}) found as a component in the character
        and the stroke count of the residual character components from the
        database.

        A typical entry looks like
        C{(u'众', 0): {9: [(u'人', 0, u'⿱', 0, 4), (u'人', 0, u'⿻', 0, 4)]}},
        and can be accessed as C{radicalDict[(u'众', 0)][9]} with the Chinese
        character, its Z-variant and Kangxi radical index. The values are given
        in the order I{radical form}, I{radical Z-variant}, I{character layout},
        I{relative position of the radical} and finally the
        I{residual stroke count}.

        @rtype: dict
        @return: dictionary of radical/residual stroke count entries.
        """
        radicalDict = {}
        # get entries from database
        table = self.db.tables['CharacterRadicalResidualStrokeCount']
        entries = self.db.selectRows(select([table.c.ChineseCharacter,
            table.c.ZVariant, table.c.RadicalIndex, table.c.RadicalForm,
            table.c.RadicalZVariant, table.c.MainCharacterLayout,
            table.c.RadicalRelativePosition, table.c.ResidualStrokeCount])\
            .order_by(table.c.ResidualStrokeCount, table.c.RadicalZVariant,
                table.c.RadicalForm, table.c.MainCharacterLayout,
                table.c.RadicalRelativePosition))
        for entry in entries:
            char, zVariant, radicalIndex, radicalForm, radicalZVariant, \
                mainCharacterLayout, radicalRelativePosition, \
                residualStrokeCount = entry

            if (char, zVariant) not in radicalDict:
                radicalDict[(char, zVariant)] = {}

            if radicalIndex not in radicalDict[(char, zVariant)]:
                radicalDict[(char, zVariant)][radicalIndex] = []

            radicalDict[(char, zVariant)][radicalIndex].append(
                (radicalForm, radicalZVariant, mainCharacterLayout,
                    radicalRelativePosition, residualStrokeCount))

        return radicalDict
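The dictionary built above is keyed first by `(character, Z-variant)` and then by radical index. As a rough standalone illustration, the same grouping can be written with `dict.setdefault`; the function name `group_radical_rows` is hypothetical, and the rows are simply the example entry from the docstring flattened back into table rows.

```python
def group_radical_rows(rows):
    """Group flat table rows into {(char, zVariant): {radicalIndex: [...]}}."""
    grouped = {}
    for row in rows:
        char, z_variant, radical_index = row[:3]
        by_index = grouped.setdefault((char, z_variant), {})
        # remaining columns: form, form Z-variant, layout, position, count
        by_index.setdefault(radical_index, []).append(tuple(row[3:]))
    return grouped


# rows reconstructed from the docstring example, not queried from a database
ROWS = [
    (u'众', 0, 9, u'人', 0, u'⿱', 0, 4),
    (u'众', 0, 9, u'人', 0, u'⿻', 0, 4),
]
radical_dict = group_radical_rows(ROWS)
```

The entries keep the order in which the rows arrive, which is why the query above sorts on all value columns.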


    def getCharacterKangxiResidualStrokeCount(self, char, locale=None,
        zVariant=0):

        u"""
        Gets the stroke count of the residual character components when leaving
        aside the radical form.

        This method returns a subset of the data given by
        L{getCharacterKangxiRadicalResidualStrokeCount()}. It may nevertheless
        offer more entries, as there might exist information only about the
        residual stroke count, but not about the concrete radical form.

        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: int
        @return: residual stroke count
        @raise NoInformationError: if no stroke count information available
        @raise ValueError: if an invalid I{character locale} is specified
        @attention: The quality of the returned data depends on the sources used
            when compiling the database. Unihan itself only gives very general
            stroke order information without being bound to a specific glyph.
        """
        radicalIndex = self.getCharacterKangxiRadicalIndex(char)
        return self.getCharacterResidualStrokeCount(char, radicalIndex,
            locale, zVariant)


    def getCharacterResidualStrokeCount(self, char, radicalIndex, locale=None,
        zVariant=0):

        u"""
        Gets the stroke count of the residual character components when leaving
        aside the radical form.

        This is a more general version of
        L{getCharacterKangxiResidualStrokeCount()} which is not limited to the
        mapping of characters to a Kangxi radical as done by Unihan.

        @type char: str
        @param char: Chinese character
        @type radicalIndex: int
        @param radicalIndex: radical index
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: int
        @return: residual stroke count
        @raise NoInformationError: if no stroke count information available
        @raise ValueError: if an invalid I{character locale} is specified
        @attention: The quality of the returned data depends on the sources used
            when compiling the database. Unihan itself only gives very general
            stroke order information without being bound to a specific glyph.
        """
        if locale is not None:
            zVariant = self.getLocaleDefaultZVariant(char, locale)
        table = self.db.tables['CharacterResidualStrokeCount']
        entry = self.db.selectScalar(select([table.c.ResidualStrokeCount],
            and_(table.c.ChineseCharacter == char, table.c.ZVariant == zVariant,
                table.c.RadicalIndex == radicalIndex)))
        if entry is not None:
            return entry
        else:
            raise exception.NoInformationError(
                "Character has no residual stroke count information")
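Unlike most lookups in this class, the method above compares the result against C{None} rather than testing its truth value. A residual stroke count of 0 is a legitimate result, for instance when the character is the radical itself. The sketch below isolates that check; `residual_stroke_count` is a hypothetical helper, and `LookupError` stands in for `exception.NoInformationError`.

```python
def residual_stroke_count(entry):
    """Return the count, treating only None (no row found) as missing."""
    if entry is not None:
        return entry
    # LookupError is a stand-in for cjklib's NoInformationError
    raise LookupError("Character has no residual stroke count information")
```

A plain `if entry:` test would wrongly reject the valid count 0.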


    def getCharacterResidualStrokeCountDict(self):

        """
        Gets the full table of stroke counts of the residual character
        components from the database.

        A typical entry looks like C{(u'众', 0): {9: 4}},
        and can be accessed as C{residualCountDict[(u'众', 0)][9]} with the
        Chinese character, its Z-variant and Kangxi radical index which then
        gives the I{residual stroke count}.

        @rtype: dict
        @return: dictionary of radical/residual stroke count entries.
        """
        residualCountDict = {}
        # get entries from database
        table = self.db.tables['CharacterResidualStrokeCount']
        entries = self.db.selectRows(select([table.c.ChineseCharacter,
            table.c.ZVariant, table.c.RadicalIndex,
            table.c.ResidualStrokeCount]))
        for entry in entries:
            char, zVariant, radicalIndex, residualStrokeCount = entry

            if (char, zVariant) not in residualCountDict:
                residualCountDict[(char, zVariant)] = {}

            residualCountDict[(char, zVariant)][radicalIndex] \
                = residualStrokeCount

        return residualCountDict


    def getCharactersForKangxiRadicalIndex(self, radicalIndex):

        """
        Gets all characters for the given Kangxi radical index.

        @type radicalIndex: int
        @param radicalIndex: Kangxi radical index
        @rtype: list of str
        @return: list of matching Chinese characters
        @todo Docu: Write about how Unihan maps characters to a Kangxi radical.
            Especially Chinese simplified characters.
        @todo Lang: 6954 characters have no Kangxi radical. Provide integration
            for these (SELECT COUNT(*) FROM Unihan WHERE kRSUnicode IS NOT NULL
            AND kRSKangxi IS NULL;).
        """
        table = self.db.tables['CharacterKangxiRadical']
        return self.db.selectScalars(select([table.c.ChineseCharacter],
            table.c.RadicalIndex == radicalIndex))


    def getCharactersForRadicalIndex(self, radicalIndex):

        """
        Gets all characters for the given radical index.

        This is a more general version of
        L{getCharactersForKangxiRadicalIndex()} which is not limited to the
        mapping of characters to a Kangxi radical as done by Unihan. One
        character can therefore show up under several different radical
        indices.

        @type radicalIndex: int
        @param radicalIndex: Kangxi radical index
        @rtype: list of str
        @return: list of matching Chinese characters
        """
        table = self.db.tables['CharacterResidualStrokeCount']
        return self.db.selectScalars(select([table.c.ChineseCharacter],
            table.c.RadicalIndex == radicalIndex))


    def getResidualStrokeCountForKangxiRadicalIndex(self, radicalIndex):

        """
        Gets all characters and residual stroke count for the given Kangxi
        radical index.

        This brings together methods L{getCharactersForKangxiRadicalIndex()} and
        L{getCharacterResidualStrokeCountDict()} and reports all characters
        including the given Kangxi radical, additionally supplying the residual
        stroke count.

        @type radicalIndex: int
        @param radicalIndex: Kangxi radical index
        @rtype: list of tuple
        @return: list of matching Chinese characters with residual stroke count
        """
        kangxiTable = self.db.tables['CharacterKangxiRadical']
        residualTable = self.db.tables['CharacterResidualStrokeCount']
        return self.db.selectRows(select([residualTable.c.ChineseCharacter,
            residualTable.c.ResidualStrokeCount],
            kangxiTable.c.RadicalIndex == radicalIndex,
            from_obj=[residualTable.join(kangxiTable,
                and_(residualTable.c.ChineseCharacter \
                        == kangxiTable.c.ChineseCharacter,
                    residualTable.c.RadicalIndex \
                        == kangxiTable.c.RadicalIndex))]))


    def getResidualStrokeCountForRadicalIndex(self, radicalIndex):

        """
        Gets all characters and residual stroke count for the given radical
        index.

        This brings together methods L{getCharactersForRadicalIndex()} and
        L{getCharacterResidualStrokeCountDict()} and reports all characters
        including the given radical without being limited to the mapping of
        characters to a Kangxi radical as done by Unihan, additionally supplying
        the residual stroke count.

        @type radicalIndex: int
        @param radicalIndex: Kangxi radical index
        @rtype: list of tuple
        @return: list of matching Chinese characters with residual stroke count
        """
        table = self.db.tables['CharacterResidualStrokeCount']
        return self.db.selectRows(
            select([table.c.ChineseCharacter, table.c.ResidualStrokeCount],
                table.c.RadicalIndex == radicalIndex))

    #}
    #{ Radical form functions

    def getKangxiRadicalForm(self, radicalIdx, locale):

        u"""
        Gets a I{Unicode radical form} for the given Kangxi radical index.

        This method will always return a single non-null value, even if there
        are several radical forms for one index.

        @type radicalIdx: int
        @param radicalIdx: Kangxi radical index
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: str
        @return: I{Unicode radical form}
        @raise ValueError: if an invalid I{character locale} or radical index is
            specified
        @todo Lang: Check if radicals for which multiple radical forms exist
            include a simplified form or other variation (e.g. ⻆, ⻝, ⺐).
            There are radicals for which a Chinese simplified character
            equivalent exists and that is mapped to a different radical under
            Unicode.
        """
        if radicalIdx < 1 or radicalIdx > 214:
            raise ValueError("Radical index '" + unicode(radicalIdx) \
                + "' not in range between 1 and 214")

        table = self.db.tables['KangxiRadical']
        radicalForms = self.db.selectScalars(select([table.c.Form],
            and_(table.c.RadicalIndex == radicalIdx, table.c.Type == 'R',
                table.c.Locale.like(self._locale(locale))))\
            .order_by(table.c.SubIndex))
        return radicalForms[0]


    def getKangxiRadicalVariantForms(self, radicalIdx, locale):

        """
        Gets a list of I{Unicode radical variant}s for the given Kangxi radical
        index.

        This method can return an empty list if there are no
        I{Unicode radical variant} forms. There might still be variants of this
        radical that exist only as character forms and not as
        I{Unicode radical variant}s.

        @type radicalIdx: int
        @param radicalIdx: Kangxi radical index
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: list of str
        @return: list of I{Unicode radical variant}s
        @raise ValueError: if an invalid I{character locale} is specified
        @todo Lang: Narrow locales, not all variant forms are valid under all
            locales.
        """
        table = self.db.tables['KangxiRadical']
        return self.db.selectScalars(select([table.c.Form],
            and_(table.c.RadicalIndex == radicalIdx, table.c.Type == 'V',
                table.c.Locale.like(self._locale(locale))))\
            .order_by(table.c.SubIndex))


    def getKangxiRadicalIndex(self, radicalForm, locale=None):

        """
        Gets the Kangxi radical index for the given form.

        The given form might either be a I{Unicode radical form} or an
        I{equivalent character}.

        If there is an entry for the given radical form it still might not be a
        radical under the given character locale. So specifying a locale allows
        strict radical handling.

        @type radicalForm: str
        @param radicalForm: radical form
        @type locale: str
        @param locale: optional I{character locale} (one out of TCJKV)
        @rtype: int
        @return: Kangxi radical index
        @raise ValueError: if invalid I{character locale} or radical form is
            specified
        """
        # check in radical table
        if locale:
            locale = self._locale(locale)
        else:
            locale = '%'

        table = self.db.tables['KangxiRadical']
        result = self.db.selectScalar(select([table.c.RadicalIndex],
            and_(table.c.Form == radicalForm, table.c.Locale.like(locale))))
        if result:
            return result
        else:
            # check in radical equivalent character table, join tables
            kangxiTable = self.db.tables['KangxiRadical']
            equivalentTable = self.db.tables['RadicalEquivalentCharacter']
            result = self.db.selectScalars(select([kangxiTable.c.RadicalIndex],
                and_(equivalentTable.c.EquivalentForm == radicalForm,
                    equivalentTable.c.Locale.like(locale),
                    kangxiTable.c.Locale.like(locale)),
                from_obj=[kangxiTable.join(equivalentTable,
                    kangxiTable.c.Form == equivalentTable.c.Form)]))

            if result:
                return result[0]
            else:
                # check in isolated radical equivalent character table
                table = self.db.tables['KangxiRadicalIsolatedCharacter']
                result = self.db.selectScalar(select([table.c.RadicalIndex],
                    and_(table.c.EquivalentForm == radicalForm,
                        table.c.Locale.like(locale))))
                if result:
                    return result
        raise ValueError(radicalForm + " is no valid Kangxi radical," \
            + " variant form or equivalent character")
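The method above tries three tables in turn and uses the SQL wildcard C{'%'} when no locale is given. The standalone sketch below mimics that fallback chain with the standard library's `sqlite3` module. The column layout, the sample rows and the `'%' + locale + '%'` pattern are assumptions made for illustration; in particular, what `_locale()` actually returns is not shown in this section.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# hypothetical, reduced schema and sample data
conn.executescript(u"""
    CREATE TABLE KangxiRadical (RadicalIndex INTEGER, Form TEXT, Locale TEXT);
    CREATE TABLE RadicalEquivalentCharacter
        (Form TEXT, EquivalentForm TEXT, Locale TEXT);
    CREATE TABLE KangxiRadicalIsolatedCharacter
        (RadicalIndex INTEGER, EquivalentForm TEXT, Locale TEXT);
    INSERT INTO KangxiRadical VALUES (149, '⾔', 'TCJKV');
    INSERT INTO KangxiRadical VALUES (149, '⻈', 'C');
    INSERT INTO RadicalEquivalentCharacter VALUES ('⾔', '言', 'TCJKV');
    INSERT INTO RadicalEquivalentCharacter VALUES ('⻈', '讠', 'C');
""")


def radical_index(form, locale=None):
    """Look the form up in three tables, stopping at the first match."""
    # assumed pattern: match any Locale value containing the given letter
    pattern = '%' + locale + '%' if locale else '%'
    attempts = [
        ("SELECT RadicalIndex FROM KangxiRadical"
         " WHERE Form = ? AND Locale LIKE ?", (form, pattern)),
        ("SELECT k.RadicalIndex FROM KangxiRadical k"
         " JOIN RadicalEquivalentCharacter e ON k.Form = e.Form"
         " WHERE e.EquivalentForm = ? AND e.Locale LIKE ?"
         " AND k.Locale LIKE ?", (form, pattern, pattern)),
        ("SELECT RadicalIndex FROM KangxiRadicalIsolatedCharacter"
         " WHERE EquivalentForm = ? AND Locale LIKE ?", (form, pattern)),
    ]
    for sql, params in attempts:
        row = conn.execute(sql, params).fetchone()
        if row is not None:
            return row[0]
    raise ValueError(form + " is no valid Kangxi radical,"
        " variant form or equivalent character")
```

With this sample data a simplified form such as 讠 resolves under locale `'C'` but raises `ValueError` under locale `'T'`, which is the strict handling the docstring refers to.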


    def getKangxiRadicalRepresentativeCharacters(self, radicalIdx, locale):
        u"""
        Gets a list of characters that represent the radical for the given
        Kangxi radical index.

        This includes the radical form(s), character equivalents
        and variant forms and equivalents.

        E.g. character for I{to speak/to say/talk/word} (Pinyin I{yán}):
        ⾔ (0x2f94), 言 (0x8a00), ⻈ (0x2ec8), 讠 (0x8ba0), 訁 (0x8a01)

        @type radicalIdx: int
        @param radicalIdx: Kangxi radical index
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: list of str
        @return: list of Chinese characters representing the radical for the
            given index, including Unicode radical and variant forms and their
            equivalent real character forms
        @raise ValueError: if invalid I{character locale} specified
        """
        kangxiTable = self.db.tables['KangxiRadical']
        equivalentTable = self.db.tables['RadicalEquivalentCharacter']
        isolatedTable = self.db.tables['KangxiRadicalIsolatedCharacter']

        return self.db.selectScalars(union(
            select([kangxiTable.c.Form],
                and_(kangxiTable.c.RadicalIndex == radicalIdx,
                    kangxiTable.c.Locale.like(self._locale(locale)))),

            select([equivalentTable.c.EquivalentForm],
                and_(kangxiTable.c.RadicalIndex == radicalIdx,
                    equivalentTable.c.Locale.like(self._locale(locale)),
                    kangxiTable.c.Locale.like(self._locale(locale))),
                from_obj=[kangxiTable.join(equivalentTable,
                    kangxiTable.c.Form == equivalentTable.c.Form)]),

            select([isolatedTable.c.EquivalentForm],
                and_(isolatedTable.c.RadicalIndex == radicalIdx,
                    isolatedTable.c.Locale.like(self._locale(locale))))))


    def isKangxiRadicalFormOrEquivalent(self, form, locale=None):
        """
        Checks if the given form is a Kangxi radical form or a radical
        equivalent. This includes I{Unicode radical form}s,
        I{Unicode radical variant}s, I{equivalent character} and
        I{isolated radical character}s.

        If there is an entry for the given radical form it still might not be a
        radical under the given character locale. So specifying a locale allows
        strict radical handling.

        @type form: str
        @param form: Chinese character
        @type locale: str
        @param locale: optional I{character locale} (one out of TCJKV)
        @rtype: bool
        @return: C{True} if given form is a radical or I{equivalent character},
            C{False} otherwise
        @raise ValueError: if an invalid I{character locale} is specified
        """
        try:
            self.getKangxiRadicalIndex(form, locale)
            return True
        except ValueError:
            return False


    def isRadicalChar(self, char):
        """
        Checks if the given character is a I{Unicode radical form} or
        I{Unicode radical variant}.

        This method does a quick Unicode code index checking. So there is no
        guarantee this form has actually a radical entry in the database.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if given form is a radical form, C{False} otherwise
        """
        # check if Unicode code point of character lies in between U+2e80 and
        # U+2fd5
        return char >= u'⺀' and char <= u'⿕'
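The range check above needs no database, so it can be tried in isolation. The following is a minimal standalone sketch, not part of cjklib; the function name C{is_radical_char} is hypothetical. It compares the code point against the same two bounds, U+2E80 and U+2FD5.

```python
# Standalone sketch (not part of cjklib): code point range check for
# Unicode radical forms and radical variants.
RADICAL_FIRST = 0x2E80  # first code point of the CJK Radicals Supplement block
RADICAL_LAST = 0x2FD5   # last Kangxi radical (Radical 214)


def is_radical_char(char):
    """Return True if the single character lies in U+2E80..U+2FD5."""
    return RADICAL_FIRST <= ord(char) <= RADICAL_LAST


print(is_radical_char(u'\u2f94'))  # Kangxi radical form
print(is_radical_char(u'\u8a00'))  # ordinary ideograph outside the range
```

As with the method above, a positive answer only says the code point lies in the radical range; it does not guarantee that a radical entry exists for it.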


    def getRadicalFormEquivalentCharacter(self, radicalForm, locale):
        u"""
        Gets the I{equivalent character} of the given I{Unicode radical form} or
        I{Unicode radical variant}.

        The mapping mostly follows the X{Han Radical folding} specified in
        the Draft X{Unicode Technical Report #30} X{Character Foldings} under
        U{http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding}.
        All radical forms except U+2E80 (⺀) have an equivalent character. These
        equivalent characters are not necessarily visually identical and can be
        subject to major variation.

        This method may raise an UnsupportedError if there is no supported
        I{equivalent character} form.

        @type radicalForm: str
        @param radicalForm: I{Unicode radical form}
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: str
        @return: I{equivalent character} form
        @raise UnsupportedError: if there is no supported
            I{equivalent character} form
        @raise ValueError: if invalid I{character locale} or radical form is
            specified
        """
        if not self.isRadicalChar(radicalForm):
            raise ValueError(radicalForm + " is no valid radical form")

        table = self.db.tables['RadicalEquivalentCharacter']
        equivChar = self.db.selectScalar(select([table.c.EquivalentForm],
            and_(table.c.Form == radicalForm,
                table.c.Locale.like(self._locale(locale)))))
        if equivChar:
            return equivChar
        else:
            raise exception.UnsupportedError(
                "no equivalent character supported for '" + radicalForm + "'")


    def getCharacterEquivalentRadicalForms(self, equivalentForm, locale):
        """
        Gets I{Unicode radical form}s or I{Unicode radical variant}s for the
        given I{equivalent character}.

        The mapping mostly follows the I{Han Radical folding} specified in
        the Draft I{Unicode Technical Report #30} I{Character Foldings} under
        U{http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding}.
        Several radical forms can be mapped to the same equivalent character
        and thus this method in general returns several values.

        @type equivalentForm: str
        @param equivalentForm: Equivalent character of I{Unicode radical form}
            or I{Unicode radical variant}
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: list of str
        @return: I{Unicode radical form}s or I{Unicode radical variant}s mapped
            to the given I{equivalent character}
        @raise ValueError: if invalid I{character locale} or equivalent
            character is specified
        """
        table = self.db.tables['RadicalEquivalentCharacter']
        result = self.db.selectScalars(select([table.c.Form],
            and_(table.c.EquivalentForm == equivalentForm,
                table.c.Locale.like(self._locale(locale)))))
        if result:
            return result
        else:
            raise ValueError(equivalentForm \
                + " is no valid equivalent character under the given locale")

    #}
    #{ Character component functions

    IDS_BINARY = [u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', u'⿺',
        u'⿻']
    """
    A list of I{binary IDS operator}s used to describe character decompositions.
    """
    IDS_TRINARY = [u'⿲', u'⿳']
    """
    A list of I{trinary IDS operator}s used to describe character
    decompositions.
    """

    @classmethod
    def isBinaryIDSOperator(cls, char):
        """
        Checks if given character is a I{binary IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{binary IDS operator}, C{False} otherwise
        """
        return char in set(cls.IDS_BINARY)


    @classmethod
    def isTrinaryIDSOperator(cls, char):
        """
        Checks if given character is a I{trinary IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{trinary IDS operator}, C{False} otherwise
        """
        return char in set(cls.IDS_TRINARY)


    @classmethod
    def isIDSOperator(cls, char):
        """
        Checks if given character is an I{IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{IDS operator}, C{False} otherwise
        """
        return cls.isBinaryIDSOperator(char) or cls.isTrinaryIDSOperator(char)
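The three operator checks above boil down to asking how many operands an IDS operator takes. The following is a minimal standalone sketch, not part of cjklib, that answers this with one function; the name C{ids_operator_arity} is hypothetical.

```python
# Standalone sketch (not part of cjklib): arity of an IDS operator.
# The sets are built once at import time instead of on every call.
IDS_BINARY = frozenset(u'⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻')
IDS_TRINARY = frozenset(u'⿲⿳')


def ids_operator_arity(char):
    """Return 2 or 3 for an IDS operator, 0 for any other character."""
    if char in IDS_BINARY:
        return 2
    if char in IDS_TRINARY:
        return 3
    return 0
```

A parser for decomposition strings could use the returned arity to know how many components to expect after an operator.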


    def getCharactersForComponents(self, componentList, locale,
        includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False):
        u"""
        Gets all characters that contain the given components.

        If option C{includeEquivalentRadicalForms} is set, all equivalent forms
        will be searched for when a Kangxi radical is given.

        @type componentList: list of str
        @param componentList: list of character components
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @type includeEquivalentRadicalForms: bool
        @param includeEquivalentRadicalForms: if C{True} then characters in the
            given component list are interpreted as representatives for their
            radical and all radical forms are included in the search. E.g. 肉
            will include ⺼ as a possible component.
        @type resultIncludeRadicalForms: bool
        @param resultIncludeRadicalForms: if C{True} the result will include
            I{Unicode radical forms} and I{Unicode radical variants}
        @rtype: list of tuple
        @return: list of pairs of matching characters and their Z-variants
        @raise ValueError: if an invalid I{character locale} is specified
        @todo Impl: Table of same character glyphs, including special radical
            forms (e.g. 言 and 訁).
        @todo Data: Adopt locale dependent Z-variants for parent characters
            (e.g. 鬼 in 隗愧嵬).
        @todo Data: Use radical forms and radical variant forms instead of
            equivalent characters in decomposition data. Mapping loses
            information.
        @todo Lang: By default we get the equivalent character for a radical
            form. In some cases these equivalent characters will be only
            abstractly related to the given radical form (e.g. being the main
            radical form), so that the result set will be too big and doesn't
            reflect the original query. Set up a table including only strict
            visual relations between radical forms and equivalent characters.
            Alternatively restrict decomposition data to only include radical
            forms if appropriate, so there would be no need for conversion.
        """
        equivCharTable = []
        for component in componentList:
            try:
                # check if component is a radical and get index
                radicalIdx = self.getKangxiRadicalIndex(component, locale)

                componentEquivalents = [component]
                if includeEquivalentRadicalForms:
                    # if includeEquivalentRadicalForms is set get all forms
                    componentEquivalents = \
                        self.getKangxiRadicalRepresentativeCharacters(
                            radicalIdx, locale)
                else:
                    if self.isRadicalChar(component):
                        try:
                            componentEquivalents.append(
                                self.getRadicalFormEquivalentCharacter(
                                    component, locale))
                        except exception.UnsupportedError:
                            # pass if no equivalent char existent
                            pass
                    else:
                        componentEquivalents.extend(
                            self.getCharacterEquivalentRadicalForms(component,
                                locale))
                equivCharTable.append(componentEquivalents)
            except ValueError:
                # not a radical: search for the component as given
                equivCharTable.append([component])

        return self.getCharactersForEquivalentComponents(equivCharTable, locale,
            resultIncludeRadicalForms=resultIncludeRadicalForms)


    def getCharactersForEquivalentComponents(self, componentConstruct,
        locale=None, resultIncludeRadicalForms=False):
        u"""
        Gets all characters that contain at least one component per list entry,
        sorted by stroke count if available.

        This is the general form of L{getCharactersForComponents()} and allows a
        set of alternative characters per list entry, of which at least one
        must be a component of each returned character.

        If a I{character locale} is specified only characters will be returned
        for which the locale's default I{Z-variant}'s decomposition will apply
        to the given components. Otherwise all Z-variants will be considered.

        @type componentConstruct: list of list of str
        @param componentConstruct: list of character components given as single
            characters or, for alternative characters, given as a list
        @type resultIncludeRadicalForms: bool
        @param resultIncludeRadicalForms: if C{True} the result will include
            I{Unicode radical forms} and I{Unicode radical variants}
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV)
        @rtype: list of tuple
        @return: list of pairs of matching characters and their Z-variants
        @raise ValueError: if an invalid I{character locale} is specified
        """
        if not componentConstruct:
            return []

        # create where clauses
        lookupTable = self.db.tables['ComponentLookup']
        localeTable = self.db.tables['LocaleCharacterVariant']
        strokeCountTable = self.db.tables['StrokeCount']

        joinTables = []         # join over all tables by char and z-Variant
        filters = []            # filter for locale and component

        # generate filter for each component
        for i, characterList in enumerate(componentConstruct):
            lookupTableAlias = lookupTable.alias('s%d' % i)
            joinTables.append(lookupTableAlias)
            # find chars for components, also include 米 for [u'米', u'木'].
            filters.append(or_(lookupTableAlias.c.Component.in_(characterList),
                lookupTableAlias.c.ChineseCharacter.in_(characterList)))

        # join with LocaleCharacterVariant and allow only forms matching the
        # given locale, unless no locale entry exists
        if locale:
            joinTables.append(localeTable)
            filters.append(or_(localeTable.c.Locale == None,
                localeTable.c.Locale.like(self._locale(locale))))

        # include stroke count to sort
        if self.hasStrokeCount:
            joinTables.append(strokeCountTable)

        # chain tables together in a JOIN
        fromObject = joinTables[0]
        for table in joinTables[1:]:
            fromObject = fromObject.outerjoin(table,
                onclause=and_(
                    table.c.ChineseCharacter \
                        == joinTables[0].c.ChineseCharacter,
                    table.c.ZVariant == joinTables[0].c.ZVariant))

        sel = select([joinTables[0].c.ChineseCharacter,
            joinTables[0].c.ZVariant], and_(*filters), from_obj=[fromObject],
            distinct=True)
        if self.hasStrokeCount:
            sel = sel.order_by(strokeCountTable.c.StrokeCount)

        result = self.db.selectRows(sel)

        if not resultIncludeRadicalForms:
            # exclude radical characters found in decomposition
            result = [(char, zVariant) for char, zVariant in result \
                if not self.isRadicalChar(char)]

        return result
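The matching rule of the query above is: for every entry of the component construct, a character qualifies if at least one of the listed alternatives is one of its components, or is the character itself. The following is a minimal standalone sketch of that rule over an in-memory dictionary, not part of cjklib. The name C{chars_for_equivalent_components} and the lookup data are hypothetical, and Z-variants, locale filtering and stroke count ordering are left out.

```python
# Standalone sketch (not part of cjklib): "one match per group" filtering.
# Hypothetical lookup data: character -> set of its components.
COMPONENT_LOOKUP = {
    u'好': {u'女', u'子'},
    u'妈': {u'女', u'马'},
    u'字': {u'宀', u'子'},
}


def chars_for_equivalent_components(component_construct):
    """Return characters matching at least one alternative per group."""
    if not component_construct:
        return []
    result = []
    for char in sorted(COMPONENT_LOOKUP):
        # a character also matches itself, like the ChineseCharacter
        # clause in the SQL filter
        candidates = COMPONENT_LOOKUP[char] | {char}
        if all(candidates.intersection(group)
               for group in component_construct):
            result.append(char)
    return result
```

Calling it with C{[[u'女'], [u'子']]} requires both components, while C{[[u'女', u'宀']]} accepts either of the two alternatives.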


    def getDecompositionEntries(self, char, locale=None, zVariant=0):
        """
        Gets the decomposition of the given character into components from the
        database. The resulting decomposition is only the first layer in a tree
        of possible paths along the decomposition as the components can be
        further subdivided.

        There can be several decompositions for one character so a list of
        decompositions is returned.

        Each entry in the result list consists of a list of characters (with
        their Z-variants) and IDS operators.

        @type char: str
        @param char: Chinese character that is to be decomposed into components
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: list
        @return: list of first layer decompositions
        @raise ValueError: if an invalid I{character locale} is specified
        """
        if locale != None:
            try:
                zVariant = self.getLocaleDefaultZVariant(char, locale)
            except exception.NoInformationError:
                # no decomposition available
                return []

        # get entries from database
        table = self.db.tables['CharacterDecomposition']
        result = self.db.selectScalars(select([table.c.Decomposition],
            and_(table.c.ChineseCharacter == char,
                table.c.ZVariant == zVariant)).order_by(table.c.SubIndex))

        # extract character Z-variant information (example entry: '⿱卜[1]尸')
        return [self._getDecompositionFromString(decomposition) \
            for decomposition in result]


    def getDecompositionEntriesDict(self):
        """
        Gets the full decomposition table from the database.

        @rtype: dict
        @return: dictionary with key pair character, Z-variant and the first
            layer decomposition as value
        """
        decompDict = {}
        # get entries from database
        table = self.db.tables['CharacterDecomposition']
        entries = self.db.selectRows(select([table.c.ChineseCharacter,
            table.c.ZVariant, table.c.Decomposition])\
                .order_by(table.c.SubIndex))
        for char, zVariant, decomposition in entries:
            if (char, zVariant) not in decompDict:
                decompDict[(char, zVariant)] = []

            decompDict[(char, zVariant)].append(
                self._getDecompositionFromString(decomposition))

        return decompDict


    def _getDecompositionFromString(self, decomposition):
        """
        Gets a tuple representation with character/Z-variant of the given
        character's decomposition into components.

        Example: Entry C{⿱尚[1]儿} will be returned as
        C{[u'⿱', (u'尚', 1), (u'儿', 0)]}.

        @type decomposition: str
        @param decomposition: character decomposition with IDS operator,
            components and optional Z-variant index
        @rtype: list
        @return: decomposition with character/Z-variant tuples
        """
        componentsList = []
        index = 0
        while index < len(decomposition):
            char = decomposition[index]
            if self.isIDSOperator(char):
                componentsList.append(char)
            else:
                # is Chinese character
                if index+1 < len(decomposition)\
                    and decomposition[index+1] == '[':

                    endIndex = decomposition.index(']', index+1)
                    # extract Z-variant information
                    charZVariant = int(decomposition[index+2:endIndex])
                    index = endIndex
                else:
                    # take default Z-variant if none specified
                    charZVariant = 0
                componentsList.append((char, charZVariant))
            index = index + 1
        return componentsList
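The parsing loop above depends only on the decomposition string and the set of IDS operators, so it can be reproduced without a database. The following is a minimal standalone sketch, not part of cjklib; the function name C{parse_decomposition} is hypothetical. It runs on the example entry given in the docstring above.

```python
# Standalone sketch (not part of cjklib): parse a decomposition entry
# such as u'⿱尚[1]儿' into IDS operators and (character, Z-variant) pairs.
IDS_OPERATORS = frozenset(u'⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻')


def parse_decomposition(decomposition):
    """Return a list of operators and (char, z_variant) tuples."""
    components = []
    index = 0
    while index < len(decomposition):
        char = decomposition[index]
        if char in IDS_OPERATORS:
            components.append(char)
        else:
            z_variant = 0  # default Z-variant if none is given
            # an optional "[n]" suffix carries the Z-variant index
            if decomposition[index + 1:index + 2] == u'[':
                end = decomposition.index(u']', index + 1)
                z_variant = int(decomposition[index + 2:end])
                index = end  # skip past the bracket expression
            components.append((char, z_variant))
        index += 1
    return components


print(parse_decomposition(u'⿱尚[1]儿'))
```

The slice C{decomposition[index + 1:index + 2]} is used instead of direct indexing so that the last character of the string needs no separate bounds check.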


    def getDecompositionTreeList(self, char, locale=None, zVariant=0):
        """
        Gets the decomposition of the given character into components as a list
        of decomposition trees.

        There can be several decompositions for one character so one tree per
        decomposition is returned.

        Each entry in the result list consists of a list of characters (with its
        Z-variant and list of further decomposition) and IDS operators. If a
        character can be further subdivided, its containing list is non empty
        and includes yet another list of trees for the decomposition of the
        component.

        @type char: str
        @param char: Chinese character that is to be decomposed into components
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the first character
        @rtype: list
        @return: list of decomposition trees
        @raise ValueError: if an invalid I{character locale} is specified
        """
        if locale != None:
            try:
                zVariant = self.getLocaleDefaultZVariant(char, locale)
            except exception.NoInformationError:
                # no decomposition available
                return []

        decompositionTreeList = []
        # get tree for each decomposition
        for componentsList in self.getDecompositionEntries(char,
            zVariant=zVariant):
            decompositionTree = []
            for component in componentsList:
                if type(component) != type(()):
                    # IDS operator
                    decompositionTree.append(component)
                else:
                    # Chinese character with zVariant info
                    character, characterZVariant = component
                    # get partition of component recursively
                    componentTree = self.getDecompositionTreeList(character,
                        zVariant=characterZVariant)
                    decompositionTree.append((character, characterZVariant,
                        componentTree))
            decompositionTreeList.append(decompositionTree)
        return decompositionTreeList
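The tree list returned above nests a further tree list inside every character tuple. The following is a minimal standalone sketch, not part of cjklib, of how such a structure could be walked to collect every component at any depth. The function name C{iter_components} is hypothetical, and the sample tree is hand-written for illustration rather than taken from the cjklib database.

```python
# Standalone sketch (not part of cjklib): walk a decomposition tree list.
# Each tree is a list of IDS operators (strings) and
# (char, z_variant, sub_tree_list) tuples.
SAMPLE_TREE_LIST = [
    [u'⿱',
     (u'相', 0, [[u'⿰', (u'木', 0, []), (u'目', 0, [])]]),
     (u'心', 0, [])],
]


def iter_components(tree_list):
    """Yield (char, z_variant) for every component, depth first."""
    for tree in tree_list:
        for node in tree:
            if isinstance(node, tuple):
                char, z_variant, sub_trees = node
                yield (char, z_variant)
                # descend into the component's own decompositions
                for entry in iter_components(sub_trees):
                    yield entry


print(list(iter_components(SAMPLE_TREE_LIST)))
```

Components without further decomposition carry an empty list, so the recursion ends there without a special case.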


    def isComponentInCharacter(self, component, char, locale=None, zVariant=0,
        componentZVariant=None):
        """
        Checks if the character C{char} contains the character C{component}
        as a component.

        @type component: str
        @param component: character questioned to be a component
        @type char: str
        @param char: Chinese character
        @type locale: str
        @param locale: I{character locale} (one out of TCJKV). Giving the locale
            will apply the default I{Z-variant} defined by
            L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
            C{zVariant} will be ignored.
        @type zVariant: int
        @param zVariant: I{Z-variant} of the character C{char}
        @type componentZVariant: int
        @param componentZVariant: Z-variant of the component; if left out every
            Z-variant matches for that character.
        @rtype: bool
        @return: C{True} if C{component} is a component of the given character,
            C{False} otherwise
        @raise ValueError: if an invalid I{character locale} is specified
        @todo Impl: Implement means to check if the component is really not
            found, or if our data is just insufficient.
        """
        if locale != None:
            try:
                zVariant = self.getLocaleDefaultZVariant(char, locale)
            except exception.NoInformationError:
                # TODO no way to check if our data is insufficient
                return False

        # if table exists use it to speed up look up
        if self.hasComponentLookup:
            table = self.db.tables['ComponentLookup']
            zVariants = self.db.selectScalars(
                select([table.c.ComponentZVariant],
                    and_(table.c.ChineseCharacter == char,
                        table.c.ZVariant == zVariant,
                        table.c.Component == component)))
            # coerce to bool so an empty result yields False, not []
            return bool(zVariants) and (componentZVariant == None \
                or componentZVariant in zVariants)
        else:
            # use slow way with going through the decomposition tree
            # get decomposition for the first character from table
            for componentsList in self.getDecompositionEntries(char,
                zVariant=zVariant):
                # go through decomposition and check for components
                for charComponent in componentsList:
                    if type(charComponent) == type(()):
                        character, characterZVariant = charComponent
                        if character != u'？':
                            # check if character and Z-variant match
                            if character == component \
                                and (componentZVariant == None or
                                    characterZVariant == componentZVariant):
                                return True
                            # else recursively step into decomposition of
                            # current component; the searched component stays
                            # the first argument
                            if self.isComponentInCharacter(component, character,
                                zVariant=characterZVariant,
                                componentZVariant=componentZVariant):
                                return True
            return False
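The slow path above searches the decomposition recursively: a component is found if it appears in the first layer, or inside the decomposition of any first layer component. The following is a minimal standalone sketch of that search over an in-memory dictionary, not part of cjklib. The name C{is_component_in} and the decomposition data are hypothetical and only illustrative.

```python
# Standalone sketch (not part of cjklib): recursive component search.
# Hypothetical first layer decompositions keyed by (char, z_variant).
DECOMPOSITIONS = {
    (u'想', 0): [[u'⿱', (u'相', 0), (u'心', 0)]],
    (u'相', 0): [[u'⿰', (u'木', 0), (u'目', 0)]],
}


def is_component_in(component, char, z_variant=0, component_z_variant=None):
    """Return True if component occurs anywhere inside char."""
    for entry in DECOMPOSITIONS.get((char, z_variant), []):
        for node in entry:
            if not isinstance(node, tuple):
                continue  # IDS operator, nothing to compare
            sub_char, sub_z = node
            # direct match in this layer
            if sub_char == component and (component_z_variant is None
                                          or sub_z == component_z_variant):
                return True
            # otherwise look for the same component one layer deeper
            if is_component_in(component, sub_char, sub_z,
                               component_z_variant):
                return True
    return False
```

Note that the searched component stays the first argument at every level of the recursion; only the character being decomposed changes.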

