18 """
19 Provides the central Chinese character based functions.
20 """
21
22
23 from sqlalchemy import select, union
24 from sqlalchemy.sql import and_, or_, not_
25
26 from cjklib import reading
27 from cjklib import exception
28 from cjklib import dbconnector
31 u"""
32 CharacterLookup provides access to lookup methods related to Han characters.
33
34 The real system of CharacterLookup lies in the database beneath where all
35 relevant data is stored. So for nearly all methods this class needs access
36 to a database. Thus on initialisation of the object a connection to a
37 database is established, the logic for this provided by the
38 L{DatabaseConnector}.
39
40 See the L{DatabaseConnector} for supported database systems.
41
42 CharacterLookup will try to read the config file from either /etc or the
user's home folder. If none is present it will try to open a SQLite database
44 stored as C{db} in the same folder by default. You can override this
45 behaviour by specifying additional parameters on creation of the object.
46
47 Examples
48 ========
49 The following examples should give a quick view into how to use this
50 package.
51 - Create the CharacterLookup object with default settings
52 (read from cjklib.conf or 'cjklib.db' in same directory as default):
53
54 >>> from cjklib import characterlookup
55 >>> cjk = characterlookup.CharacterLookup()
56
- Get a list of characters that are pronounced "국" in Korean:
58
59 >>> cjk.getCharactersForReading(u'국', 'Hangul')
60 [u'匊', u'國', u'局', u'掬', u'菊', u'跼', u'鞠', u'鞫', u'麯', u'麴']
61
62 - Check if a character is included in another character as a component:
63
64 >>> cjk.isComponentInCharacter(u'女', u'好')
65 True
66
67 - Get all Kangxi radical variants for Radical 184 (⾷) under the
68 traditional locale:
69
70 >>> cjk.getKangxiRadicalVariantForms(184, 'T')
71 [u'\u2ede', u'\u2edf']
72
73 X{Character locale}
74 ===================
As characters developed in different cultures, their appearance changed
over time to such an extent that the handling of radicals, character
components and strokes needs to be distinguished depending on the locale.
79
80 To deal with this circumstance I{CharacterLookup} works with a character
81 locale. Most of the methods of this class ask for a locale to be specified.
82 In these cases the output of the method depends on the specified locale.
83
84 For example in the traditional locale 这 has 8 strokes, but in
85 simplified Chinese it has only 7, as the radical ⻌ has different stroke
86 counts, depending on the locale.
87
88 X{Z-variant}s
89 =============
90 One feature of Chinese characters is the glyph form describing the visual
91 representation. This feature doesn't need to be unique and so many
92 characters can be found in different writing variants e.g. character 福
93 (English: luck) which has numerous forms.
94
The Unicode Consortium does not encode the same character in different
actual shapes (called I{Z-variant}s) in the Unicode standard, except for a
few "double" entries which are included to maintain backward compatibility.
In fact a code point represents an abstract character and does not define
any visual representation. Thus a distinct appearance description including
strokes and stroke order cannot simply be assigned to a code point; instead
one needs to deal with the notion of I{Z-variants} representing distinct
glyphs to which a visual description can be applied.
103
The name Z-variant is derived from the three-dimensional model representing
the space of characters relative to three axes: the X axis representing the
semantic space, the Y axis representing the abstract shape space and the Z
axis representing typeface differences (see "Principles of Han Unification"
in: The Unicode Standard 5.0, chapter 12). Character presentations differing
only in the Z dimension are generally unified.
110
111 cjklib tries to offer a simple approach to handle different Z-variants. As
112 character components, strokes and the stroke order depend on this variant,
113 methods dealing with this kind will ask for a I{Z-variant} value to be
114 specified. In these cases the output of the method depends on the specified
115 variant.
116
117 Z-variants and character locales
118 --------------------------------
Deviating stroke count, stroke order or decomposition into character
components for different I{character locales} is implemented using different
I{Z-variant}s. For the example given above, the entry 这 with 8 strokes is
given as one Z-variant and the form with 7 strokes is given as another
Z-variant.
124
125 In most cases one might only be interested in a single visual appearance,
126 the "standard" one. This visual appearance would be the one generally used
127 in the specific locale.
128
129 Instead of specifying a certain Z-variant most functions will allow for
130 passing of a character locale. Giving the locale will apply the default
131 Z-variant given by the mapping defined in the database which can be obtained
132 by calling L{getLocaleDefaultZVariant()}.
133
More complex relations, such as which of several Z-variants for a given
character are used in a given locale, are not covered.
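
The resolution order described above can be sketched with plain
dictionaries. This is only an illustration written for Python 3: the two
tables, the Z-variant indices 0 and 1 and the helper names are hypothetical
and do not reflect the real database schema or the cjklib API. Only the
stroke counts for 这 are taken from the example given earlier.

```python
# Hypothetical in-memory stand-ins for the database tables.
# (character, locale) -> default Z-variant index (indices are made up)
LOCALE_DEFAULT_Z_VARIANT = {('这', 'T'): 0, ('这', 'C'): 1}
# (character, Z-variant) -> stroke count
STROKE_COUNT = {('这', 0): 8, ('这', 1): 7}


def default_z_variant(char, locale):
    # Validate the locale, then look up the default glyph for it.
    locale = locale.upper()
    if locale not in set('TCJKV'):
        raise ValueError('invalid character locale: ' + locale)
    return LOCALE_DEFAULT_Z_VARIANT[(char, locale)]


def stroke_count(char, locale=None, z_variant=0):
    # A given locale takes precedence over an explicitly passed Z-variant.
    if locale is not None:
        z_variant = default_z_variant(char, locale)
    return STROKE_COUNT[(char, z_variant)]
```

Passing a locale overrides any Z-variant given alongside it, which mirrors
the behaviour documented for the locale-aware methods of this class.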
136
137 Kangxi radical functions
138 ========================
Using the Unihan database, queries about the Kangxi radical of characters
can be made. It is possible to get the Kangxi radical for a character or to
look up all characters for a given radical.
143
144 Unicode has extra code points for radical forms (e.g. ⾔), here called
145 X{Unicode radical form}s, and radical variant forms (e.g. ⻈), here called
146 X{Unicode radical variant}s. These characters should be used when explicitly
147 referring to their function as radicals.
For most of the radicals and variants there exist complementary character
forms which have the same appearance (e.g. 言 and 讠) and which shall be
called X{equivalent character}s here.

Mapping from one side to the other is not trivial, as some forms exist only
as radical forms and some only as character forms that are, by their
meaning, used in the radical context (called X{isolated radical character}s
here, e.g. 訁 for Kangxi radical 149).

Additionally a one-to-one mapping can't be guaranteed, as some forms have
two or more equivalent forms in another domain, and mapping is highly
dependent on the locale.

CharacterLookup provides methods for dealing with these different kinds of
characters and the mapping between them.
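
As a rough illustration of mapping a Unicode radical form to an equivalent
character, the following Python 3 sketch relies on the compatibility
decompositions that Unicode assigns to the Kangxi Radicals block (starting
at U+2F00). It is not how CharacterLookup performs the mapping: the helper
names are hypothetical, the locale is ignored, and many radical variants
outside that block carry no such decomposition, which is one reason a
database-backed mapping is needed.

```python
import unicodedata


def kangxi_radical_form(radical_index):
    # The Kangxi Radicals block lists the 214 radicals in order from U+2F00.
    if not 1 <= radical_index <= 214:
        raise ValueError('radical index must be between 1 and 214')
    return chr(0x2F00 + radical_index - 1)


def equivalent_character(radical_form):
    # NFKC normalisation applies the compatibility decomposition, if any;
    # forms without one are returned unchanged.
    return unicodedata.normalize('NFKC', radical_form)


form = kangxi_radical_form(149)
print(form, equivalent_character(form))
```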
163
164 X{Character decomposition}
165 ==========================
Many characters can be decomposed into two or more components that are
themselves Chinese characters. This fact can be used in many ways, including
character lookup, finding patterns for font design or studying characters.
Even the stroke order and stroke count can be deduced from the stroke
information of the character's components.

Character decomposition is highly dependent on the appearance of the
character, so both I{Z-variant} and I{character locale} need to be clear
when looking at a decomposition into components.

Further points make this task more complex: decomposition into one set of
components is not unique, as some characters can be broken down into
different sets. Furthermore, sometimes one component can be given, but the
other component will not be encoded as a character in its own right.

These components might in turn be characters that contain further components
(again not unique ones), thus a complex decomposition in several steps is
possible.
184
185 The basis for the character decomposition lies in the database, where all
186 decompositions are stored, using X{Ideographic Description Sequence}s
187 (I{IDS}). These sequences consist of Unicode X{IDS operator}s and characters
188 to describe the structure of the character. There are
189 X{binary IDS operator}s to describe decomposition into two components (e.g.
190 ⿰ for one component left, one right as in 好: ⿰女子) or
191 X{trinary IDS operator}s for decomposition into three components (e.g. ⿲
192 for three components from left to right as in 辨: ⿲⾟刂⾟). Using
I{IDS operator}s it is possible to give basic structural information, which
in many cases is enough, for example, to derive an overall stroke order from
two single sets of stroke orders. Furthermore it is possible to look for
redundant information in different entries, which helps to keep the
definition data clean.
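
Because an IDS is written in prefix notation, it can be turned into a tree
with a small recursive parser. The following Python 3 sketch is only an
illustration under stated assumptions: the function name and the tuple
layout are hypothetical and are not the tree format used by this class. The
operator sets are built from the Unicode range U+2FF0 to U+2FFB, in which
U+2FF2 and U+2FF3 are the two operators taking three components.

```python
# IDS operators occupy U+2FF0..U+2FFB; two of them take three components.
TRINARY_OPERATORS = {chr(0x2FF2), chr(0x2FF3)}
BINARY_OPERATORS = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TRINARY_OPERATORS


def parse_ids(sequence):
    # Returns a plain character, or a nested (operator, [children]) tuple.
    def parse(index):
        if index >= len(sequence):
            raise ValueError('incomplete IDS: ' + sequence)
        char = sequence[index]
        if char in BINARY_OPERATORS:
            arity = 2
        elif char in TRINARY_OPERATORS:
            arity = 3
        else:
            return char, index + 1
        children = []
        index += 1
        for _ in range(arity):
            child, index = parse(index)
            children.append(child)
        return (char, children), index

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError('trailing characters in IDS: ' + sequence)
    return tree
```

Applied to the sequence given for 好 above, the parser yields the operator
together with the two components 女 and 子; nested sequences produce nested
tuples.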
198
199 This class provides methods for retrieving the basic partition entries,
200 lookup of characters by components and decomposing as a tree from the
201 character as a root down to the X{minimal components} as leaf nodes.
202
203 TODO: Policy about what to classify as partition.
204
205 Strokes
206 =======
207 Chinese characters consist of different strokes as basic parts. These
208 strokes are written in a mostly distinct order called the X{stroke order}
209 and have a distinct X{stroke count}.
210
The I{stroke order} in the writing of Chinese characters is important e.g.
for calligraphy or students learning new characters and is normally fixed as
there is only one possible stroke order for each character. Furthermore
there is a fixed set of possible strokes and these strokes carry names.

As with character decomposition the I{stroke order} and I{stroke count} are
highly dependent on the appearance of the character, so both I{Z-variant}
and I{character locale} need to be known.
219
Furthermore the order of strokes can be useful for lookup of characters,
221 and so CharacterLookup provides different methods for getting the stroke
222 count, stroke order, lookup of stroke names and lookup of characters by
223 stroke types and stroke order.
224
225 Most methods work with an abbreviation of stroke names using the first
226 letters of each syllable of the Chinese name in Pinyin.
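
The abbreviation scheme and the counting of strokes in a stroke order
string can be sketched as follows (Python 3). Both helper names are
hypothetical, and the sample stroke order string is made up for
illustration rather than taken from the database. The sketch assumes that
every space or hyphen in a stroke order string separates two single
strokes; for instance héng zhé (横折) would be abbreviated as HZ.

```python
def abbreviate_stroke_name(pinyin_syllables):
    # First letter of each Pinyin syllable, upper-cased: heng zhe -> HZ.
    return ''.join(syllable[0].upper() for syllable in pinyin_syllables)


def count_strokes(stroke_order):
    # Treat spaces and hyphens alike as separators between single strokes.
    return len(stroke_order.replace(' ', '-').split('-'))


print(abbreviate_stroke_name(['heng', 'zhe']))
print(count_strokes('H-S P-N'))
```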
227
The I{stroke order} is not always quite clear and even academics argue about
which order should be considered the correct one, a discussion that
shouldn't be taken lightly. This circumstance should be considered
when working with I{stroke order}s.
232
233 TODO: About plans of cjklib how to support different views on the stroke
234 order
235
236 TODO: About the different classifications of strokes
237
238 Readings
239 ========
240 See module L{reading} for a detailed description.
241
242 @see:
243 - Radicals:
244 U{http://en.wikipedia.org/wiki/Radical_(Chinese_character)}
245 - Z-variants:
246 U{http://www.unicode.org/reports/tr38/tr38-5.html#N10211}
247
248 @todo Fix: Incorporate stroke lookup (bigram) techniques
249 @todo Fix: How to handle character forms (either decomposition or stroke
250 order), that can only be found as a component in other characters? We
251 already mark them by flagging it with an 'S'.
252 @todo Impl: Think about applying locale at object creation time and not
253 passing it on every method call. Would make the class easier to use.
@todo Impl: Create a method for specifying which character range is of
interest for the return values of methods. Narrowing the return results
is a further way towards locale-dependent responses. E.g. cjknife could take
this into account when only displaying characters that can be displayed
with the current locale (BIG5, GBK...).
259 @todo Lang: Add option to component decomposition methods to stop on Kangxi
260 radical forms without breaking further down beyond those.
261 """
262
263 CHARARACTER_READING_MAPPING = {'Hangul': ('CharacterHangul', {}),
264 'Jyutping': ('CharacterJyutping', {'case': 'lower'}),
265 'Pinyin': ('CharacterPinyin', {'toneMarkType': 'Numbers',
266 'case': 'lower'})
267 }
268 """
269 A list of readings for which a character mapping exists including the
270 database's table name and the reading dialect parameters.
271
272 On conversion the first matching reading will be selected, so supplying
273 several equivalent readings has limited use.
274 """
275
def __init__(self, databaseUrl=None, dbConnectInst=None):
277 """
278 Initialises the CharacterLookup.
279
280 If no parameters are given default values are assumed for the connection
281 to the database. The database connection parameters can be given in
282 databaseUrl, or an instance of L{DatabaseConnector} can be passed in
283 dbConnectInst, the latter one being preferred if both are specified.
284
285 @type databaseUrl: str
286 @param databaseUrl: database connection setting in the format
287 C{driver://user:pass@host/database}.
288 @type dbConnectInst: instance
289 @param dbConnectInst: instance of a L{DatabaseConnector}
290 """
291
292 if dbConnectInst:
293 self.db = dbConnectInst
294 else:
295 self.db = dbconnector.DatabaseConnector.getDBConnector(databaseUrl)
296
297 self.readingFactory = None
298
299
300 self.hasComponentLookup = self.db.engine.has_table('ComponentLookup')
301 self.hasStrokeCount = self.db.engine.has_table('StrokeCount')
302
304 """
305 Gets the L{ReadingFactory} instance.
306
307 @rtype: instance
308 @return: a L{ReadingFactory} instance.
309 """
310
311 if not self.readingFactory:
312 self.readingFactory = reading.ReadingFactory(dbConnectInst=self.db)
313 return self.readingFactory
314
315
316
318 """
Gets all known characters for the given reading.
320
321 @type readingString: str
322 @param readingString: reading string for lookup
323 @type readingN: str
324 @param readingN: name of reading
325 @param options: additional options for handling the reading input
326 @rtype: list of str
327 @return: list of characters for the given reading
328 @raise UnsupportedError: if no mapping between characters and target
329 reading exists.
330 @raise ConversionError: if conversion from the internal source reading
331 to the given target reading fails.
332 """
333
334
335 compatReading = self._getCompatibleCharacterReading(readingN)
336 tableName, compatOptions \
337 = self.CHARARACTER_READING_MAPPING[compatReading]
338
339
340
341 readingFactory = self._getReadingFactory()
342 if readingN != compatReading \
343 or readingFactory.isReadingConversionSupported(readingN, readingN):
344 readingString = readingFactory.convert(readingString, readingN,
345 compatReading, sourceOptions=options,
346 targetOptions=compatOptions)
347
348
349 table = self.db.tables[tableName]
350 return self.db.selectScalars(select([table.c.ChineseCharacter],
351 table.c.Reading==readingString).order_by(table.c.ChineseCharacter))
352
354 """
Gets all known readings for the character in the given target reading.
356
357 @type char: str
358 @param char: Chinese character for lookup
359 @type readingN: str
360 @param readingN: name of target reading
361 @param options: additional options for handling the reading output
@rtype: list of str
363 @return: list of readings for the given character
364 @raise UnsupportedError: if no mapping between characters and target
365 reading exists.
366 @raise ConversionError: if conversion from the internal source reading
367 to the given target reading fails.
368 """
369
370
371 compatReading = self._getCompatibleCharacterReading(readingN, False)
372 tableName, compatOptions \
373 = self.CHARARACTER_READING_MAPPING[compatReading]
374 readingFactory = self._getReadingFactory()
375
376
377 table = self.db.tables[tableName]
378 readings = self.db.selectScalars(select([table.c.Reading],
379 table.c.ChineseCharacter==char).order_by(table.c.Reading))
380
381
382 if compatReading != readingN \
383 or readingFactory.isReadingConversionSupported(readingN, readingN):
384
385
386 transReadings = []
387 for readingString in readings:
388 readingString = readingFactory.convert(readingString,
389 compatReading, readingN, sourceOptions=compatOptions,
390 targetOptions=options)
391 if readingString not in transReadings:
392 transReadings.append(readingString)
393 return transReadings
394 else:
395 return readings
396
398 """
Gets a reading for which a mapping to Chinese characters is supported and
that is compatible (a conversion is supported) with the given reading.
401
402 @type readingN: str
403 @param readingN: name of reading
404 @type toCharReading: bool
405 @param toCharReading: C{True} if conversion is done in direction to the
406 given reading, C{False} otherwise
407 @rtype: str
408 @return: a reading that is compatible to the given one and where
409 character lookup is supported
410 @raise UnsupportedError: if no mapping between characters and target
411 reading exists.
412 """
413
414
415 for characterReading in self.CHARARACTER_READING_MAPPING.keys():
416 if readingN == characterReading:
417 return characterReading
418 elif toCharReading:
419 if self._getReadingFactory().isReadingConversionSupported(
420 readingN, characterReading):
421 return characterReading
422 elif not toCharReading:
423 if self._getReadingFactory().isReadingConversionSupported(
424 characterReading, readingN):
425 return characterReading
426 raise exception.UnsupportedError("reading '" + readingN \
427 + "' not supported for character lookup")
428
429
430
432 """
433 Gets the locale search value for a database lookup on databases with
I{character locale} dependent content.
435
436 @type locale: str
437 @param locale: I{character locale} (one out of TCJKV)
438 @rtype: str
439 @return: search locale used for SQL select
440 @raise ValueError: if invalid I{character locale} specified
441 @todo Fix: This probably requires a full table scan
442 """
443 locale = locale.upper()
if locale not in set('TCJKV'):
445 raise ValueError("'" + locale + "' is not a valid character locale")
446 return '%' + locale + '%'
447
448
449
451 """
452 Gets the variant forms of the given type for the character.
453
454 The type can be one out of:
455 - C, I{compatible character} form (if character was added to Unicode
456 to maintain compatibility and round-trip convertibility)
457 - M, I{semantic variant} forms, which are often used interchangeably
458 instead of the character.
459 - P, I{specialised semantic variant} forms, which are often used
460 interchangeably instead of the character but limited to certain
461 contexts.
462 - Z, I{Z-variant} forms, which only differ in typeface (and would
463 have been unified if not to maintain round trip convertibility)
464 - S, I{simplified Chinese character} forms, originating from the
465 character simplification process of the PR China.
466 - T, I{traditional character} forms for a
467 I{simplified Chinese character}.
468
469 Variants depend on the locale which is not taken into account here. Thus
some of the returned characters might only be variants under some
locales.
472
473 @type char: str
474 @param char: Chinese character
475 @type variantType: str
476 @param variantType: type of variant(s) to be returned
477 @rtype: list of str
478 @return: list of character variant(s) of given type
479
480 @todo Docu: Write about different kinds of variants
481 @todo Impl: Give a source on variant information as information can
482 contradict itself
483 (U{http://www.unicode.org/reports/tr38/tr38-5.html#N10211}). See
484 呆 (U+5446) which has one form each for semantic and specialised
485 semantic, each derived from a different source. Change also in
486 L{getAllCharacterVariants()}.
@todo Lang: What is the difference between Z-variants and
488 compatible variants? Some links between two characters are
489 bidirectional, some not. Is there any rule?
490 """
491 variantType = variantType.upper()
492 if not variantType in set('CMPZST'):
493 raise ValueError("'" + variantType \
494 + "' is not a valid variant type")
495
496 table = self.db.tables['CharacterVariant']
497 return self.db.selectScalars(select([table.c.Variant],
498 and_(table.c.ChineseCharacter == char,
499 table.c.Type == variantType)).order_by(table.c.Variant))
500
502 """
503 Gets all variant forms regardless of the type for the character.
504
505 A list of tuples is returned, including the character and its variant
506 type. See L{getCharacterVariants()} for variant types.
507
508 Variants depend on the locale which is not taken into account here. Thus
some of the returned characters might only be variants under some
locales.
511
512 @type char: str
513 @param char: Chinese character
514 @rtype: list of tuple
515 @return: list of character variant(s) with their type
516 """
517 table = self.db.tables['CharacterVariant']
518 return self.db.selectRows(select([table.c.Variant, table.c.Type],
519 table.c.ChineseCharacter == char).order_by(table.c.Variant))
520
522 """
523 Gets the default Z-variant for the given character under the given
524 locale.
525
526 The Z-variant returned is an index to the internal database of different
527 character glyphs and represents the most common glyph used under the
528 given locale.
529
530 @type char: str
531 @param char: Chinese character
532 @type locale: str
533 @param locale: I{character locale} (one out of TCJKV)
534 @rtype: int
535 @return: Z-variant
536 @raise NoInformationError: if no Z-variant information is available
537 @raise ValueError: if invalid I{character locale} specified
538 """
539 table = self.db.tables['LocaleCharacterVariant']
540 zVariant = self.db.selectScalar(select([table.c.ZVariant],
541 and_(table.c.ChineseCharacter == char,
542 table.c.Locale.like(self._locale(locale))))\
543 .order_by(table.c.ZVariant))
544
if zVariant is not None:
546 return zVariant
547 else:
548
549 return self.getCharacterZVariants(char)[0]
550
552 """
553 Gets a list of character Z-variant indices (glyphs) supported by the
554 database.
555
556 A Z-variant index specifies a particular character glyph which is needed
by several glyph-dependent methods instead of the abstract character
558 defined by Unicode.
559
560 @type char: str
561 @param char: Chinese character
562 @rtype: list of int
563 @return: list of supported Z-variants
564 @raise NoInformationError: if no Z-variant information is available
565 """
566
567 table = self.db.tables['ZVariants']
568 result = self.db.selectScalars(select([table.c.ZVariant],
569 table.c.ChineseCharacter == char).order_by(table.c.ZVariant))
570 if not result:
571 raise exception.NoInformationError(
572 "No Z-variant information available for '" + char + "'")
573
574 return result
575
576
577
578
580 """
581 Gets the stroke count for the given character.
582
583 @type char: str
584 @param char: Chinese character
585 @type locale: str
586 @param locale: I{character locale} (one out of TCJKV). Giving the locale
587 will apply the default I{Z-variant} defined by
588 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
589 C{zVariant} will be ignored.
590 @type zVariant: int
591 @param zVariant: I{Z-variant} of the first character
592 @rtype: int
593 @return: stroke count of given character
594 @raise NoInformationError: if no stroke count information available
595 @raise ValueError: if an invalid I{character locale} is specified
596 @attention: The quality of the returned data depends on the sources used
597 when compiling the database. Unihan itself only gives very general
598 stroke order information without being bound to a specific glyph.
599 """
if locale is not None:
601 zVariant = self.getLocaleDefaultZVariant(char, locale)
602
603
604 if self.hasStrokeCount:
605 table = self.db.tables['StrokeCount']
606 result = self.db.selectScalar(select([table.c.StrokeCount],
607 and_(table.c.ChineseCharacter == char,
608 table.c.ZVariant == zVariant)))
609 if not result:
610 raise exception.NoInformationError(
611 "Character has no stroke count information")
612 return result
613 else:
614
615
616 try:
617 so = self.getStrokeOrder(char, zVariant=zVariant)
618 strokeList = so.replace(' ', '-').split('-')
619 return len(strokeList)
620 except exception.NoInformationError:
621 raise exception.NoInformationError(
622 "Character has no stroke count information")
623
625 """
626 Gets the full stroke count table from the database.
627
628 @rtype: dict
@return: dictionary mapping (character, Z-variant) pairs to their stroke
count
631 @attention: The quality of the returned data depends on the sources used
632 when compiling the database. Unihan itself only gives very general
633 stroke order information without being bound to a specific glyph.
634 """
635 table = self.db.tables['StrokeCount']
636 result = self.db.selectRows(select(
637 [table.c.ChineseCharacter, table.c.ZVariant, table.c.StrokeCount]))
638 return dict([((char, zVariant), strokeCount) \
639 for char, zVariant, strokeCount in result])
640
927 _strokeLookup = None
928 """A dictionary containing stroke forms for stroke abbreviations."""
930 """
931 Gets the stroke form for the given abbreviated name (e.g. 'HZ').
932
933 @type abbrev: str
934 @param abbrev: abbreviated stroke name
935 @rtype: str
936 @return: Unicode stroke character
937 @raise ValueError: if invalid stroke abbreviation is specified
938 """
939
940 if not self._strokeLookup:
941 self._strokeLookup = {}
942 table = self.db.tables['Strokes']
943 result = self.db.selectRows(select(
944 [table.c.Stroke, table.c.StrokeAbbrev]))
945 for stroke, strokeAbbrev in result:
946 self._strokeLookup[strokeAbbrev] = stroke
if abbrev in self._strokeLookup:
948 return self._strokeLookup[abbrev]
949 else:
950 raise ValueError(abbrev + " is no valid stroke abbreviation")
951
953 u"""
954 Gets the stroke form for the given name (e.g. '横折').
955
956 @type name: str
957 @param name: Chinese name of stroke
958 @rtype: str
959 @return: Unicode stroke char
960 @raise ValueError: if invalid stroke name is specified
961 """
962 table = self.db.tables['Strokes']
963 stroke = self.db.selectScalar(select([table.c.Stroke],
964 table.c.Name == name))
965 if stroke:
966 return stroke
967 else:
968 raise ValueError(name + " is no valid stroke name")
969
971 """
972 Gets the stroke order sequence for the given character.
973
974 The stroke order is constructed using the character decomposition into
components. As the stroke order information for some components might not
be obtainable, the returned stroke order might be partial.
977
978 @type char: str
979 @param char: Chinese character
980 @type locale: str
981 @param locale: I{character locale} (one out of TCJKV). Giving the locale
982 will apply the default I{Z-variant} defined by
983 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
984 C{zVariant} will be ignored.
985 @type zVariant: int
986 @param zVariant: I{Z-variant} of the first character
987 @rtype: str
988 @return: string of stroke abbreviations separated by spaces and hyphens.
989 @raise ValueError: if an invalid I{character locale} is specified
990 @raise NoInformationError: if no stroke order information available
991 @todo Lang: Add stroke order source to stroke order data so that in
992 general different and contradicting stroke order information can be
993 given. The user then could prefer several sources that in the order
994 given would be queried.
995 """
996 def getStrokeOrderEntry(char, zVariant):
997 """
998 Gets the stroke order sequence for the given character from the
999 database's stroke order lookup table.
1000
1001 @type char: str
1002 @param char: Chinese character
1003 @type zVariant: int
1004 @param zVariant: I{Z-variant} of the first character
1005 @rtype: str
1006 @return: string of stroke abbreviations separated by spaces and
1007 hyphens.
1008 @raise NoInformationError: if no stroke order information available
1009 @raise ValueError: if an invalid I{character locale} is specified
1010 """
1011 table = self.db.tables['StrokeOrder']
1012 result = self.db.selectScalar(select([table.c.StrokeOrder],
1013 and_(table.c.ChineseCharacter == char,
1014 table.c.ZVariant == zVariant), distinct=True))
1015 if not result:
1016 raise exception.NoInformationError(
1017 "Character has no stroke order information")
1018 return result
1019
1020 def getFromDecomposition(decompositionTreeList):
1021 """
1022 Gets stroke order from the tree of a single partition entry.
1023
1024 @type decompositionTreeList: list
1025 @param decompositionTreeList: list of decomposition trees to derive
1026 the stroke order from
1027 @rtype: str
1028 @return: string of stroke abbreviations separated by spaces and
1029 hyphens.
1030 @raise NoInformationError: if no stroke order information available
1031 """
1032
1033 def getFromEntry(subTree, index=0):
1034 """
1035 Goes through a single layer of a tree recursively.
1036
1037 @type subTree: list
1038 @param subTree: decomposition tree to derive the stroke order
1039 from
1040 @type index: int
1041 @param index: index of current layer
1042 @rtype: str
1043 @return: string of stroke abbreviations separated by spaces and
1044 hyphens.
1045 @raise NoInformationError: if no stroke order information
1046 available
1047 """
1048 strokeOrder = []
if not isinstance(subTree[index], tuple):
1050
1051 character = subTree[index]
1052 if self.isBinaryIDSOperator(character):
1053
1054
1055 if character in [u'⿴', u'⿻']:
1056 raise exception.NoInformationError(
1057 "Character has no stroke order information")
1058 else:
1059 if character in [u'⿺', u'⿶']:
1060
1061 subSequence = [1, 0]
1062 else:
1063
1064 subSequence = [0, 1]
1065
1066 subStrokeOrder = []
1067 for i in range(0,2):
1068 so, index = getFromEntry(subTree, index+1)
1069 subStrokeOrder.append(so)
1070
1071 for seq in subSequence:
1072 strokeOrder.append(subStrokeOrder[seq])
1073 elif self.isTrinaryIDSOperator(character):
1074
1075 for i in range(0,3):
1076 so, index = getFromEntry(subTree, index+1)
1077 strokeOrder.append(so)
1078 else:
1079
1080 char, charZVariant, componentTree = subTree[index]
1081
1082 if char == u'?':
1083 raise exception.NoInformationError(
1084 "Character has no stroke order information")
1085 else:
1086
1087 so = getStrokeOrderEntry(char, charZVariant)
1088 if not so:
1089
1090 so = getFromDecomposition(componentTree)
1091 strokeOrder.append(so)
1092 return (' '.join(strokeOrder), index)
1093
1094
1095
1096
1097
1098
1099 strokeOrder = ''
1100 for decomposition in decompositionTreeList:
1101 try:
1102 so, i = getFromEntry(decomposition)
1103 if len(so) >= len(strokeOrder):
1104 strokeOrder = so
1105 except exception.NoInformationError:
1106 pass
1107 if not strokeOrder:
1108 raise exception.NoInformationError(
1109 "Character has no stroke order information")
1110 return strokeOrder
1111
if locale is not None:
1113 zVariant = self.getLocaleDefaultZVariant(char, locale)
1114
1115 try:
1116 strokeOrder = getStrokeOrderEntry(char, zVariant)
1117 return strokeOrder
1118 except exception.NoInformationError:
1119 pass
1120
1121 decompositionTreeList = self.getDecompositionTreeList(char,
1122 zVariant=zVariant)
1123 strokeOrder = getFromDecomposition(decompositionTreeList)
1124 return strokeOrder
1125
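# A minimal, self-contained sketch of how stroke orders of components can be
# combined along a prefix-notation IDS sequence, mirroring the idea used above.
# This is an illustration only, not cjklib code: the name
# ``combine_stroke_order`` and the ``known`` mapping are hypothetical, the
# stroke strings are placeholders rather than real stroke data, and components
# are plain strings here instead of (char, zVariant, tree) tuples.

```python
BINARY = set(u'⿰⿱⿴⿵⿶⿷⿸⿹⿺⿻')
TRINARY = set(u'⿲⿳')


def combine_stroke_order(ids, known, index=0):
    """Return (stroke order, index of the last consumed item)."""
    node = ids[index]
    if node in BINARY or node in TRINARY:
        if node in (u'⿴', u'⿻'):
            # Enclosed and overlaid layouts interleave strokes, so the order
            # cannot be derived by concatenating the parts.
            raise LookupError("no stroke order for layout %s" % node)
        parts = []
        for _ in range(2 if node in BINARY else 3):
            part, index = combine_stroke_order(ids, known, index + 1)
            parts.append(part)
        if node in (u'⿺', u'⿶'):
            # For these layouts the second component is written first.
            parts.reverse()
        return ' '.join(parts), index
    if node not in known:
        raise LookupError("no stroke order for component %s" % node)
    return known[node], index


# Placeholder data: 'A', 'B' and 'C' stand in for real components.
known = {'A': 'a1 a2', 'B': 'b1', 'C': 'c1'}
order, last = combine_stroke_order([u'⿱', 'A', u'⿰', 'B', 'C'], known)
print(order)
```

# The returned index lets the caller continue after a nested sub-sequence,
# which is why the recursion hands it back together with the stroke order.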
1126
1127
1128
1130 """
1131 Gets the Kangxi radical index for the given character as defined by the
1132 I{Unihan} database.
1133
1134 @type char: str
1135 @param char: Chinese character
1136 @rtype: int
1137 @return: Kangxi radical index
1138 @raise NoInformationError: if no Kangxi radical index information for
1139 given character
1140 """
1141 table = self.db.tables['CharacterKangxiRadical']
1142 result = self.db.selectScalar(select([table.c.RadicalIndex],
1143 table.c.ChineseCharacter == char))
1144 if not result:
1145 raise exception.NoInformationError(
1146 "Character has no Kangxi radical information")
1147 return result
1148
1151 u"""
1152 Gets the Kangxi radical form (either a I{Unicode radical form} or a
1153 I{Unicode radical variant}) found as a component in the character and
1154 the stroke count of the residual character components.
1155
        The representation of the included radical or radical variant form
        depends on the respective character variant, so the form's Z-variant
        is returned as well. Some characters include the given radical more
        than once, and in some cases the representation differs between
        those occurrences. In the general case several matches can therefore
        be returned, each entry with a different radical form Z-variant. In
        these cases the entries are sorted by their Z-variant.

        Some characters include both the radical form and a variant form of
        the radical (e.g. 伦: 人 and 亻). In these cases both are returned.
1167
1168 This method will return radical forms regardless of the selected locale,
1169 e.g. radical ⻔ is returned for character 间, though this variant form is
1170 not recognised under a traditional locale (like the character itself).
1171
1172 @type char: str
1173 @param char: Chinese character
1174 @type locale: str
1175 @param locale: I{character locale} (one out of TCJKV). Giving the locale
1176 will apply the default I{Z-variant} defined by
1177 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
1178 C{zVariant} will be ignored.
1179 @type zVariant: int
1180 @param zVariant: I{Z-variant} of the first character
1181 @rtype: list of tuple
1182 @return: list of radical/variant form, its Z-variant, the main layout of
1183 the character (using a I{IDS operator}), the position of the radical
1184 wrt. layout (0, 1 or 2) and the residual stroke count.
1185 @raise NoInformationError: if no stroke count information available
1186 @raise ValueError: if an invalid I{character locale} is specified
1187 """
1188 radicalIndex = self.getCharacterKangxiRadicalIndex(char)
1189 entries = self.getCharacterRadicalResidualStrokeCount(char,
1190 radicalIndex, locale, zVariant)
1191 if entries:
1192 return entries
1193 else:
1194 raise exception.NoInformationError(
1195 "Character has no radical form information")
1196
1199 u"""
1200 Gets the radical form (either a I{Unicode radical form} or a
1201 I{Unicode radical variant}) found as a component in the character and
1202 the stroke count of the residual character components.
1203
1204 This is a more general version of
1205 L{getCharacterKangxiRadicalResidualStrokeCount()} which is not limited
1206 to the mapping of characters to a Kangxi radical as done by Unihan.
1207
1208 @type char: str
1209 @param char: Chinese character
1210 @type radicalIndex: int
1211 @param radicalIndex: radical index
1212 @type locale: str
1213 @param locale: I{character locale} (one out of TCJKV). Giving the locale
1214 will apply the default I{Z-variant} defined by
1215 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
1216 C{zVariant} will be ignored.
1217 @type zVariant: int
1218 @param zVariant: I{Z-variant} of the first character
1219 @rtype: list of tuple
1220 @return: list of radical/variant form, its Z-variant, the main layout of
1221 the character (using a I{IDS operator}), the position of the radical
1222 wrt. layout (0, 1 or 2) and the residual stroke count.
1223 @raise NoInformationError: if no stroke count information available
1224 @raise ValueError: if an invalid I{character locale} is specified
1225 @todo Lang: Clarify on characters classified under a given radical
1226 but without any proper radical glyph found as component.
1227 @todo Lang: Clarify on different radical zVariants for the same radical
1228 form. At best this method should return one and only one radical
1229 form (glyph).
1230 @todo Impl: Give the I{Unicode radical form} and not the equivalent
1231 character form in the relevant table as to always return the pure
1232 radical form (also avoids duplicates). Then state:
1233
1234 If the included component has an appropriate I{Unicode radical form}
1235 or I{Unicode radical variant}, then this form is returned. In either
1236 case the radical form can be an ordinary character.
1237 """
1238 if locale != None:
1239 zVariant = self.getLocaleDefaultZVariant(char, locale)
1240 table = self.db.tables['CharacterRadicalResidualStrokeCount']
1241 entries = self.db.selectRows(select([table.c.RadicalForm,
1242 table.c.RadicalZVariant, table.c.MainCharacterLayout,
1243 table.c.RadicalRelativePosition, table.c.ResidualStrokeCount],
1244 and_(table.c.ChineseCharacter == char, table.c.ZVariant == zVariant,
1245 table.c.RadicalIndex == radicalIndex)).order_by(
1246 table.c.ResidualStrokeCount, table.c.RadicalZVariant,
1247 table.c.RadicalForm, table.c.MainCharacterLayout,
1248 table.c.RadicalRelativePosition))
1249
1250 if entries:
1251 return entries
1252 else:
1253 raise exception.NoInformationError(
1254 "Character has no radical form information")
1255
1257 """
1258 Gets the full table of radical forms (either a I{Unicode radical form}
1259 or a I{Unicode radical variant}) found as a component in the character
1260 and the stroke count of the residual character components from the
1261 database.
1262
1263 A typical entry looks like
1264 C{(u'众', 0): {9: [(u'人', 0, u'⿱', 0, 4), (u'人', 0, u'⿻', 0, 4)]}},
1265 and can be accessed as C{radicalDict[(u'众', 0)][9]} with the Chinese
1266 character, its Z-variant and Kangxi radical index. The values are given
1267 in the order I{radical form}, I{radical Z-variant}, I{character layout},
1268 I{relative position of the radical} and finally the
1269 I{residual stroke count}.
1270
1271 @rtype: dict
1272 @return: dictionary of radical/residual stroke count entries.
1273 """
1274 radicalDict = {}
1275
1276 table = self.db.tables['CharacterRadicalResidualStrokeCount']
1277 entries = self.db.selectRows(select([table.c.ChineseCharacter,
1278 table.c.ZVariant, table.c.RadicalIndex, table.c.RadicalForm,
1279 table.c.RadicalZVariant, table.c.MainCharacterLayout,
1280 table.c.RadicalRelativePosition, table.c.ResidualStrokeCount])\
1281 .order_by(table.c.ResidualStrokeCount, table.c.RadicalZVariant,
1282 table.c.RadicalForm, table.c.MainCharacterLayout,
1283 table.c.RadicalRelativePosition))
        for entry in entries:
            char, zVariant, radicalIndex, radicalForm, radicalZVariant, \
                mainCharacterLayout, radicalRelativePosition, \
                residualStrokeCount = entry

            # create the nested dictionaries on first use
            if (char, zVariant) not in radicalDict:
                radicalDict[(char, zVariant)] = {}

            if radicalIndex not in radicalDict[(char, zVariant)]:
                radicalDict[(char, zVariant)][radicalIndex] = []

            radicalDict[(char, zVariant)][radicalIndex].append(
                (radicalForm, radicalZVariant, mainCharacterLayout,
                    radicalRelativePosition, residualStrokeCount))
1298
1299 return radicalDict
1300
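# The grouping done above can be sketched without a database. This is an
# illustration only: ``build_radical_dict`` is a hypothetical helper, and the
# sample rows simply restate the example entry from the docstring.

```python
from collections import defaultdict


def build_radical_dict(rows):
    """Group flat rows by (char, zVariant), then by radical index."""
    grouped = defaultdict(lambda: defaultdict(list))
    for char, z_variant, radical_index, *entry in rows:
        grouped[(char, z_variant)][radical_index].append(tuple(entry))
    # Convert to plain dicts so that missing keys raise KeyError again.
    return {key: dict(value) for key, value in grouped.items()}


rows = [
    (u'众', 0, 9, u'人', 0, u'⿱', 0, 4),
    (u'众', 0, 9, u'人', 0, u'⿻', 0, 4),
]
radical_dict = build_radical_dict(rows)
print(radical_dict[(u'众', 0)][9])
```

# The entries keep the order in which the rows arrive, so any ordering has to
# be applied to the rows beforehand, as the ORDER BY clause does in the query.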
1303 u"""
1304 Gets the stroke count of the residual character components when leaving
1305 aside the radical form.
1306
        This method returns a subset of the data returned by
        L{getCharacterKangxiRadicalResidualStrokeCount()}. It may nevertheless
        offer more entries, as there might exist information only about the
        residual stroke count, but not about the concrete radical form.
1311
1312 @type char: str
1313 @param char: Chinese character
1314 @type locale: str
1315 @param locale: I{character locale} (one out of TCJKV). Giving the locale
1316 will apply the default I{Z-variant} defined by
1317 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
1318 C{zVariant} will be ignored.
1319 @type zVariant: int
1320 @param zVariant: I{Z-variant} of the first character
1321 @rtype: int
1322 @return: residual stroke count
1323 @raise NoInformationError: if no stroke count information available
1324 @raise ValueError: if an invalid I{character locale} is specified
1325 @attention: The quality of the returned data depends on the sources used
1326 when compiling the database. Unihan itself only gives very general
1327 stroke order information without being bound to a specific glyph.
1328 """
1329 radicalIndex = self.getCharacterKangxiRadicalIndex(char)
1330 return self.getCharacterResidualStrokeCount(char, radicalIndex,
1331 locale, zVariant)
1332
1335 u"""
1336 Gets the stroke count of the residual character components when leaving
1337 aside the radical form.
1338
1339 This is a more general version of
1340 L{getCharacterKangxiResidualStrokeCount()} which is not limited to the
1341 mapping of characters to a Kangxi radical as done by Unihan.
1342
1343 @type char: str
1344 @param char: Chinese character
1345 @type radicalIndex: int
1346 @param radicalIndex: radical index
1347 @type locale: str
1348 @param locale: I{character locale} (one out of TCJKV). Giving the locale
1349 will apply the default I{Z-variant} defined by
1350 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
1351 C{zVariant} will be ignored.
1352 @type zVariant: int
1353 @param zVariant: I{Z-variant} of the first character
1354 @rtype: int
1355 @return: residual stroke count
1356 @raise NoInformationError: if no stroke count information available
1357 @raise ValueError: if an invalid I{character locale} is specified
1358 @attention: The quality of the returned data depends on the sources used
1359 when compiling the database. Unihan itself only gives very general
1360 stroke order information without being bound to a specific glyph.
1361 """
1362 if locale != None:
1363 zVariant = self.getLocaleDefaultZVariant(char, locale)
1364 table = self.db.tables['CharacterResidualStrokeCount']
1365 entry = self.db.selectScalar(select([table.c.ResidualStrokeCount],
1366 and_(table.c.ChineseCharacter == char, table.c.ZVariant == zVariant,
1367 table.c.RadicalIndex == radicalIndex)))
1368 if entry != None:
1369 return entry
1370 else:
1371 raise exception.NoInformationError(
1372 "Character has no residual stroke count information")
1373
1375 """
1376 Gets the full table of stroke counts of the residual character
1377 components from the database.
1378
1379 A typical entry looks like C{(u'众', 0): {9: [4]}},
1380 and can be accessed as C{residualCountDict[(u'众', 0)][9]} with the
1381 Chinese character, its Z-variant and Kangxi radical index which then
1382 gives the I{residual stroke count}.
1383
1384 @rtype: dict
1385 @return: dictionary of radical/residual stroke count entries.
1386 """
1387 residualCountDict = {}
1388
1389 table = self.db.tables['CharacterResidualStrokeCount']
1390 entries = self.db.selectRows(select([table.c.ChineseCharacter,
1391 table.c.ZVariant, table.c.RadicalIndex,
1392 table.c.ResidualStrokeCount]))
1393 for entry in entries:
1394 char, zVariant, radicalIndex, residualStrokeCount = entry
1395
1396 if (char, zVariant) not in residualCountDict:
1397 residualCountDict[(char, zVariant)] = {}
1398
1399 residualCountDict[(char, zVariant)][radicalIndex] \
1400 = residualStrokeCount
1401
1402 return residualCountDict
1403
1405 """
1406 Gets all characters for the given Kangxi radical index.
1407
1408 @type radicalIndex: int
1409 @param radicalIndex: Kangxi radical index
1410 @rtype: list of str
1411 @return: list of matching Chinese characters
1412 @todo Docu: Write about how Unihan maps characters to a Kangxi radical.
1413 Especially Chinese simplified characters.
1414 @todo Lang: 6954 characters have no Kangxi radical. Provide integration
1415 for these (SELECT COUNT(*) FROM Unihan WHERE kRSUnicode IS NOT NULL
1416 AND kRSKangxi IS NULL;).
1417 """
1418 table = self.db.tables['CharacterKangxiRadical']
1419 return self.db.selectScalars(select([table.c.ChineseCharacter],
1420 table.c.RadicalIndex == radicalIndex))
1421
1423 """
1424 Gets all characters for the given radical index.
1425
        This is a more general version of
        L{getCharactersForKangxiRadicalIndex()} which is not limited to the
        mapping of characters to a Kangxi radical as done by Unihan. A
        character can therefore show up under several different radical
        indices.
1430
1431 @type radicalIndex: int
1432 @param radicalIndex: Kangxi radical index
1433 @rtype: list of str
1434 @return: list of matching Chinese characters
1435 """
1436 table = self.db.tables['CharacterResidualStrokeCount']
1437 return self.db.selectScalars(select([table.c.ChineseCharacter],
1438 table.c.RadicalIndex == radicalIndex))
1439
1441 """
1442 Gets all characters and residual stroke count for the given Kangxi
1443 radical index.
1444
1445 This brings together methods L{getCharactersForKangxiRadicalIndex()} and
1446 L{getCharacterResidualStrokeCountDict()} and reports all characters
1447 including the given Kangxi radical, additionally supplying the residual
1448 stroke count.
1449
1450 @type radicalIndex: int
1451 @param radicalIndex: Kangxi radical index
1452 @rtype: list of tuple
1453 @return: list of matching Chinese characters with residual stroke count
1454 """
1455 kangxiTable = self.db.tables['CharacterKangxiRadical']
1456 residualTable = self.db.tables['CharacterResidualStrokeCount']
1457 return self.db.selectRows(select([residualTable.c.ChineseCharacter,
1458 residualTable.c.ResidualStrokeCount],
1459 kangxiTable.c.RadicalIndex == radicalIndex,
1460 from_obj=[residualTable.join(kangxiTable,
1461 and_(residualTable.c.ChineseCharacter \
1462 == kangxiTable.c.ChineseCharacter,
1463 residualTable.c.RadicalIndex \
1464 == kangxiTable.c.RadicalIndex))]))
1465
1467 """
1468 Gets all characters and residual stroke count for the given radical
1469 index.
1470
1471 This brings together methods L{getCharactersForRadicalIndex()} and
1472 L{getCharacterResidualStrokeCountDict()} and reports all characters
1473 including the given radical without being limited to the mapping of
1474 characters to a Kangxi radical as done by Unihan, additionally supplying
1475 the residual stroke count.
1476
1477 @type radicalIndex: int
1478 @param radicalIndex: Kangxi radical index
1479 @rtype: list of tuple
1480 @return: list of matching Chinese characters with residual stroke count
1481 """
1482 table = self.db.tables['CharacterResidualStrokeCount']
1483 return self.db.selectRows(
1484 select([table.c.ChineseCharacter, table.c.ResidualStrokeCount],
1485 table.c.RadicalIndex == radicalIndex))
1486
1487
1488
1489
1521
1546
1548 """
1549 Gets the Kangxi radical index for the given form.
1550
        The given form might either be a I{Unicode radical form} or an
        I{equivalent character}.
1553
1554 If there is an entry for the given radical form it still might not be a
1555 radical under the given character locale. So specifying a locale allows
1556 strict radical handling.
1557
1558 @type radicalForm: str
1559 @param radicalForm: radical form
1560 @type locale: str
1561 @param locale: optional I{character locale} (one out of TCJKV)
1562 @rtype: int
1563 @return: Kangxi radical index
1564 @raise ValueError: if invalid I{character locale} or radical form is
1565 specified
1566 """
1567
1568 if locale:
1569 locale = self._locale(locale)
1570 else:
1571 locale = '%'
1572
1573 table = self.db.tables['KangxiRadical']
1574 result = self.db.selectScalar(select([table.c.RadicalIndex],
1575 and_(table.c.Form == radicalForm, table.c.Locale.like(locale))))
1576 if result:
1577 return result
1578 else:
1579
1580 kangxiTable = self.db.tables['KangxiRadical']
1581 equivalentTable = self.db.tables['RadicalEquivalentCharacter']
1582 result = self.db.selectScalars(select([kangxiTable.c.RadicalIndex],
1583 and_(equivalentTable.c.EquivalentForm == radicalForm,
1584 equivalentTable.c.Locale.like(locale),
1585 kangxiTable.c.Locale.like(locale)),
1586 from_obj=[kangxiTable.join(equivalentTable,
1587 kangxiTable.c.Form == equivalentTable.c.Form)]))
1588
1589 if result:
1590 return result[0]
1591 else:
1592
1593 table = self.db.tables['KangxiRadicalIsolatedCharacter']
1594 result = self.db.selectScalar(select([table.c.RadicalIndex],
1595 and_(table.c.EquivalentForm == radicalForm,
1596 table.c.Locale.like(locale))))
1597 if result:
1598 return result
        raise ValueError(radicalForm + " is no valid Kangxi radical," \
            + " variant form or equivalent character")
1601
1603 u"""
1604 Gets a list of characters that represent the radical for the given
1605 Kangxi radical index.
1606
1607 This includes the radical form(s), character equivalents
1608 and variant forms and equivalents.
1609
1610 E.g. character for I{to speak/to say/talk/word} (Pinyin I{yán}):
1611 ⾔ (0x2f94), 言 (0x8a00), ⻈ (0x2ec8), 讠 (0x8ba0), 訁 (0x8a01)
1612
1613 @type radicalIdx: int
1614 @param radicalIdx: Kangxi radical index
1615 @type locale: str
1616 @param locale: I{character locale} (one out of TCJKV)
1617 @rtype: list of str
1618 @return: list of Chinese characters representing the radical for the
1619 given index, including Unicode radical and variant forms and their
1620 equivalent real character forms
1621 @raise ValueError: if invalid I{character locale} specified
1622 """
1623 kangxiTable = self.db.tables['KangxiRadical']
1624 equivalentTable = self.db.tables['RadicalEquivalentCharacter']
1625 isolatedTable = self.db.tables['KangxiRadicalIsolatedCharacter']
1626
1627 return self.db.selectScalars(union(
1628 select([kangxiTable.c.Form],
1629 and_(kangxiTable.c.RadicalIndex == radicalIdx,
1630 kangxiTable.c.Locale.like(self._locale(locale)))),
1631
1632 select([equivalentTable.c.EquivalentForm],
1633 and_(kangxiTable.c.RadicalIndex == radicalIdx,
1634 equivalentTable.c.Locale.like(self._locale(locale)),
1635 kangxiTable.c.Locale.like(self._locale(locale))),
1636 from_obj=[kangxiTable.join(equivalentTable,
1637 kangxiTable.c.Form == equivalentTable.c.Form)]),
1638
1639 select([isolatedTable.c.EquivalentForm],
1640 and_(isolatedTable.c.RadicalIndex == radicalIdx,
1641 isolatedTable.c.Locale.like(self._locale(locale))))))
1642
1668
1670 """
1671 Checks if the given character is a I{Unicode radical form} or
1672 I{Unicode radical variant}.
1673
        This method only performs a quick check of the Unicode code point
        range, so there is no guarantee that the form actually has a radical
        entry in the database.
1676
1677 @type char: str
1678 @param char: Chinese character
1679 @rtype: bool
1680 @return: C{True} if given form is a radical form, C{False} otherwise
1681 """
1682
1683
1684 return char >= u'⺀' and char <= u'⿕'
1685
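# A small standalone sketch of the range check above, written with explicit
# code points. ``is_radical_char`` is a hypothetical name used for
# illustration; the bounds are U+2E80 (first code point of the CJK Radicals
# Supplement block) and U+2FD5 (last assigned Kangxi radical).

```python
def is_radical_char(char):
    """Check whether char lies in the Unicode radical code point range."""
    return u'\u2e80' <= char <= u'\u2fd5'


# The forms below are the ones listed for radical "speech" in this module.
print(is_radical_char(u'⾔'))  # Kangxi radical form, U+2F94
print(is_radical_char(u'⻈'))  # radical variant, U+2EC8
print(is_radical_char(u'言'))  # equivalent character, U+8A00
```

# Note that a pure range check also accepts unassigned code points inside the
# range (for example U+2E9A), which is one more reason why a match does not
# guarantee a database entry.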
1724
1755
1756
1757
1758
    IDS_BINARY = [u'⿰', u'⿱', u'⿴', u'⿵', u'⿶', u'⿷', u'⿸', u'⿹', u'⿺',
        u'⿻']
    """
    A list of I{binary IDS operator}s used to describe character decompositions.
    """
    IDS_TRINARY = [u'⿲', u'⿳']
    """
    A list of I{trinary IDS operator}s used to describe character
    decompositions.
    """

    @classmethod
    def isBinaryIDSOperator(cls, char):
        """
        Checks if given character is a I{binary IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{binary IDS operator}, C{False} otherwise
        """
        return char in set(cls.IDS_BINARY)

    @classmethod
    def isTrinaryIDSOperator(cls, char):
        """
        Checks if given character is a I{trinary IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{trinary IDS operator}, C{False} otherwise
        """
        return char in set(cls.IDS_TRINARY)

    @classmethod
    def isIDSOperator(cls, char):
        """
        Checks if given character is an I{IDS operator}.

        @type char: str
        @param char: Chinese character
        @rtype: bool
        @return: C{True} if I{IDS operator}, C{False} otherwise
        """
        return cls.isBinaryIDSOperator(char) or cls.isTrinaryIDSOperator(char)

1805
    def getCharactersForComponents(self, componentList, locale,
        includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False):
1808 u"""
1809 Gets all characters that contain the given components.
1810
        If option C{includeEquivalentRadicalForms} is set, all equivalent forms
        will be searched for when a Kangxi radical is given.
1813
1814 @type componentList: list of str
1815 @param componentList: list of character components
1816 @type locale: str
1817 @param locale: I{character locale} (one out of TCJKV)
1818 @type includeEquivalentRadicalForms: bool
1819 @param includeEquivalentRadicalForms: if C{True} then characters in the
1820 given component list are interpreted as representatives for their
1821 radical and all radical forms are included in the search. E.g. 肉
1822 will include ⺼ as a possible component.
1823 @type resultIncludeRadicalForms: bool
1824 @param resultIncludeRadicalForms: if C{True} the result will include
1825 I{Unicode radical forms} and I{Unicode radical variants}
1826 @rtype: list of tuple
1827 @return: list of pairs of matching characters and their Z-variants
1828 @raise ValueError: if an invalid I{character locale} is specified
1829 @todo Impl: Table of same character glyphs, including special radical
1830 forms (e.g. 言 and 訁).
        @todo Data: Adopt locale-dependent Z-variants for parent characters
            (e.g. 鬼 in 隗 愧 嵬).
        @todo Data: Use radical forms and radical variant forms instead of
            equivalent characters in decomposition data. Mapping loses
            information.
1836 @todo Lang: By default we get the equivalent character for a radical
1837 form. In some cases these equivalent characters will be only
1838 abstractly related to the given radical form (e.g. being the main
1839 radical form), so that the result set will be too big and doesn't
1840 reflect the original query. Set up a table including only strict
1841 visual relations between radical forms and equivalent characters.
1842 Alternatively restrict decomposition data to only include radical
1843 forms if appropriate, so there would be no need for conversion.
1844 """
1845 equivCharTable = []
1846 for component in componentList:
1847 try:
1848
1849 radicalIdx = self.getKangxiRadicalIndex(component, locale)
1850
1851 componentEquivalents = [component]
1852 if includeEquivalentRadicalForms:
1853
1854 componentEquivalents = \
1855 self.getKangxiRadicalRepresentativeCharacters(
1856 radicalIdx, locale)
1857 else:
1858 if self.isRadicalChar(component):
1859 try:
1860 componentEquivalents.append(
1861 self.getRadicalFormEquivalentCharacter(
1862 component, locale))
1863 except exception.UnsupportedError:
1864
1865 pass
1866 else:
1867 componentEquivalents.extend(
1868 self.getCharacterEquivalentRadicalForms(component,
1869 locale))
1870 equivCharTable.append(componentEquivalents)
1871 except ValueError:
1872 equivCharTable.append([component])
1873
1874 return self.getCharactersForEquivalentComponents(equivCharTable, locale,
1875 resultIncludeRadicalForms=resultIncludeRadicalForms)
1876
1879 u"""
1880 Gets all characters that contain at least one component per list entry,
1881 sorted by stroke count if available.
1882
        This is the general form of L{getCharactersForComponents()} and allows a
        set of alternative characters per list entry, of which at least one
        must be a component of each returned character.
1886
1887 If a I{character locale} is specified only characters will be returned
1888 for which the locale's default I{Z-variant}'s decomposition will apply
1889 to the given components. Otherwise all Z-variants will be considered.
1890
1891 @type componentConstruct: list of list of str
1892 @param componentConstruct: list of character components given as single
1893 characters or, for alternative characters, given as a list
1894 @type resultIncludeRadicalForms: bool
1895 @param resultIncludeRadicalForms: if C{True} the result will include
1896 I{Unicode radical forms} and I{Unicode radical variants}
1897 @type locale: str
1898 @param locale: I{character locale} (one out of TCJKV)
1899 @rtype: list of tuple
1900 @return: list of pairs of matching characters and their Z-variants
1901 @raise ValueError: if an invalid I{character locale} is specified
1902 """
1903 if not componentConstruct:
1904 return []
1905
1906
1907 lookupTable = self.db.tables['ComponentLookup']
1908 localeTable = self.db.tables['LocaleCharacterVariant']
1909 strokeCountTable = self.db.tables['StrokeCount']
1910
1911 joinTables = []
1912 filters = []
1913
1914
1915 for i, characterList in enumerate(componentConstruct):
1916 lookupTableAlias = lookupTable.alias('s%d' % i)
1917 joinTables.append(lookupTableAlias)
1918
1919 filters.append(or_(lookupTableAlias.c.Component.in_(characterList),
1920 lookupTableAlias.c.ChineseCharacter.in_(characterList)))
1921
1922
1923
1924 if locale:
1925 joinTables.append(localeTable)
1926 filters.append(or_(localeTable.c.Locale == None,
1927 localeTable.c.Locale.like(self._locale(locale))))
1928
1929
1930 if self.hasStrokeCount:
1931 joinTables.append(strokeCountTable)
1932
1933
1934 fromObject = joinTables[0]
1935 for table in joinTables[1:]:
1936 fromObject = fromObject.outerjoin(table,
1937 onclause=and_(
1938 table.c.ChineseCharacter \
1939 == joinTables[0].c.ChineseCharacter,
1940 table.c.ZVariant == joinTables[0].c.ZVariant))
1941
1942 sel = select([joinTables[0].c.ChineseCharacter,
1943 joinTables[0].c.ZVariant], and_(*filters), from_obj=[fromObject],
1944 distinct=True)
1945 if self.hasStrokeCount:
1946 sel = sel.order_by(strokeCountTable.c.StrokeCount)
1947
1948 result = self.db.selectRows(sel)
1949
1950 if not resultIncludeRadicalForms:
1951
1952 result = [(char, zVariant) for char, zVariant in result \
1953 if not self.isRadicalChar(char)]
1954
1955 return result
1956
1958 """
1959 Gets the decomposition of the given character into components from the
1960 database. The resulting decomposition is only the first layer in a tree
1961 of possible paths along the decomposition as the components can be
1962 further subdivided.
1963
        There can be several decompositions for one character so a list of
        decompositions is returned.

        Each entry in the result list consists of a list of characters (each
        with its Z-variant) and IDS operators.
1969
1970 @type char: str
1971 @param char: Chinese character that is to be decomposed into components
1972 @type locale: str
1973 @param locale: I{character locale} (one out of TCJKV). Giving the locale
1974 will apply the default I{Z-variant} defined by
1975 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
1976 C{zVariant} will be ignored.
1977 @type zVariant: int
1978 @param zVariant: I{Z-variant} of the first character
1979 @rtype: list
1980 @return: list of first layer decompositions
1981 @raise ValueError: if an invalid I{character locale} is specified
1982 """
1983 if locale != None:
1984 try:
1985 zVariant = self.getLocaleDefaultZVariant(char, locale)
1986 except exception.NoInformationError:
1987
1988 return []
1989
1990
1991 table = self.db.tables['CharacterDecomposition']
1992 result = self.db.selectScalars(select([table.c.Decomposition],
1993 and_(table.c.ChineseCharacter == char,
1994 table.c.ZVariant == zVariant)).order_by(table.c.SubIndex))
1995
1996
1997 return [self._getDecompositionFromString(decomposition) \
1998 for decomposition in result]
1999
2001 """
2002 Gets the full decomposition table from the database.
2003
2004 @rtype: dict
2005 @return: dictionary with key pair character, Z-variant and the first
2006 layer decomposition as value
2007 """
2008 decompDict = {}
2009
2010 table = self.db.tables['CharacterDecomposition']
2011 entries = self.db.selectRows(select([table.c.ChineseCharacter,
2012 table.c.ZVariant, table.c.Decomposition])\
2013 .order_by(table.c.SubIndex))
2014 for char, zVariant, decomposition in entries:
2015 if (char, zVariant) not in decompDict:
2016 decompDict[(char, zVariant)] = []
2017
2018 decompDict[(char, zVariant)].append(
2019 self._getDecompositionFromString(decomposition))
2020
2021 return decompDict
2022
2024 """
2025 Gets a tuple representation with character/Z-variant of the given
2026 character's decomposition into components.
2027
2028 Example: Entry C{⿱尚[1]儿} will be returned as
2029 C{[u'⿱', (u'尚', 1), (u'儿', 0)]}.
2030
2031 @type decomposition: str
        @param decomposition: character decomposition with IDS operator,
            components and optional Z-variant index
2034 @rtype: list
2035 @return: decomposition with character/Z-variant tuples
2036 """
2037 componentsList = []
2038 index = 0
2039 while index < len(decomposition):
2040 char = decomposition[index]
2041 if self.isIDSOperator(char):
2042 componentsList.append(char)
2043 else:
2044
2045 if index+1 < len(decomposition)\
2046 and decomposition[index+1] == '[':
2047
2048 endIndex = decomposition.index(']', index+1)
2049
2050 charZVariant = int(decomposition[index+2:endIndex])
2051 index = endIndex
2052 else:
2053
2054 charZVariant = 0
2055 componentsList.append((char, charZVariant))
2056 index = index + 1
2057 return componentsList
2058
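# The parsing step above can be tried out in isolation. The following is a
# sketch under the assumption that a Z-variant is written as ``[n]`` directly
# after its character; ``parse_decomposition`` and ``IDS_OPERATORS`` are
# hypothetical stand-ins for the method and class attributes of this module.

```python
IDS_OPERATORS = set(u'⿰⿱⿲⿳⿴⿵⿶⿷⿸⿹⿺⿻')


def parse_decomposition(decomposition):
    """Split an IDS string into operators and (char, zVariant) tuples."""
    components = []
    index = 0
    while index < len(decomposition):
        char = decomposition[index]
        if char in IDS_OPERATORS:
            components.append(char)
        else:
            z_variant = 0
            # An optional Z-variant follows the character in brackets.
            if decomposition[index + 1:index + 2] == '[':
                end = decomposition.index(']', index + 1)
                z_variant = int(decomposition[index + 2:end])
                index = end
            components.append((char, z_variant))
        index += 1
    return components


print(parse_decomposition(u'⿱尚[1]儿'))
```

# Slicing with ``[index + 1:index + 2]`` avoids a separate bounds check for
# the last character of the string.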
2060 """
2061 Gets the decomposition of the given character into components as a list
2062 of decomposition trees.
2063
2064 There can be several decompositions for one character so one tree per
2065 decomposition is returned.
2066
        Each entry in the result list consists of a list of characters (each
        with its Z-variant and a list of further decompositions) and IDS
        operators. If a character can be further subdivided, the list it
        carries is non-empty and includes yet another list of trees for the
        decomposition of the component.
2072
2073 @type char: str
2074 @param char: Chinese character that is to be decomposed into components
2075 @type locale: str
2076 @param locale: I{character locale} (one out of TCJKV). Giving the locale
2077 will apply the default I{Z-variant} defined by
2078 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
2079 C{zVariant} will be ignored.
2080 @type zVariant: int
2081 @param zVariant: I{Z-variant} of the first character
2082 @rtype: list
2083 @return: list of decomposition trees
2084 @raise ValueError: if an invalid I{character locale} is specified
2085 """
2086 if locale != None:
2087 try:
2088 zVariant = self.getLocaleDefaultZVariant(char, locale)
2089 except exception.NoInformationError:
2090
2091 return []
2092
2093 decompositionTreeList = []
2094
2095 for componentsList in self.getDecompositionEntries(char,
2096 zVariant=zVariant):
2097 decompositionTree = []
2098 for component in componentsList:
2099 if type(component) != type(()):
2100
2101 decompositionTree.append(component)
2102 else:
2103
2104 character, characterZVariant = component
2105
2106 componentTree = self.getDecompositionTreeList(character,
2107 zVariant=characterZVariant)
2108 decompositionTree.append((character, characterZVariant,
2109 componentTree))
2110 decompositionTreeList.append(decompositionTree)
2111 return decompositionTreeList
2112
2115 """
        Checks if C{component} is contained in the given character, either
        directly or as part of one of its sub-components.
2118
2119 @type component: str
2120 @param component: character questioned to be a component
2121 @type char: str
2122 @param char: Chinese character
2123 @type locale: str
2124 @param locale: I{character locale} (one out of TCJKV). Giving the locale
2125 will apply the default I{Z-variant} defined by
2126 L{getLocaleDefaultZVariant()}. The Z-variant supplied with option
2127 C{zVariant} will be ignored.
2128 @type zVariant: int
2129 @param zVariant: I{Z-variant} of the first character
2130 @type componentZVariant: int
2131 @param componentZVariant: Z-variant of the component; if left out every
2132 Z-variant matches for that character.
2133 @rtype: bool
2134 @return: C{True} if C{component} is a component of the given character,
2135 C{False} otherwise
2136 @raise ValueError: if an invalid I{character locale} is specified
2137 @todo Impl: Implement means to check if the component is really not
2138 found, or if our data is just insufficient.
2139 """
        if locale != None:
            try:
                zVariant = self.getLocaleDefaultZVariant(char, locale)
            except exception.NoInformationError:
                # no Z-variant information for this locale
                return False

        if self.hasComponentLookup:
            # use the lookup table if it is available
            table = self.db.tables['ComponentLookup']
            zVariants = self.db.selectScalars(
                select([table.c.ComponentZVariant],
                    and_(table.c.ChineseCharacter == char,
                        table.c.ZVariant == zVariant,
                        table.c.Component == component)))
            # return a real bool as documented, not the result list
            return bool(zVariants) and (componentZVariant == None \
                or componentZVariant in zVariants)
        else:
            # otherwise walk the decomposition entries recursively
            for componentsList in self.getDecompositionEntries(char,
                zVariant=zVariant):
                for charComponent in componentsList:
                    if type(charComponent) == type(()):
                        character, characterZVariant = charComponent
                        if character != u'?':
                            # direct match on this layer
                            if character == component \
                                and (componentZVariant == None or
                                    characterZVariant == componentZVariant):
                                return True
                            # look for the component inside the sub-component;
                            # the component stays the first argument
                            if self.isComponentInCharacter(component,
                                character, zVariant=characterZVariant,
                                componentZVariant=componentZVariant):
                                return True
            return False
2179
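# A database-free sketch of the recursive fallback above. It is illustrative
# only: ``DECOMPOSITIONS`` is a tiny hand-written table standing in for the
# CharacterDecomposition data, and ``is_component_in_character`` is a
# hypothetical helper, not the method of this class.

```python
# Toy first-layer decompositions keyed by (char, zVariant).
DECOMPOSITIONS = {
    (u'想', 0): [[u'⿱', (u'相', 0), (u'心', 0)]],
    (u'相', 0): [[u'⿰', (u'木', 0), (u'目', 0)]],
}


def is_component_in_character(component, char, z_variant=0,
        component_z_variant=None):
    """Check recursively whether component occurs inside char."""
    for entry in DECOMPOSITIONS.get((char, z_variant), []):
        for item in entry:
            if not isinstance(item, tuple):
                continue  # skip IDS operators
            sub_char, sub_z = item
            if sub_char == u'?':
                continue  # unknown component, nothing to compare
            if sub_char == component \
                    and component_z_variant in (None, sub_z):
                return True
            # Descend into the sub-component, keeping the argument order.
            if is_component_in_character(component, sub_char, sub_z,
                    component_z_variant):
                return True
    return False


print(is_component_in_character(u'木', u'想'))
print(is_component_in_character(u'女', u'想'))
```

# A ``False`` result only means that the toy table contains no matching path;
# as the todo note above says, it cannot tell a missing component apart from
# missing data.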