
Source Code for Package cjklib.reading

#!/usr/bin/python
# -*- coding: utf-8 -*-
# This file is part of cjklib.
#
# cjklib is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# cjklib is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with cjklib.  If not, see <http://www.gnu.org/licenses/>.

u"""
Provides the Chinese character reading based functions.
This includes L{ReadingOperator}s used to handle basic operations like
decomposing strings written in a reading into their basic entities (e.g.
syllables) and for some languages getting tonal information, syllable onset and
rhyme and other features. Furthermore it includes L{ReadingConverter}s which
offer the conversion of strings from one reading to another.

All basic functionality can be accessed using the L{ReadingFactory} which
provides factory methods for creating instances of the supplied classes and also
acts as a façade for the functions defined there.

Examples
========
The following examples should give a quick view into how to use this
package.
    - Create the ReadingFactory object with default settings
        (read from cjklib.conf or using cjklib.db in same directory as default):

        >>> from cjklib.reading import ReadingFactory
        >>> f = ReadingFactory()

    - Create an operator for Mandarin romanisation Pinyin:

        >>> pinyinOp = f.createReadingOperator('Pinyin')

    - Construct a Pinyin syllable with second tone:

        >>> pinyinOp.getTonalEntity(u'han', 2)
        u'hán'

    - Segment the given Pinyin string into a list of syllables:

        >>> pinyinOp.decompose(u"tiān'ānmén")
        [u'ti\u0101n', u"'", u'\u0101n', u'm\xe9n']

    - Do the same using the factory class as a façade to easily access
        functions provided by those classes in the background:

        >>> f.decompose(u"tiān'ānmén", 'Pinyin')
        [u'ti\u0101n', u"'", u'\u0101n', u'm\xe9n']

    - Convert the given Gwoyeu Romatzyh syllables to their pronunciation in IPA:

        >>> f.convert('liow shu', 'GR', 'MandarinIPA')
        u'li\u0259u\u02e5\u02e9 \u0282u\u02e5\u02e5'

Readings
========
Han characters give only a few visual hints about how they are pronounced. The
large number of homophones further increases the problem of deriving the
character's actual pronunciation from the given glyph. This module implements a
framework and desirable functionality to deal with the characteristics of
X{character reading}s.

From a programmatic point of view, readings in languages making use of Chinese
characters differ in many ways. Some use the Roman alphabet, some have tonal
information, some can be mapped character-wise, some map from one Chinese
character to a sequence of characters in the target system while some map only
to one character.

One major group in the topic of readings are X{romanisations}, which are
transcriptions into the Roman alphabet (or the Cyrillic alphabet, respectively).
Romanisations of tonal languages form a subgroup that calls for even more
detailed functions. The interface implemented here tries to capture similar
factors on different abstraction levels while trying to maintain flexibility.

In the context of this library the term I{reading} will refer to two things: the
system used to express the pronunciation (e.g. the specific romanisation) on
the one hand, and the specific reading of a given character on the other hand.

Technical Implementation
========================
While module L{characterlookup} includes the functions for mapping a character
to its potential reading, module C{reading} specialises in all functionality
that is primarily connected to the reading of characters.

The main functions implemented here provide ways of handling text written in a
reading and converting between different readings.

Handling text written in a reading
----------------------------------
Text written in a I{character reading} differs from other text in that it
consists of entities which map to corresponding Chinese characters. They can be
obtained by breaking the whole string down into a sequence of single
entities. This functionality is provided by all operators on readings through
the interface L{ReadingOperator}. The process of breaking input down
(called decomposition) can be reversed by composing the single entities to a
string.

Many L{ReadingOperator}s provide additional functions, each depending on the
characteristics of the implemented reading. For readings of tonal languages, for
example, they might allow querying the tone of the given reading of a
character.

G{classtree operator.ReadingOperator}

Converting between readings
---------------------------
The second part provides support for conversion between different readings.

What all CJK languages seem to have in common is that the mapping from a
character to its reading is irreversible, as these languages are rich in
homophones. Thus the highest degree of information for a text is obtained by the
pair of characters and their reading (aside from the meaning).

If one has a text written in reading A and wants to obtain the text written
in B instead, then it is not feasible to obtain the reading from the
corresponding characters even if present, as many characters have several
pronunciations. Instead one wants to convert the reading directly from A to B.

Simple means to convert between readings are provided by classes implementing
L{ReadingConverter}. This conversion might be neither surjective nor injective,
and several L{exception}s can occur.

G{classtree converter.ReadingConverter}

Configurable X{Reading Dialect}s
--------------------------------
Many readings come in specific representations even if standardised. This may
start with simple differences in typesetting (e.g. punctuation) or include
special entities and derivatives.

Instead of selecting one default form as a global standard, cjklib lets the user
choose the preferred dialect, while still trying to offer good default values.
It does so by offering a wide range of options for handling and conversion of
readings. These options can be given optionally in many places and are handed
down by the system to the component that knows about the specific configuration
option. Furthermore each class implements a method that states which options it
uses by default.

A special notion of X{dialect converters} is used for L{ReadingConverter}s that
convert between two different representations of the same reading. These allow
flexible switching between reading dialects.
@todo Fix:  Be independent of the locale chosen, see
    U{http://docs.python.org/library/locale.html#background-details-hints-tips-and-caveats}.
"""

__all__ = ['operator', 'converter', 'ReadingFactory']

from cjklib.exception import UnsupportedError
from cjklib.dbconnector import DatabaseConnector
import operator
import converter

class ReadingFactory(object):
    """
    Provides an abstract factory for creating L{ReadingOperator}s and
    L{ReadingConverter}s and a façade to directly access the methods offered by
    these classes.

    Instances of other classes are cached in the background and reused on later
    calls for methods accessed through the façade.
    L{createReadingOperator()} and L{createReadingConverter} can be used to
    create new instances for use outside of the ReadingFactory.
    @todo Impl: What about hiding of inner classes?
        L{_checkSpecialOperators()} method is called for internal converters and
        for external ones delivered by L{createReadingConverter()}. Latter
        method doesn't return internal cached copies though, but creates new
        instances. L{ReadingOperator} also gets copies from ReadingFactory
        objects for internal instances. Sharing saves memory but changing one
        object will affect all other objects using this instance.
    @todo Impl: General reading options given for a converter with **options
        need to be used on creating an operator. How to raise errors to save
        user of specifying an operator twice, one per options, one per concrete
        instance (similar to sourceOptions and targetOptions)?
    @todo Bug:  Non standard reading options seem to be accepted when default in
        converter:

        >>> print f.convert('lao3shi1', 'Pinyin', 'MandarinIPA')
        lau˨˩.ʂʅ˥˥
    """
    READING_OPERATORS = [operator.HangulOperator, operator.PinyinOperator,
        operator.WadeGilesOperator, operator.GROperator,
        operator.MandarinIPAOperator, operator.MandarinBrailleOperator,
        operator.JyutpingOperator, operator.CantoneseYaleOperator,
        operator.CantoneseIPAOperator, operator.HiraganaOperator,
        operator.KatakanaOperator, operator.KanaOperator]
    """A list of supported reading operators."""
    READING_CONVERTERS = [converter.PinyinDialectConverter,
        converter.WadeGilesDialectConverter, converter.PinyinWadeGilesConverter,
        converter.GRDialectConverter, converter.GRPinyinConverter,
        converter.PinyinIPAConverter, converter.PinyinBrailleConverter,
        converter.JyutpingDialectConverter,
        converter.CantoneseYaleDialectConverter,
        converter.JyutpingYaleConverter, converter.BridgeConverter]
    """A list of supported reading converters. """

    sharedState = {'readingOperatorClasses': {}, 'readingConverterClasses': {}}
    """
    Dictionary holding global state information used by all instances of the
    ReadingFactory.
    """

    class SimpleReadingConverterAdaptor(object):
        """
        Defines a simple converter between two I{character reading}s that keeps
        the real converter doing the work in the background.

        The basic method is L{convert()} which converts one input string from
        one reading to another. In contrast to a L{ReadingConverter} no source
        or target reading needs to be specified.
        """
        def __init__(self, converterInst, fromReading, toReading):
            """
            Creates an instance of the SimpleReadingConverterAdaptor.

            @type converterInst: instance
            @param converterInst: L{ReadingConverter} instance doing the actual
                conversion work.
            @type fromReading: str
            @param fromReading: name of reading converted from
            @type toReading: str
            @param toReading: name of reading converted to
            """
            self.converterInst = converterInst
            self.fromReading = fromReading
            self.toReading = toReading
            self.CONVERSION_DIRECTIONS = [(fromReading, toReading)]

        def convert(self, string, fromReading=None, toReading=None):
            """
            Converts a string in the source reading to the target reading.

            If parameters fromReading or toReading are not given the class's
            default values will be applied.

            @type string: str
            @param string: string written in the source reading
            @type fromReading: str
            @param fromReading: name of the source reading
            @type toReading: str
            @param toReading: name of the target reading
            @rtype: str
            @returns: the input string converted to the C{toReading}
            @raise DecompositionError: if the string can not be decomposed into
                basic entities with regards to the source reading.
            @raise ConversionError: on operations specific to the conversion
                between the two readings (e.g. error on converting entities).
            @raise UnsupportedError: if source or target reading not supported
                for conversion.
            """
            if not fromReading:
                fromReading = self.fromReading
            if not toReading:
                toReading = self.toReading
            return self.converterInst.convert(string, fromReading, toReading)

        def convertEntities(self, readingEntities, fromReading=None,
            toReading=None):
            """
            Converts a list of entities in the source reading to the target
            reading.

            If parameters fromReading or toReading are not given the class's
            default values will be applied.

            @type readingEntities: list of str
            @param readingEntities: list of entities written in source reading
            @type fromReading: str
            @param fromReading: name of the source reading
            @type toReading: str
            @param toReading: name of the target reading
            @rtype: list of str
            @return: list of entities written in target reading
            @raise ConversionError: on operations specific to the conversion
                between the two readings (e.g. error on converting entities).
            @raise UnsupportedError: if source or target reading is not
                supported for conversion.
            @raise InvalidEntityError: if an invalid entity is given.
            """
            if not fromReading:
                fromReading = self.fromReading
            if not toReading:
                toReading = self.toReading
            return self.converterInst.convertEntities(readingEntities,
                fromReading, toReading)

        def __getattr__(self, name):
            return getattr(self.converterInst, name)

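The adaptor above fixes one conversion direction and forwards every other attribute lookup to the wrapped converter. The following is a minimal, self-contained Python 3 sketch of that delegation idea; `DemoConverter` and `SimpleAdaptor` are hypothetical stand-ins written for illustration and are not part of cjklib.

```python
class DemoConverter:
    """Hypothetical stand-in for a converter that supports two directions."""
    CONVERSION_DIRECTIONS = [("upper", "lower"), ("lower", "upper")]

    def convert(self, string, fromReading, toReading):
        if (fromReading, toReading) not in self.CONVERSION_DIRECTIONS:
            raise ValueError("unsupported direction")
        return string.lower() if toReading == "lower" else string.upper()

    def describe(self):
        return "demo converter"


class SimpleAdaptor:
    """Fixes one direction and forwards everything else to the wrapped object."""

    def __init__(self, converterInst, fromReading, toReading):
        self.converterInst = converterInst
        self.fromReading = fromReading
        self.toReading = toReading

    def convert(self, string, fromReading=None, toReading=None):
        # Fall back to the direction chosen at construction time.
        return self.converterInst.convert(
            string, fromReading or self.fromReading,
            toReading or self.toReading)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so the adaptor's own
        # attributes take precedence over those of the wrapped converter.
        return getattr(self.converterInst, name)


adaptor = SimpleAdaptor(DemoConverter(), "upper", "lower")
```

Calling `adaptor.convert("ABC")` uses the stored direction, while `adaptor.describe()` is resolved on the wrapped object through `__getattr__`.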
    def __init__(self, databaseUrl=None, dbConnectInst=None):
        """
        Initialises the ReadingFactory.

        If no parameters are given default values are assumed for the connection
        to the database. The database connection parameters can be given in
        databaseUrl, or an instance of L{DatabaseConnector} can be passed in
        dbConnectInst, the latter one being preferred if both are specified.

        @type databaseUrl: str
        @param databaseUrl: database connection setting in the format
            C{driver://user:pass@host/database}.
        @type dbConnectInst: instance
        @param dbConnectInst: instance of a L{DatabaseConnector}
        @bug: Specifying another database connector overwrites settings
            of other instances.
        """
        # rebind shared state variable to make it accessible to all instances
        self.__dict__ = self.sharedState
        # get connector to database
        if dbConnectInst:
            self.db = dbConnectInst
        else:
            self.db = DatabaseConnector.getDBConnector(databaseUrl)
        # create object instance cache if needed, shared with all factories
        #   using the same database connection
        if self.db not in self.sharedState:
            self.sharedState[self.db] = {}
            self.sharedState[self.db]['readingOperatorInstances'] = {}
            self.sharedState[self.db]['readingConverterInstances'] = {}
        # publish default reading operators and converters
        for readingOperator in self.READING_OPERATORS:
            self.publishReadingOperator(readingOperator)
        for readingConverter in self.READING_CONVERTERS:
            self.publishReadingConverter(readingConverter)
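
Rebinding `self.__dict__` to a class-level dictionary makes all instances share their attributes, which is also why the `@bug` noted in the docstring occurs. A minimal Python 3 sketch of that idiom follows; `SharedFactory` and the string values standing in for database connectors are hypothetical.

```python
class SharedFactory:
    """Hypothetical miniature of the shared-state idiom used in __init__."""
    sharedState = {}

    def __init__(self, db=None):
        # Every instance uses the same dictionary as its attribute namespace.
        self.__dict__ = self.sharedState
        # Assigning an attribute therefore changes it for all instances.
        self.db = db if db is not None else "default-db"


first = SharedFactory("db-a")
second = SharedFactory("db-b")
```

After the second construction, `first.db` is `"db-b"` as well: the two objects are distinct, but they read and write the same attribute dictionary.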

    #{ Meta

    def publishReadingOperator(self, readingOperator):
        """
        Publishes a L{ReadingOperator} to the list and thus makes it available
        for other methods in the library.

        @type readingOperator: classobj
        @param readingOperator: a new L{ReadingOperator} to be published
        """
        self.sharedState['readingOperatorClasses']\
            [readingOperator.READING_NAME] = readingOperator

    def getSupportedReadings(self):
        """
        Gets a list of all supported readings.

        @rtype: list of str
        @return: a list of readings a L{ReadingOperator} is available for
        """
        return self.sharedState['readingOperatorClasses'].keys()

    def getReadingOperatorClass(self, readingN):
        """
        Gets the L{ReadingOperator}'s class for the given reading.

        @type readingN: str
        @param readingN: name of a supported reading
        @rtype: classobj
        @return: a L{ReadingOperator} class
        @raise UnsupportedError: if the given reading is not supported.
        """
        if readingN not in self.sharedState['readingOperatorClasses']:
            raise UnsupportedError("reading '" + readingN + "' not supported")
        return self.sharedState['readingOperatorClasses'][readingN]

    def createReadingOperator(self, readingN, **options):
        """
        Creates an instance of a L{ReadingOperator} for the given reading.

        @type readingN: str
        @param readingN: name of a supported reading
        @param options: options for the created instance
        @rtype: instance
        @return: a L{ReadingOperator} instance
        @raise UnsupportedError: if the given reading is not supported.
        """
        operatorClass = self.getReadingOperatorClass(readingN)
        return operatorClass(dbConnectInst=self.db, **options)

    def publishReadingConverter(self, readingConverter):
        """
        Publishes a L{ReadingConverter} to the list and thus makes it available
        for other methods in the library.

        @type readingConverter: classobj
        @param readingConverter: a new L{ReadingConverter} to be published
        """
        for fromReading, toReading in readingConverter.CONVERSION_DIRECTIONS:
            self.sharedState['readingConverterClasses']\
                [(fromReading, toReading)] = readingConverter

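Publishing a converter registers its class once per supported direction, so lookup and support checks reduce to a dictionary access on a `(fromReading, toReading)` tuple. The Python 3 sketch below illustrates this with two hypothetical converter classes; the module-level `registry` and the helper names are assumptions made for the example only.

```python
class DemoConverterA:
    """Hypothetical converter class supporting both directions of one pair."""
    CONVERSION_DIRECTIONS = [("Pinyin", "WadeGiles"), ("WadeGiles", "Pinyin")]


class DemoConverterB:
    """Hypothetical converter class supporting a single direction."""
    CONVERSION_DIRECTIONS = [("Pinyin", "MandarinIPA")]


registry = {}


def publish(converterClass):
    # One entry per direction; a later publication for the same direction
    # replaces the earlier one.
    for direction in converterClass.CONVERSION_DIRECTIONS:
        registry[direction] = converterClass


def is_supported(fromReading, toReading):
    return (fromReading, toReading) in registry


publish(DemoConverterA)
publish(DemoConverterB)
```

Because directions are registered individually, a converter that only lists `("Pinyin", "MandarinIPA")` does not make the reverse direction available.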
    def getReadingConverterClass(self, fromReading, toReading):
        """
        Gets the L{ReadingConverter}'s class for the given source and target
        reading.

        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @rtype: classobj
        @return: a L{ReadingConverter} class
        @raise UnsupportedError: if conversion for the given readings is not
            supported.
        """
        if not self.isReadingConversionSupported(fromReading, toReading):
            raise UnsupportedError("conversion from '" + fromReading \
                + "' to '" + toReading + "' not supported")
        return self.sharedState['readingConverterClasses']\
            [(fromReading, toReading)]

    def createReadingConverter(self, fromReading, toReading, *args, **options):
        """
        Creates an instance of a L{ReadingConverter} for the given source and
        target reading and returns it wrapped as a
        L{SimpleReadingConverterAdaptor}.

        As L{ReadingConverter}s generally support more than one conversion
        direction the user needs to specify which source and target reading is
        needed on a regular instance. Wrapping the created instance in the
        adaptor gives a simple convert() and convertEntities() routine, such
        that on conversion the source and target readings don't have to be
        specified. Other method signatures remain unchanged.

        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @param args: optional list of L{RomanisationOperator}s to use for
            handling source and target readings.
        @param options: options for the created instance
        @keyword hideComplexConverter: if true the L{ReadingConverter} is
            wrapped as a L{SimpleReadingConverterAdaptor} (default).
        @keyword sourceOperators: list of L{ReadingOperator}s used for handling
            source readings.
        @keyword targetOperators: list of L{ReadingOperator}s used for handling
            target readings.
        @keyword sourceOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling source readings. If an
            operator for the source reading is explicitly specified, no options
            can be given.
        @keyword targetOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling target readings. If an
            operator for the target reading is explicitly specified, no options
            can be given.
        @rtype: instance
        @return: a L{SimpleReadingConverterAdaptor} or L{ReadingConverter}
            instance
        @raise UnsupportedError: if conversion for the given readings is not
            supported.
        """
        converterClass = self.getReadingConverterClass(fromReading, toReading)

        self._checkSpecialOperators(fromReading, toReading, args, options)

        converterInst = converterClass(dbConnectInst=self.db, *args, **options)
        if 'hideComplexConverter' not in options \
            or options['hideComplexConverter']:
            return ReadingFactory.SimpleReadingConverterAdaptor(
                converterInst=converterInst, fromReading=fromReading,
                toReading=toReading)
        else:
            return converterInst

    def isReadingConversionSupported(self, fromReading, toReading):
        """
        Checks if the conversion from reading A to reading B is supported.

        @rtype: bool
        @return: true if conversion is supported, false otherwise
        """
        return (fromReading, toReading) \
            in self.sharedState['readingConverterClasses']

    def getDefaultOptions(self, *args):
        """
        Returns the default options for the L{ReadingOperator} or
        L{ReadingConverter} applied for the given reading name or names
        respectively.

        The keyword 'dbConnectInst' is not regarded a configuration option and
        is thus not included in the dict returned.

        @raise ValueError: if the number of reading names given is neither one
            nor two.
        @raise UnsupportedError: if no ReadingOperator or ReadingConverter
            exists for the given reading or readings respectively.
        """
        # self is named explicitly, as the body relies on it
        if len(args) == 1:
            return self.getReadingOperatorClass(args[0]).getDefaultOptions()
        elif len(args) == 2:
            return self.getReadingConverterClass(args[0], args[1])\
                .getDefaultOptions()
        else:
            raise ValueError("Wrong number of arguments")

    def _getReadingOperatorInstance(self, readingN, **options):
        """
        Returns an instance of a L{ReadingOperator} for the given reading from
        the internal cache and creates it if it doesn't exist yet.

        @type readingN: str
        @param readingN: name of a supported reading
        @param options: additional options for instance
        @rtype: instance
        @return: a L{ReadingOperator} instance
        @raise UnsupportedError: if the given reading is not supported.
        @todo Impl: Get all options when calculating key for an instance and use
            the information on standard parameters thus minimising instances in
            cache. Same for L{_getReadingConverterInstance()}.
        """
        # construct key for lookup in cache
        cacheKey = (readingN, self._getHashableCopy(options))
        # get cache
        instanceCache = self.sharedState[self.db]['readingOperatorInstances']
        if cacheKey not in instanceCache:
            # local name chosen so that module 'operator' is not shadowed
            readingOp = self.createReadingOperator(readingN, **options)
            instanceCache[cacheKey] = readingOp
        return instanceCache[cacheKey]

    def _getReadingConverterInstance(self, fromReading, toReading, *args,
        **options):
        """
        Returns an instance of a L{ReadingConverter} for the given source and
        target reading from the internal cache and creates it if it doesn't
        exist yet.

        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @param args: optional list of L{RomanisationOperator}s to use for
            handling source and target readings.
        @param options: additional options for instance
        @keyword sourceOperators: list of L{ReadingOperator}s used for handling
            source readings.
        @keyword targetOperators: list of L{ReadingOperator}s used for handling
            target readings.
        @keyword sourceOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling source readings. If an
            operator for the source reading is explicitly specified, no options
            can be given.
        @keyword targetOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling target readings. If an
            operator for the target reading is explicitly specified, no options
            can be given.
        @rtype: instance
        @return: a L{ReadingConverter} instance
        @raise UnsupportedError: if conversion for the given readings is not
            supported.
        @todo Fix : Reusing of instances for other supported conversion
            directions isn't that efficient if a special ReadingOperator is
            specified for one direction, that doesn't affect others.
        """
        self._checkSpecialOperators(fromReading, toReading, args, options)

        # construct key for lookup in cache
        cacheKey = ((fromReading, toReading), self._getHashableCopy(options))
        # get cache
        instanceCache = self.sharedState[self.db]['readingConverterInstances']
        if cacheKey not in instanceCache:
            conv = self.createReadingConverter(fromReading, toReading,
                hideComplexConverter=False, *args, **options)
            # use instance for all supported conversion directions
            for convFromReading, convToReading in conv.CONVERSION_DIRECTIONS:
                oCacheKey = ((convFromReading, convToReading),
                    self._getHashableCopy(options))
                if oCacheKey not in instanceCache:
                    instanceCache[oCacheKey] = conv
        return instanceCache[cacheKey]

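The cache above stores one converter instance under a key for every direction it supports, so a later request for the reverse direction with the same options reuses that instance. The following Python 3 sketch shows the idea in isolation; `DemoConverter`, `get_instance` and the flat `frozenset` option key are simplifications assumed for the example.

```python
class DemoConverter:
    """Hypothetical converter supporting both directions between A and B."""
    CONVERSION_DIRECTIONS = [("A", "B"), ("B", "A")]

    def __init__(self, **options):
        self.options = options


instanceCache = {}


def get_instance(fromReading, toReading, **options):
    # Assumes flat, hashable option values; nested options would need a
    # recursive conversion first.
    optionsKey = frozenset(options.items())
    cacheKey = ((fromReading, toReading), optionsKey)
    if cacheKey not in instanceCache:
        conv = DemoConverter(**options)
        # Register the new instance for every direction it supports,
        # without replacing entries that already exist.
        for direction in conv.CONVERSION_DIRECTIONS:
            instanceCache.setdefault((direction, optionsKey), conv)
    return instanceCache[cacheKey]
```

Requests that differ in their options get separate instances, which matches the `@todo` remark that default-valued options are not yet normalised when building the key.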
    def _checkSpecialOperators(self, fromReading, toReading, args, options):
        """
        Checks for special operators requested for the given source and target
        reading.

        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @param args: optional list of L{RomanisationOperator}s to use for
            handling source and target readings.
        @param options: additional options for handling the input
        @keyword sourceOperators: list of L{ReadingOperator}s used for handling
            source readings.
        @keyword targetOperators: list of L{ReadingOperator}s used for handling
            target readings.
        @keyword sourceOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling source readings. If an
            operator for the source reading is explicitly specified, no options
            can be given.
        @keyword targetOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling target readings. If an
            operator for the target reading is explicitly specified, no options
            can be given.
        @raise ValueError: if options are given to create a specific
            ReadingOperator, but an instance is already given in C{args}.
        @raise UnsupportedError: if source or target reading is not supported.
        """
        # check options, don't overwrite existing operators
        for arg in args:
            # ReadingOperator is defined in module 'operator' imported above
            if isinstance(arg, operator.ReadingOperator):
                if arg.READING_NAME == fromReading \
                    and 'sourceOptions' in options:
                    raise ValueError(
                        "source reading operator options given, " \
                        + "but a source reading operator already exists")
                if arg.READING_NAME == toReading \
                    and 'targetOptions' in options:
                    raise ValueError(
                        "target reading operator options given, " \
                        + "but a target reading operator already exists")
        # create operators for options
        if 'sourceOptions' in options:
            readingOp = self._getReadingOperatorInstance(fromReading,
                **options['sourceOptions'])
            del options['sourceOptions']

            # add reading operator to converter
            if 'sourceOperators' not in options:
                options['sourceOperators'] = []
            options['sourceOperators'].append(readingOp)

        if 'targetOptions' in options:
            readingOp = self._getReadingOperatorInstance(toReading,
                **options['targetOptions'])
            del options['targetOptions']

            # add reading operator to converter
            if 'targetOperators' not in options:
                options['targetOperators'] = []
            options['targetOperators'].append(readingOp)
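
The second half of the method rewrites the options dictionary in place: a `sourceOptions` or `targetOptions` entry is removed and replaced by an operator instance appended to `sourceOperators` or `targetOperators`. A compact Python 3 sketch of that transformation is given below; `fold_options` and `make_demo_operator` are hypothetical helpers, and the option name `strict` is an invented placeholder.

```python
def fold_options(fromReading, toReading, options, make_operator):
    """Replace sourceOptions/targetOptions by operator instances, in place."""
    pairs = (("sourceOptions", "sourceOperators", fromReading),
             ("targetOptions", "targetOperators", toReading))
    for optionsKey, operatorsKey, readingN in pairs:
        if optionsKey in options:
            # Build the operator from the options and drop the options entry.
            readingOp = make_operator(readingN, **options.pop(optionsKey))
            options.setdefault(operatorsKey, []).append(readingOp)
    return options


def make_demo_operator(readingN, **opts):
    # Stand-in for an operator instance: a tuple describing what was asked for.
    return (readingN, tuple(sorted(opts.items())))


options = {"sourceOptions": {"strict": True}, "other": 1}
fold_options("Pinyin", "GR", options, make_demo_operator)
```

Since the dictionary is modified rather than copied, the caller sees the folded form afterwards, which is what lets a second check on the same dictionary pass without creating the operator again.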

    @staticmethod
    def _getHashableCopy(data):
        """
        Constructs a unique hashable (partially deep-)copy for a given instance,
        replacing non-hashable datatypes C{set}, C{dict} and C{list}
        recursively.

        @param data: non-hashable object
        @return: hashable object, C{set} converted to a C{frozenset}, C{dict}
            converted to a C{frozenset} of key-value-pairs (tuple), and C{list}
            converted to a C{tuple}.
        """
        if type(data) == type([]) or type(data) == type(()):
            newList = []
            for entry in data:
                newList.append(ReadingFactory._getHashableCopy(entry))
            return tuple(newList)
        elif type(data) == type(set([])):
            newSet = set([])
            for entry in data:
                newSet.add(ReadingFactory._getHashableCopy(entry))
            return frozenset(newSet)
        elif type(data) == type({}):
            newDict = {}
            for key in data:
                newDict[key] = ReadingFactory._getHashableCopy(data[key])
            return frozenset(newDict.items())
        else:
            return data
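
The recursive conversion exists so that an options dictionary can take part in a cache key. A Python 3 sketch of the same idea follows; `hashable_copy` is a hypothetical rewrite that uses `isinstance` (and so also accepts subclasses, unlike the exact type comparison above), and the option names are invented for the example.

```python
def hashable_copy(data):
    """Return a hashable stand-in for nested lists, tuples, sets and dicts."""
    if isinstance(data, (list, tuple)):
        return tuple(hashable_copy(entry) for entry in data)
    if isinstance(data, (set, frozenset)):
        return frozenset(hashable_copy(entry) for entry in data)
    if isinstance(data, dict):
        # Key order does not matter, as the pairs end up in a frozenset.
        return frozenset(
            (key, hashable_copy(value)) for key, value in data.items())
    return data


options = {"someOptions": {"flag": True}, "entries": ["a", "b"]}
key = ("Pinyin", hashable_copy(options))
cache = {key: "operator instance"}
```

Two option dictionaries with the same content yield equal keys regardless of insertion order, so they hit the same cache entry.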

    #}
    #{ ReadingConverter methods

    def convert(self, readingStr, fromReading, toReading, *args, **options):
        """
        Converts the given string in the source reading to the given target
        reading.

        @type readingStr: str
        @param readingStr: string that needs to be converted
        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @param args: optional list of L{RomanisationOperator}s to use for
            handling source and target readings.
        @param options: additional options for handling the input
        @keyword sourceOperators: list of L{ReadingOperator}s used for handling
            source readings.
        @keyword targetOperators: list of L{ReadingOperator}s used for handling
            target readings.
        @keyword sourceOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling source readings. If an
            operator for the source reading is explicitly specified, no options
            can be given.
        @keyword targetOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling target readings. If an
            operator for the target reading is explicitly specified, no options
            can be given.
        @rtype: str
        @return: the converted string
        @raise DecompositionError: if the string can not be decomposed into
            basic entities with regards to the source reading or the given
            information is insufficient.
        @raise ConversionError: on operations specific to the conversion between
            the two readings (e.g. error on converting entities).
        @raise UnsupportedError: if source or target reading is not supported
            for conversion.
        """
        readingConv = self._getReadingConverterInstance(fromReading, toReading,
            *args, **options)
        return readingConv.convert(readingStr, fromReading, toReading)

    def convertEntities(self, readingEntities, fromReading, toReading, *args,
        **options):
        """
        Converts a list of entities in the source reading to the given target
        reading.

        @type readingEntities: list of str
        @param readingEntities: list of entities written in source reading
        @type fromReading: str
        @param fromReading: name of the source reading
        @type toReading: str
        @param toReading: name of the target reading
        @param args: optional list of L{RomanisationOperator}s to use for
            handling source and target readings.
        @param options: additional options for handling the input
        @keyword sourceOperators: list of L{ReadingOperator}s used for handling
            source readings.
        @keyword targetOperators: list of L{ReadingOperator}s used for handling
            target readings.
        @keyword sourceOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling source readings. If an
            operator for the source reading is explicitly specified, no options
            can be given.
        @keyword targetOptions: dictionary of options to configure the
            L{ReadingOperator}s used for handling target readings. If an
            operator for the target reading is explicitly specified, no options
            can be given.
        @rtype: list of str
        @return: list of entities written in target reading
        @raise ConversionError: on operations specific to the conversion between
            the two readings (e.g. error on converting entities).
        @raise UnsupportedError: if source or target reading is not supported
            for conversion.
        @raise InvalidEntityError: if an invalid entity is given.
        """
        readingConv = self._getReadingConverterInstance(fromReading, toReading,
            *args, **options)
        return readingConv.convertEntities(readingEntities, fromReading,
            toReading)

    #}
    #{ ReadingOperator methods

    def decompose(self, string, readingN, **options):
        """
        Decomposes the given string into basic entities that can be mapped to
        one Chinese character each for the given reading.

        The given input string can contain other non-reading characters, e.g.
        punctuation marks.

        The returned list contains a mix of basic reading entities and other
        characters, e.g. spaces and punctuation marks.

        @type string: str
        @param string: reading string
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: list of str
        @return: a list of basic entities of the input string
        @raise DecompositionError: if the string cannot be decomposed.
        @raise UnsupportedError: if the given reading is not supported.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        return readingOp.decompose(string)

    def compose(self, readingEntities, readingN, **options):
        """
        Composes the given list of basic entities to a string for the given
        reading.

        @type readingEntities: list of str
        @param readingEntities: list of basic syllables or other content
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: str
        @return: composed entities
        @raise UnsupportedError: if the given reading is not supported.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        return readingOp.compose(readingEntities)
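Decomposition and composition are meant to be inverse operations: the list returned by a decompose step keeps non-reading characters, so composing it again yields the original string. The following is a minimal sketch of that round trip. It assumes a made-up reading in which an entity is a run of ASCII letters with an optional tone digit; `toy_decompose`, `toy_compose` and `ENTITY_RE` are hypothetical names and do not mirror cjklib's operators.

```python
import re

# Assumption for this sketch: an entity is a run of ASCII letters,
# optionally followed by one tone digit from 1 to 5.
ENTITY_RE = re.compile(r"([A-Za-z]+[1-5]?)")


def toy_decompose(string):
    """Split into reading entities and the characters between them."""
    # re.split with a capturing group keeps the matched entities in the
    # result; empty strings between adjacent matches are dropped.
    return [part for part in ENTITY_RE.split(string) if part]


def toy_compose(entities):
    """Join entities and other characters back into one string."""
    return "".join(entities)


parts = toy_decompose("ni3 hao3!")
# parts == ['ni3', ' ', 'hao3', '!']
restored = toy_compose(parts)
```

Because spaces and punctuation survive as separate list items, `toy_compose(toy_decompose(s))` returns `s` unchanged for any input under this toy definition.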

    def isReadingEntity(self, entity, readingN, **options):
        """
        Checks if the given string is an entity of the given reading.

        @type entity: str
        @param entity: entity to check
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: bool
        @return: true if string is an entity of the reading, false otherwise.
        @raise UnsupportedError: if the given reading is not supported.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        return readingOp.isReadingEntity(entity)

    #}
    #{ RomanisationOperator methods

    def getDecompositions(self, string, readingN, **options):
        """
        Decomposes the given string into basic entities that can be mapped to
        one Chinese character each for ambiguous decompositions. It returns all
        possible decompositions. This method is a more general version of
        L{decompose()}.

        The returned list consists of two entity types: entities of the
        romanisation and other strings.

        @type string: str
        @param string: reading string
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: list of list of str
        @return: a list of all possible decompositions consisting of basic
            entities.
        @raise DecompositionError: if the given string has a wrong format.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'getDecompositions'):
            raise UnsupportedError("method 'getDecompositions' not supported")
        return readingOp.getDecompositions(string)
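The method above, like the other romanisation-specific methods in this class, checks with `hasattr` whether the operator offers the method before delegating to it. A stand-alone sketch of that pattern follows. The two operator classes, the helper `call_optional` and the local `UnsupportedError` are hypothetical stand-ins written for this example, not cjklib classes.

```python
class UnsupportedError(Exception):
    """Local stand-in for the exception raised in the listing (assumption)."""


class PlainOperator(object):
    """Toy operator offering only the basic interface."""

    def decompose(self, string):
        return string.split()


class RomanisationLikeOperator(PlainOperator):
    """Toy operator that additionally offers getDecompositions()."""

    def getDecompositions(self, string):
        # The toy format is unambiguous, so there is exactly one result.
        return [self.decompose(string)]


def call_optional(operator, method_name, *args):
    """Delegate to an optional method, or raise if the operator lacks it."""
    if not hasattr(operator, method_name):
        raise UnsupportedError("method '%s' not supported" % method_name)
    return getattr(operator, method_name)(*args)


decompositions = call_optional(
    RomanisationLikeOperator(), "getDecompositions", "ni hao")
```

Checking for the attribute rather than the operator's class keeps the façade independent of the class hierarchy: any operator that provides the method can be used.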

    def segment(self, string, readingN, **options):
        """
        Takes a string written in the romanisation and returns the possible
        segmentations as a list of syllables.

        In contrast to L{decompose()} this method merely segments continuous
        entities of the romanisation. Characters not part of the romanisation
        will not be dealt with; this is the task of the more general decompose
        method.

        @type string: str
        @param string: reading string
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: list of list of str
        @return: a list of possible segmentations (several if ambiguous) into
            single syllables
        @raise DecompositionError: if the given string has an invalid format.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'segment'):
            raise UnsupportedError("method 'segment' not supported")
        return readingOp.segment(string)
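Why a segmentation can be ambiguous is easiest to see on a small example. The sketch below enumerates every way to split a string into syllables from a given set. It is an illustrative brute-force approach under the assumption of a three-syllable inventory; `toy_segment` and `SYLLABLES` are hypothetical names, and the library's own segmentation may work quite differently (for instance, it raises `DecompositionError` on invalid input, whereas this sketch returns an empty list).

```python
def toy_segment(string, syllables):
    """Return every split of `string` into members of `syllables`."""
    if not string:
        # Exactly one way to segment the empty string: no syllables.
        return [[]]
    results = []
    for end in range(1, len(string) + 1):
        head = string[:end]
        if head in syllables:
            # Try every segmentation of the remainder after this syllable.
            for tail in toy_segment(string[end:], syllables):
                results.append([head] + tail)
    return results


# Assumed miniature syllable inventory for this sketch.
SYLLABLES = {"xi", "an", "xian"}

segmentations = toy_segment("xian", SYLLABLES)
# segmentations == [['xi', 'an'], ['xian']]
```

The string "xian" can be read as one syllable or as "xi" followed by "an", which is why the return type is a list of lists.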

    def isStrictDecomposition(self, decomposition, readingN, **options):
        """
        Checks if the given decomposition follows the romanisation format
        strictly to allow unambiguous decomposition.

        The romanisation should offer a way/protocol to make an unambiguous
        decomposition into its basic syllables possible, so that the process
        of appending syllables to a string is reversible. The test for
        compliance with this protocol is implemented by the operator of the
        given reading. Thus, for any given string, this method can return true
        for one and only one possible decomposition.

        @type decomposition: list of str
        @param decomposition: decomposed reading string
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: bool
        @return: C{True} if the decomposition follows the romanisation format
            strictly, C{False} otherwise.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'isStrictDecomposition'):
            raise UnsupportedError(
                "method 'isStrictDecomposition' not supported")
        return readingOp.isStrictDecomposition(decomposition)

    def getReadingEntities(self, readingN, **options):
        """
        Gets a set of all entities supported by the reading.

        The set is used in the segmentation process to find entity boundaries.

        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: set of str
        @return: set of supported syllables
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'getReadingEntities'):
            raise UnsupportedError("method 'getReadingEntities' not supported")
        return readingOp.getReadingEntities()

    #}
    #{ TonalRomanisationOperator methods

    def getTones(self, readingN, **options):
        """
        Returns the tones supported by the reading.

        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: list
        @return: list of supported tone marks.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'getTones'):
            raise UnsupportedError("method 'getTones' not supported")
        return readingOp.getTones()

    def getTonalEntity(self, plainEntity, tone, readingN, **options):
        """
        Gets the entity with tone mark for the given plain entity and tone.

        @type plainEntity: str
        @param plainEntity: entity without tonal information
        @param tone: tone
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: str
        @return: entity with appropriate tone
        @raise InvalidEntityError: if the entity is invalid.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'getTonalEntity'):
            raise UnsupportedError("method 'getTonalEntity' not supported")
        return readingOp.getTonalEntity(plainEntity, tone)

    def splitEntityTone(self, entity, readingN, **options):
        """
        Splits the entity into an entity without tone mark (plain entity) and
        the entity's tone.

        @type entity: str
        @param entity: entity with tonal information
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: tuple
        @return: plain entity without tone mark and entity's tone
        @raise InvalidEntityError: if the entity is invalid.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'splitEntityTone'):
            raise UnsupportedError("method 'splitEntityTone' not supported")
        return readingOp.splitEntityTone(entity)
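Splitting an entity into plain entity and tone, and joining the two again, are inverse operations. A minimal sketch for a numbered-tone notation such as "hao3" follows. It assumes tones are written as a trailing digit from 1 to 5 and that every entity carries one; the functions `toy_split_entity_tone` and `toy_get_tonal_entity`, the constant `TONES` and the local `InvalidEntityError` are hypothetical and not taken from cjklib.

```python
import re


class InvalidEntityError(Exception):
    """Local stand-in for the exception named in the docstring (assumption)."""


# Assumption for this sketch: tones are the digits 1 to 5.
TONES = [1, 2, 3, 4, 5]
TONAL_ENTITY_RE = re.compile(r"^([A-Za-z]+)([1-5])$")


def toy_split_entity_tone(entity):
    """Return (plain entity, tone) for an entity such as 'hao3'."""
    match = TONAL_ENTITY_RE.match(entity)
    if match is None:
        raise InvalidEntityError("invalid entity %r" % entity)
    return match.group(1), int(match.group(2))


def toy_get_tonal_entity(plain_entity, tone):
    """Return the entity with its tone digit appended."""
    if tone not in TONES:
        raise InvalidEntityError("invalid tone %r" % tone)
    return "%s%d" % (plain_entity, tone)


plain, tone = toy_split_entity_tone("hao3")
rebuilt = toy_get_tonal_entity(plain, tone)
```

Readings that mark tones with diacritics need a more involved mapping, but the contract stays the same: joining the two parts of a split entity gives back the entity.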

    def getPlainReadingEntities(self, readingN, **options):
        """
        Gets the set of plain entities supported by this reading. Unlike
        L{getReadingEntities()}, the entities carry no tone mark.

        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: set of str
        @return: set of supported syllables
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'getPlainReadingEntities'):
            raise UnsupportedError(
                "method 'getPlainReadingEntities' not supported")
        return readingOp.getPlainReadingEntities()

    def isPlainReadingEntity(self, entity, readingN, **options):
        """
        Returns true if the given plain entity (without any tone mark) is
        recognised by the romanisation operator, i.e. it is a valid entity of
        the reading returned by the segmentation method.

        Reading entities will be handled as being case insensitive.

        @type entity: str
        @param entity: entity to check
        @type readingN: str
        @param readingN: name of reading
        @param options: additional options for handling the input
        @rtype: bool
        @return: C{True} if string is an entity of the reading, C{False}
            otherwise.
        @raise UnsupportedError: if the given reading is not supported or the
            reading doesn't support the specified method.
        """
        readingOp = self._getReadingOperatorInstance(readingN, **options)
        if not hasattr(readingOp, 'isPlainReadingEntity'):
            raise UnsupportedError(
                "method 'isPlainReadingEntity' not supported")
        return readingOp.isPlainReadingEntity(entity)
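The case-insensitive handling mentioned in the docstring above can be sketched as a membership test on a lower-cased copy of the input. The entity set `PLAIN_ENTITIES` and the function `toy_is_plain_reading_entity` are hypothetical and serve only to illustrate the idea; they are not taken from cjklib.

```python
# Assumed miniature inventory of plain (tone-less) entities.
PLAIN_ENTITIES = frozenset(["ni", "hao", "xi", "an", "xian"])


def toy_is_plain_reading_entity(entity):
    """Check membership while ignoring the case of the input."""
    return entity.lower() in PLAIN_ENTITIES


checks = [toy_is_plain_reading_entity(e) for e in ("hao", "Hao", "hao3")]
# checks == [True, True, False]
```

An entity that still carries a tone digit, such as "hao3", is rejected because the inventory holds plain entities only.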