Package cjklib :: Package reading :: Module operator :: Class PinyinOperator

Class PinyinOperator

Provides an operator for the Mandarin romanisation Hanyu Pinyin. It can be configured to cope with different representations ("dialects") of Pinyin. For conversion between different representations the PinyinDialectConverter can be used.

Features:

Apostrophes

Pinyin syllables must be separated by an apostrophe wherever their decomposition would otherwise be ambiguous. A well-known example is the city Xi'an, which written as xian would be read as one syllable, meaning e.g. 'fresh'. Another example is Chang'an, which without a delimiter could be read as chan'gan.

Different rules exist for where to place apostrophes. A simple yet sufficient rule is implemented in aeoApostropheRule(), which is used as the default in this class: syllables starting with one of the three vowels a, e, o are separated from a preceding syllable. Remember that the vowels [i], [u], [y] are represented as yi, wu, yu respectively, which keeps syllable boundaries clear. compose() places apostrophes where required when composing the reading string.
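As a minimal sketch of the a/e/o rule described above, the following standalone functions join plain (toneless) syllables and insert an apostrophe before a, e or o. The names aeo_rule and compose_plain are illustrative and not part of cjklib; the sketch also leaves out the extra cases (n, ng, e'r) that the library's aeoApostropheRule() is documented to handle.

```python
def aeo_rule(preceding, following):
    """Simplified rule: separate when the following syllable starts with a, e or o."""
    return bool(preceding) and following[:1].lower() in ("a", "e", "o")


def compose_plain(syllables, apostrophe="'"):
    """Join plain syllables, inserting an apostrophe where the rule asks for one."""
    parts = []
    for syllable in syllables:
        if parts and aeo_rule(parts[-1], syllable):
            parts.append(apostrophe)
        parts.append(syllable)
    return "".join(parts)


print(compose_plain(["xi", "an"]))      # xi'an
print(compose_plain(["chang", "an"]))   # chang'an
print(compose_plain(["bei", "jing"]))   # beijing
```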

An alternative rule can be passed to the constructor as a function through the option PinyinApostropheFunction. One possible function is a rule that separates all syllables by an apostrophe, which simplifies reading for beginners.

When decomposing strings it is important to determine which of the possibly several segmentations is the one actually meant. E.g. the string xian given above should always be segmented into one syllable; the solution xi'an is not an option in this case. An alternative to aeoApostropheRule() should therefore guarantee proper decomposition, which is tested through isStrictDecomposition().
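The ambiguity can be illustrated with a small standalone sketch. It enumerates every segmentation of a string over a toy syllable set and keeps only those in which no syllable after the first starts with a, e or o, since such a syllable would have needed an apostrophe in the input. The function names and the toy syllable set are assumptions made for this example; cjklib works from its full syllable table and the configured apostrophe function.

```python
def decompositions(text, syllables):
    """Return every way to split text into syllables from the given set."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            for tail in decompositions(text[end:], syllables):
                results.append([head] + tail)
    return results


def is_strict(entities):
    """True if no syllable after the first starts with a, e or o."""
    return all(entity[0] not in "aeo" for entity in entities[1:])


# Toy syllable set, just large enough for the two examples from the text.
TOY_SYLLABLES = {"xi", "an", "xian", "chan", "chang", "gan"}

for text in ("xian", "changan"):
    candidates = decompositions(text, TOY_SYLLABLES)
    strict = [c for c in candidates if is_strict(c)]
    print(text, candidates, strict)
```

For xian only the one-syllable reading survives the filter, and for changan only chan + gan does, which matches the readings described above.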

Finally, compose(decompose(string)) is only the identity if apostrophes are applied properly according to the rule, since wrongly placed apostrophes are kept when composing. Use removeApostrophes() to remove separating apostrophes.

Example

>>> def noToneApostropheRule(precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'Numbers'},
...     targetOptions={'toneMarkType': 'Numbers',
...         'missingToneMark': 'fifth',
...         'PinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"

R-colouring

The phenomenon Erhua (兒化音/儿化音, Erhua yin), i.e. the r-colouring of syllables, is found in the northern Chinese dialects and results from merging the formerly independent sound er with the preceding syllable. In written form a word is followed by the character 兒/儿, e.g. 頭兒/头儿.

In Pinyin the Erhua sound is quite often expressed by appending a single r to the syllable of the character preceding 兒/儿, e.g. tóur for 頭兒/头儿, to stress the monosyllabic nature and in contrast to words like 兒子/儿子 ér'zi where 兒/儿 ér constitutes a single syllable.

For decomposing syllables in Pinyin it is thus important to decide whether the r marking r-colouring should be an entity of its own, reflecting its separate character in the character string, or part of the preceding character's syllable, stressing the monosyllabic nature. This can be configured at instantiation.
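The two views can be sketched as follows. The mode names mirror the values documented for the Erhua option ('twoSyllables', 'oneSyllable', 'ignore'), but segment_erhua itself is a hypothetical helper written for this example and does not reproduce the library's segmentation logic.

```python
def segment_erhua(syllable, erhua="twoSyllables"):
    """Segment one possibly r-coloured syllable according to the given mode."""
    # 'er' is a regular syllable and 'r' is already the bare suffix.
    is_r_coloured = syllable.endswith("r") and syllable not in ("er", "r")
    if not is_r_coloured:
        return [syllable]
    if erhua == "twoSyllables":
        return [syllable[:-1], "r"]
    if erhua == "oneSyllable":
        return [syllable]
    # Any other mode (e.g. 'ignore') offers no Erhua support in this sketch.
    raise ValueError("no Erhua support for %r" % syllable)


print(segment_erhua("tour", "twoSyllables"))  # ['tou', 'r']
print(segment_erhua("tour", "oneSyllable"))   # ['tour']
print(segment_erhua("er", "twoSyllables"))    # ['er']
```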

Instance Methods

__init__(self, **options)
Creates an instance of the PinyinOperator.

list
getTones(self)
Returns a set of tones supported by the reading.

str
compose(self, readingEntities)
Composes the given list of basic entities to a string.

list of str
removeApostrophes(self, readingEntities)
Removes apostrophes between two syllables for a given decomposition.

bool
aeoApostropheRule(self, precedingEntity, followingEntity)
Checks if the given entities need to be separated by an apostrophe.

bool
isStrictDecomposition(self, readingEntities)
Checks if the given decomposition follows the Pinyin format strictly for unambiguous decomposition: syllables have to be preceded by an apostrophe if the decomposition would be ambiguous otherwise.

str
getTonalEntity(self, plainEntity, tone)
Gets the entity with tone mark for the given plain entity and tone.

str
_placeNucleusToneMark(self, nucleus, tone)
Places a tone mark on the given syllable nucleus according to the rules of the Pinyin standard.

tuple
splitEntityTone(self, entity)
Splits the entity into an entity without tone mark and the entity's tone index.

set of str
getPlainReadingEntities(self)
Gets the list of plain entities supported by this reading.

list of str
getReadingEntities(self)
Gets a set of all entities supported by the reading.

tuple of str
getOnsetRhyme(self, plainSyllable)
Splits the given plain syllable into onset (initial) and rhyme (final).

Inherited from TonalRomanisationOperator: isPlainReadingEntity, isReadingEntity

Inherited from RomanisationOperator: decompose, getDecompositionTree, getDecompositions, segment

Inherited from ReadingOperator: getOption

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Methods

dict
getDefaultOptions(cls)
Returns the reading operator's default options.

dict
guessReadingDialect(cls, string, includeToneless=False)
Takes a string written in Pinyin and guesses the reading dialect.

Static Methods

list of str
_getDiacriticVowels()
Gets a list of Pinyin vowels with diacritical marks for tones.

Inherited from RomanisationOperator (private): _crossProduct, _treeToList

Class Variables
  READING_NAME = 'Pinyin'
Unique name of reading
  TONEMARK_VOWELS = [u'a', u'e', u'i', u'o', u'u', u'ü', u'n', u...
List of characters of the nucleus possibly carrying the tone mark.
  TONEMARK_MAP = {u'̀': 4, u'́': 2, u'̄': 1, u'̌': 3}
Mapping of Combining Diacritical Marks to their Pinyin tone index.
  PINYIN_SOUND_REGEX = re.compile(r'(?i)^([^aeiuo\xfc]*)([aeiuo\...
Regular Expression matching onset, nucleus and coda.
  toneMarkRegex = re.compile(r'[\u0301\u0300\u0304\u030c]')
Regular Expression matching the Pinyin tone marks.
  tonemarkMapReverse = {1: u'̄', 2: u'́', 3: u'̌', 4: u'̀'}

Inherited from RomanisationOperator: readingEntityRegex

Properties

Inherited from object: __class__

Method Details

__init__(self, **options)
(Constructor)

Creates an instance of the PinyinOperator.

The instance can be configured through optional keyword arguments.

Parameters:
  • options - extra options
  • dbConnectInst - instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation - if True, segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which cannot be segmented into single reading entities. If False, the aforesaid string will be returned unsegmented.
  • toneMarkType - if set to 'Diacritics' tones will be marked using diacritic marks, if set to 'Numbers' appended numbers from 1 to 5 will be used to mark tones, if set to 'None' no tone marks will be used and no tonal information will be supplied at all.
  • missingToneMark - if set to 'fifth', no tone mark is set to indicate the fifth tone (qingsheng, e.g. 'wo3men' stands for 'wo3men5'); if set to 'noinfo', no tone information will be deduced when no tone mark is found (the tone takes on the value None); if set to 'ignore', this entity will not be valid, and for segmentation the behaviour defined by 'strictSegmentation' will take effect. This option is only valid for the tone mark type 'Numbers'.
  • yVowel - a character (or string) that is taken as alternative for ü which depicts (among others) the close front rounded vowel [y] (IPA) in Pinyin and includes an umlaut. Changes forms of syllables nü, nüe, lü, lüe. This option is not valid for the tone mark type 'Diacritics'.
  • PinyinApostrophe - an alternate apostrophe that is taken instead of the default one.
  • PinyinApostropheFunction - a function that indicates when a syllable combination needs to be split by an apostrophe, see aeoApostropheRule() for the default implementation.
  • Erhua - if set to 'ignore', no special support will be provided for retroflex -r at syllable end (Erhua), i.e. zher will raise an exception. If set to 'twoSyllables', syllables with an appended r are given/will be segmented into two syllables, the -r suffix making up one syllable itself as 'r'. If set to 'oneSyllable', syllables with an appended r are given/will be segmented into one syllable only.
Overrides: object.__init__
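To make the three documented values of missingToneMark concrete, here is a standalone sketch for the tone mark type 'Numbers'. The helper split_numbered_tone is hypothetical, and its default value is chosen arbitrarily for the example rather than taken from the library.

```python
def split_numbered_tone(entity, missing_tone_mark="noinfo"):
    """Split e.g. 'wo3' into ('wo', 3), handling entities without a tone digit."""
    if entity and entity[-1] in "12345":
        return entity[:-1], int(entity[-1])
    if missing_tone_mark == "fifth":
        # A missing digit stands for the fifth (neutral) tone.
        return entity, 5
    if missing_tone_mark == "noinfo":
        # Nothing is deduced; the tone is reported as None.
        return entity, None
    # 'ignore': an entity without a tone digit is not valid.
    raise ValueError("entity %r carries no tone mark" % entity)


print(split_numbered_tone("wo3"))            # ('wo', 3)
print(split_numbered_tone("men", "fifth"))   # ('men', 5)
print(split_numbered_tone("men", "noinfo"))  # ('men', None)
```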

getDefaultOptions(cls)
Class Method

Returns the reading operator's default options.

The default implementation returns an empty dictionary. The keyword 'dbConnectInst' is not regarded as a configuration option of the operator and is thus not included in the dict returned.

Returns: dict
the reading operator's default options.
Overrides: ReadingOperator.getDefaultOptions
(inherited documentation)

_getDiacriticVowels()
Static Method

Gets a list of Pinyin vowels with diacritical marks for tones.

The alternative for vowel ü does not need diacritical forms as the standard form doesn't allow changing the vowel.

Returns: list of str
list of Pinyin vowels with diacritical marks

guessReadingDialect(cls, string, includeToneless=False)
Class Method

Takes a string written in Pinyin and guesses the reading dialect.

The basic options 'toneMarkType', 'yVowel' and 'Erhua' are guessed. Unless 'includeToneless' is set to True only the tone mark types 'Diacritics' and 'Numbers' are considered as the latter one can also represent the state of missing tones. Strings tested for 'yVowel' are ü, v and u:. 'Erhua' is set to 'twoSyllables' by default and only tested when 'toneMarkType' is assumed to be set to 'Numbers'.

Parameters:
  • string (str) - Pinyin string
Returns: dict
dictionary of basic keyword settings
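The idea of guessing options from surface features can be sketched with a deliberately naive heuristic. This is not the library's algorithm: guess_dialect is a hypothetical function, it ignores 'Erhua' and 'includeToneless', and falling back to ü when no alternative spelling occurs is an assumption of the example.

```python
import unicodedata

# Combining grave, acute, macron and caron, as listed in TONEMARK_MAP.
TONE_MARKS = "\u0300\u0301\u0304\u030c"


def guess_dialect(text):
    """Naively guess 'toneMarkType' and 'yVowel' for a Pinyin string."""
    decomposed = unicodedata.normalize("NFD", text)
    if any(ch in TONE_MARKS for ch in decomposed):
        tone_mark_type = "Diacritics"
    else:
        # 'Numbers' can also represent a string without any tone marks.
        tone_mark_type = "Numbers"

    lowered = unicodedata.normalize("NFC", text).lower()
    if "u:" in lowered:
        y_vowel = "u:"
    elif "v" in lowered:
        y_vowel = "v"
    else:
        y_vowel = "\u00fc"  # ü; assumed fallback when no variant is seen
    return {"toneMarkType": tone_mark_type, "yVowel": y_vowel}


print(guess_dialect("nu:3ren2"))
print(guess_dialect("lv4se4"))
```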

getTones(self)

Returns a set of tones supported by the reading. These tones don't necessarily reflect the tones of the underlying language but may differ to reflect notational or other features.

The default implementation will raise a NotImplementedError.

Returns: list
list of supported tone marks.
Overrides: TonalFixedEntityOperator.getTones
(inherited documentation)

compose(self, readingEntities)

Composes the given list of basic entities to a string. Applies an apostrophe between syllables where needed, using aeoApostropheRule() by default.

Parameters:
  • readingEntities (list of str) - list of basic syllables or other content
Returns: str
composed entities
Overrides: ReadingOperator.compose

removeApostrophes(self, readingEntities)

Removes apostrophes between two syllables for a given decomposition.

Parameters:
  • readingEntities (list of str) - list of basic syllables or other content
Returns: list of str
the given entity list without separating apostrophes
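A standalone sketch of the behaviour described here, under the assumption that a decomposition is a flat list in which apostrophes appear as their own entries. The function remove_apostrophes is illustrative and only removes an apostrophe that sits directly between two entities starting with a letter.

```python
def remove_apostrophes(entities, apostrophe="'"):
    """Drop apostrophes that sit directly between two syllable-like entities."""
    result = []
    for index, entity in enumerate(entities):
        between_syllables = (
            entity == apostrophe
            and 0 < index < len(entities) - 1
            and entities[index - 1][:1].isalpha()
            and entities[index + 1][:1].isalpha()
        )
        if not between_syllables:
            result.append(entity)
    return result


print(remove_apostrophes(["xi", "'", "an"]))           # ['xi', 'an']
print(remove_apostrophes(["tian", "'", "an", "men"]))  # ['tian', 'an', 'men']
```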

aeoApostropheRule(self, precedingEntity, followingEntity)

Checks if the given entities need to be separated by an apostrophe.

Returns true for syllables starting with one of the three vowels a, e, o that have a preceding syllable. Additionally, the forms n and ng are separated from preceding syllables. Furthermore, the corner case e'r is handled to distinguish it from er.

This function serves as the default apostrophe rule.

Parameters:
  • precedingEntity (str) - the preceding syllable or any other content
  • followingEntity (str) - the following syllable or any other content
Returns: bool
true if the syllables need to be separated, false otherwise

isStrictDecomposition(self, readingEntities)

Checks if the given decomposition follows the Pinyin format strictly for unambiguous decomposition: syllables have to be preceded by an apostrophe if the decomposition would be ambiguous otherwise.

The function given as option 'PinyinApostropheFunction' is used to check whether an apostrophe should have been placed.

Parameters:
  • readingEntities (list of str) - decomposed reading string
Returns: bool
true if decomposition is strict, false otherwise
Overrides: RomanisationOperator.isStrictDecomposition

getTonalEntity(self, plainEntity, tone)

Gets the entity with tone mark for the given plain entity and tone.

The default implementation will raise a NotImplementedError.

Parameters:
  • plainEntity - entity without tonal information
  • tone - tone
Returns: str
entity with appropriate tone
Raises:
Overrides: TonalFixedEntityOperator.getTonalEntity
(inherited documentation)

_placeNucleusToneMark(self, nucleus, tone)

Places a tone mark on the given syllable nucleus according to the rules of the Pinyin standard.

Parameters:
  • nucleus (str) - syllable nucleus
  • tone (int) - tone index (starting with 1)
Returns: str
nucleus with appropriate tone

See Also: Pinyin.info - Where do the tone marks go?, http://www.pinyin.info/rules/where.html.
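The commonly stated placement rules (a or e takes the mark; in ou the o takes it; otherwise the last vowel does) can be sketched as below. The function place_tone_mark is a standalone illustration and not necessarily how _placeNucleusToneMark is implemented; the combining marks are the ones listed in tonemarkMapReverse, and leaving the nucleus unmarked for any other tone is an assumption.

```python
import unicodedata

# Tone index to combining mark: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def place_tone_mark(nucleus, tone):
    """Return the vowel nucleus with the tone mark on the appropriate vowel."""
    if not nucleus or tone not in TONE_MARKS:
        return nucleus  # e.g. fifth tone: assumed to carry no mark
    lowered = nucleus.lower()
    if "a" in lowered:
        index = lowered.index("a")
    elif "e" in lowered:
        index = lowered.index("e")
    elif "ou" in lowered:
        index = lowered.index("o")
    else:
        index = len(nucleus) - 1  # last vowel of the nucleus
    marked = nucleus[:index + 1] + TONE_MARKS[tone] + nucleus[index + 1:]
    return unicodedata.normalize("NFC", marked)


print(place_tone_mark("ao", 3))  # ǎo
print(place_tone_mark("ui", 4))  # uì
print(place_tone_mark("iu", 2))  # iú
print(place_tone_mark("ou", 1))  # ōu
```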

splitEntityTone(self, entity)

Splits the entity into an entity without tone mark and the entity's tone index.

The plain entity returned will always be in Unicode's Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).

Parameters:
  • entity (str) - entity with tonal information
Returns: tuple
plain entity without tone mark and entity's tone index (starting with 1)
Raises:
Overrides: TonalFixedEntityOperator.splitEntityTone
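For the tone mark type 'Diacritics', the split can be sketched with Unicode normalisation: decompose the entity, pull out the combining tone mark, and recompose the remainder to NFC. The mapping is copied from TONEMARK_MAP; the function split_entity_tone and its choice of reporting tone 5 for an unmarked entity are assumptions of this example, not the library's implementation.

```python
import unicodedata

# Combining mark to tone index, as given in TONEMARK_MAP.
TONEMARK_MAP = {"\u0300": 4, "\u0301": 2, "\u0304": 1, "\u030c": 3}


def split_entity_tone(entity):
    """Return (plain entity in NFC, tone index) for a diacritic-marked entity."""
    tone = 5  # assumed: no tone mark means the fifth (neutral) tone
    plain_chars = []
    for ch in unicodedata.normalize("NFD", entity):
        if ch in TONEMARK_MAP:
            tone = TONEMARK_MAP[ch]
        else:
            plain_chars.append(ch)
    plain = unicodedata.normalize("NFC", "".join(plain_chars))
    return plain, tone


print(split_entity_tone("h\u01ceo"))  # ('hao', 3)
print(split_entity_tone("n\u01da"))   # ('nü', 3)
print(split_entity_tone("ma"))        # ('ma', 5)
```

Note that the diaeresis of ü (U+0308) is not a tone mark and therefore survives the split.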

getPlainReadingEntities(self)

Gets the plain entities supported by this reading. Unlike getReadingEntities(), the entities carry no tone mark.

Depending on the type of Erhua support, either additional syllables with an ending -r are added, or a single r is included. The user-specified character for the vowel ü will be used.

Returns: set of str
set of supported syllables
Overrides: TonalFixedEntityOperator.getPlainReadingEntities

getReadingEntities(self)

Gets a set of all entities supported by the reading.

The list is used in the segmentation process to find entity boundaries.

Returns: list of str
list of supported syllables
Overrides: TonalFixedEntityOperator.getReadingEntities
(inherited documentation)

getOnsetRhyme(self, plainSyllable)

Splits the given plain syllable into onset (initial) and rhyme (final).

Pinyin cannot be separated cleanly into onset and rhyme within its own system. There are syllables with the same final written differently (e.g. wei and dui both end in a final that can be described by uei), as well as reduction of vowels (same example: dui, which is pronounced with the vowels uei). This method uses three forms not found as substrings in Pinyin (uei, uen and iou) and substitutes the (pseudo) initials w and y with their vowel equivalents.

Furthermore, the final i is distinguished in three forms, illustrated by the three examples yi, zhi and zi, to express the phonological difference.

Parameters:
  • plainSyllable (str) - syllable without tone marks
Returns: tuple of str
tuple of entity onset and rhyme
Raises:

Class Variable Details

TONEMARK_VOWELS

List of characters of the nucleus possibly carrying the tone mark. n is included for the standalone syllables n and ng. r is used for supporting Erhua in a two-syllable form.

Value:
[u'a', u'e', u'i', u'o', u'u', u'ü', u'n', u'm', u'r', u'ê']

TONEMARK_MAP

Mapping of Combining Diacritical Marks to their Pinyin tone index.

Value:
{u'̀': 4, u'́': 2, u'̄': 1, u'̌': 3}

PINYIN_SOUND_REGEX

Regular Expression matching onset, nucleus and coda. Syllables 'n', 'ng', 'r' (for Erhua) and 'ê' have to be handled separately.

Value:
re.compile(r'(?i)^([^aeiuo\xfc]*)([aeiuo\xfc]*)([^aeiuo\xfc]*)$')
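The following sketch applies the pattern shown above to a few plain syllables. The regular expression is copied from the value listed here; the wrapper split_sounds is a hypothetical helper, and the special syllables named above (n, ng, r, ê) are not treated separately in it.

```python
import re

PINYIN_SOUND_REGEX = re.compile(
    r'(?i)^([^aeiuo\xfc]*)([aeiuo\xfc]*)([^aeiuo\xfc]*)$')


def split_sounds(plain_syllable):
    """Split a plain syllable into (onset, nucleus, coda) using the regex."""
    match = PINYIN_SOUND_REGEX.match(plain_syllable)
    if match is None:
        # More than one vowel group, e.g. two syllables written together.
        raise ValueError("not a single syllable: %r" % plain_syllable)
    return match.groups()


print(split_sounds("zhuang"))  # ('zh', 'ua', 'ng')
print(split_sounds("an"))      # ('', 'a', 'n')
print(split_sounds("Xi"))      # ('X', 'i', '')
```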