Class PinyinOperator

Provides an operator for the Mandarin romanisation Hanyu Pinyin. It can be configured to cope with different representations ("dialects") of Pinyin. For conversion between different representations the PinyinDialectConverter can be used.

Features:

tones marked by either diacritics or numbers,
alternative representation of ü-character,
correct placement of apostrophes,
guessing of input form (reading dialect),
support for Erhua and
splitting of syllables into onset and rhyme.

Apostrophes

Pinyin syllables need to be separated by an apostrophe where their decomposition would otherwise be ambiguous. A well-known example is the city Xi'an, which written as xian would be read as a single syllable, meaning e.g. 'fresh'. Another example is Chang'an, which could be read as chan'gan if no delimiter were used in at least one of the two cases.
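The ambiguity can be made concrete with a small, self-contained sketch. It is not part of cjklib: the function name segmentations and the toy syllable inventory are assumptions made purely for illustration, and a real inventory is far larger.

```python
# Toy syllable inventory; an assumption for illustration, not cjklib's table.
SYLLABLES = {"xi", "an", "xian", "chang", "chan", "gan"}


def segmentations(text, syllables=SYLLABLES):
    """Return every way to split text into known syllables."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


print(segmentations("xian"))     # [['xi', 'an'], ['xian']]
print(segmentations("changan"))  # [['chan', 'gan'], ['chang', 'an']]
```

Both strings have two valid segmentations when written without a delimiter, which is exactly the situation the apostrophe resolves.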

Different rules exist for where to place apostrophes. A simple yet sufficient rule is implemented in aeoApostropheRule(), which this class uses by default: syllables starting with one of the three vowels a, e, o are separated from the preceding syllable. Remember that the vowels [i], [u], [y] are written yi, wu, yu respectively at the start of a syllable, which makes those syllable boundaries clear. compose() places apostrophes where required when composing the reading string.
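The a/e/o rule described above can be sketched in a few lines of plain Python. This is an illustration only, not the code of aeoApostropheRule() or compose(); the names aeo_rule and compose_sketch are hypothetical, and the sketch assumes the input is a list of syllables with no other entities in between.

```python
import unicodedata


def aeo_rule(preceding, following):
    """Return True if an apostrophe belongs between the two syllables."""
    if not preceding or not following:
        return False
    # Decompose so that a tone-marked vowel such as "ā" still starts with "a".
    first = unicodedata.normalize("NFD", following)[0].lower()
    return first in ("a", "e", "o")


def compose_sketch(syllables, rule=aeo_rule):
    """Join syllables, inserting apostrophes wherever the rule asks for one."""
    parts = []
    for index, syllable in enumerate(syllables):
        if index > 0 and rule(syllables[index - 1], syllable):
            parts.append("'")
        parts.append(syllable)
    return "".join(parts)


print(compose_sketch(["xi", "an"]))     # xi'an
print(compose_sketch(["xian"]))         # xian
print(compose_sketch(["chang", "an"]))  # chang'an
print(compose_sketch(["chan", "gan"]))  # changan
```

Because only syllables beginning with a, e or o receive an apostrophe, the unmarked string changan can only stand for chan + gan under this rule.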

An alternative rule can be passed to the constructor as a function via the option PinyinApostropheFunction. One possible function is a rule that separates all syllables by an apostrophe, which simplifies reading for beginners.

When decomposing strings it is important to determine which of the possibly several segmentations is the one actually meant. For example, xian as given above should always be segmented as one syllable; xi'an is not an option in this case. An alternative to aeoApostropheRule() should therefore guarantee proper decomposition, which is tested by isStrictDecomposition().
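A strictness check of this kind might look as follows for the a/e/o rule. This is a hedged sketch, not the implementation of isStrictDecomposition(): the function name is hypothetical, and it assumes that a decomposition is a list in which apostrophes appear as separate "'" entries between syllables.

```python
APOSTROPHE = "'"


def starts_with_aeo(entity):
    """True if the entity begins with one of the vowels a, e, o."""
    return entity[:1].lower() in ("a", "e", "o")


def follows_aeo_rule_strictly(entities):
    """Check that every a/e/o-initial syllable that directly follows
    another syllable is preceded by an apostrophe entry."""
    for previous, current in zip(entities, entities[1:]):
        if APOSTROPHE in (previous, current):
            continue  # an apostrophe already marks this boundary
        if previous[-1:].isalnum() and starts_with_aeo(current):
            return False
    return True


print(follows_aeo_rule_strictly(["xi", "'", "an"]))  # True
print(follows_aeo_rule_strictly(["xi", "an"]))       # False
print(follows_aeo_rule_strictly(["xian"]))           # True
```

The second call fails because xi followed directly by an would be composed as xian, which must be read as a single syllable.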

Finally, compose(decompose(string)) is only the identity if apostrophes are placed according to the rule, because wrongly placed apostrophes are kept when composing. Use removeApostrophes() to remove separating apostrophes.

Example

>>> def noToneApostropheRule(precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'Numbers'},
...     targetOptions={'toneMarkType': 'Numbers',
...         'missingToneMark': 'fifth',
...         'PinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"

R-colouring

The phenomenon Erhua (兒化音/儿化音, Erhua yin), i.e. the r-colouring of syllables, is found in the northern Chinese dialects and results from merging the formerly independent sound er with the preceding syllable. In written form a word is followed by the character 兒/儿, e.g. 頭兒/头儿.

In Pinyin the Erhua sound is quite often expressed by appending a single r to the syllable of the character preceding 兒/儿, e.g. tóur for 頭兒/头儿, to stress the monosyllabic nature and in contrast to words like 兒子/儿子 ér'zi where 兒/儿 ér constitutes a single syllable.

When decomposing syllables in Pinyin it is thus important to decide whether the r marking r-colouring should be an entity of its own, mirroring the character string where it has its own character, or part of the syllable of the preceding character, stressing the monosyllabic nature. This can be configured at instantiation.
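The two ways of treating the final r can be illustrated with a small standalone sketch. It does not use cjklib: the function split_erhua and its separate_r parameter are hypothetical names chosen for this example, and the sketch only handles toneless or diacritic-marked syllables, not tone numbers.

```python
import unicodedata


def strip_marks(text):
    """Remove combining diacritics so that 'ér' compares equal to 'er'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def split_erhua(syllable, separate_r=True):
    """Return the syllable as one entity, or as base syllable plus 'r'."""
    plain = strip_marks(syllable).lower()
    # 'er' is a full syllable of its own, not an r-coloured form.
    is_r_coloured = len(plain) > 1 and plain.endswith("r") and plain != "er"
    if separate_r and is_r_coloured:
        return [syllable[:-1], syllable[-1]]
    return [syllable]


print(split_erhua("tóur"))                    # ['tóu', 'r']
print(split_erhua("tóur", separate_r=False))  # ['tóur']
print(split_erhua("ér"))                      # ['ér']
```

With separate_r=True the result mirrors the two characters 頭兒/头儿; with separate_r=False it mirrors the single spoken syllable.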

Source

Yǐn Bīnyōng (尹斌庸), Mary Felley (傅曼丽): Chinese romanization: Pronunciation and Orthography (汉语拼音和正词法). Sinolingua, Beijing, 1990, ISBN 7-80052-148-6, ISBN 0-8351-1930-0.

To Do (Impl):

ISO 7098 asks for conversion of 。、·「」 to .,-«». What about ，？《》：－? Implement a method for conversion to be optionally used.
Strict testing of tone mark placement. Currently it doesn't matter where tones are placed. All combinations are recognised.
Special marker for neutral tone: 'mȧ' (u'mȧ', reported by Ching-song Gene Hsiao: A Manual of Transcription Systems For Chinese, 中文拼音手册. Far Eastern Publications, Yale University, New Haven, Connecticut, 1985, ISBN 0-88710-141-0.), and '·ma' (u'\xb7ma', check!: 现代汉语词典（第5版）[Xiàndài Hànyǔ Cídiǎn 5. Edition]. 商务印书馆 [Shāngwù Yìnshūguǎn], Beijing, 2005, ISBN 7-100-04385-9.)

Instance Methods

__init__(self, **options)
Creates an instance of the PinyinOperator.

getTones(self)
Returns the tones supported by the reading.
Returns: list

compose(self, readingEntities)
Composes the given list of basic entities to a string.
Returns: str

removeApostrophes(self, readingEntities)
Removes apostrophes between two syllables for a given decomposition.
Returns: list of str

aeoApostropheRule(self, precedingEntity, followingEntity)
Checks if the given entities need to be separated by an apostrophe.
Returns: bool

isStrictDecomposition(self, readingEntities)
Checks if the given decomposition follows the Pinyin format strictly for unambiguous decomposition: syllables have to be preceded by an apostrophe if the decomposition would be ambiguous otherwise.
Returns: bool

getTonalEntity(self, plainEntity, tone)
Gets the entity with tone mark for the given plain entity and tone.
Returns: str

_placeNucleusToneMark(self, nucleus, tone)
Places a tone mark on the given syllable nucleus according to the rules of the Pinyin standard.
Returns: str

splitEntityTone(self, entity)
Splits the entity into an entity without tone mark and the entity's tone index.
Returns: tuple

getPlainReadingEntities(self)
Gets the plain entities supported by this reading.
Returns: set of str

getReadingEntities(self)
Gets all entities supported by the reading.
Returns: list of str

getOnsetRhyme(self, plainSyllable)
Splits the given plain syllable into onset (initial) and rhyme (final).
Returns: tuple of str

Inherited from TonalRomanisationOperator: isPlainReadingEntity, isReadingEntity

Inherited from RomanisationOperator: decompose, getDecompositionTree, getDecompositions, segment

Inherited from RomanisationOperator (private): _hasMergeableSyllables, _hasSyllableSubstring, _recursiveSegmentation

Inherited from ReadingOperator: getOption

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__
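As an illustration of the onset/rhyme split that getOnsetRhyme() is described as performing, the following standalone sketch treats every leading non-vowel letter as the onset and the remainder as the rhyme. The pattern and the name onset_rhyme are assumptions made for this example; they are not the library's PINYIN_SOUND_REGEX, and treating y and w as onsets is a simplification.

```python
import re

# Leading letters that are not Pinyin vowels form the onset; the rest is the rhyme.
ONSET_RHYME = re.compile("^([^aeiou\u00fc]*)(.*)$", re.IGNORECASE)


def onset_rhyme(plain_syllable):
    """Split a toneless syllable into (onset, rhyme)."""
    match = ONSET_RHYME.match(plain_syllable)
    return match.group(1), match.group(2)


print(onset_rhyme("zhuang"))  # ('zh', 'uang')
print(onset_rhyme("xian"))    # ('x', 'ian')
print(onset_rhyme("an"))      # ('', 'an')
```

A syllable without an initial consonant, such as an, yields an empty onset.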

Class Methods

getDefaultOptions(cls)
Returns the reading operator's default options.
Returns: dict

guessReadingDialect(cls, string, includeToneless=False)
Takes a string written in Pinyin and guesses the reading dialect.
Returns: dict

Static Methods

_getDiacriticVowels()
Gets a list of Pinyin vowels with diacritical marks for tones.
Returns: list of str

Inherited from RomanisationOperator (private): _crossProduct, _treeToList

Class Variables

READING_NAME = 'Pinyin'
Unique name of reading

TONEMARK_VOWELS = [u'a', u'e', u'i', u'o', u'u', u'ü', u'n', u...
List of characters of the nucleus possibly carrying the tone mark.

TONEMARK_MAP = {u'̀': 4, u'́': 2, u'̄': 1, u'̌': 3}
Mapping of Combining Diacritical Marks to their Pinyin tone index.

PINYIN_SOUND_REGEX = re.compile(r'(?i)^([^aeiuo\xfc]*)([aeiuo\...
Regular Expression matching onset, nucleus and coda.

toneMarkRegex = re.compile(r'[\u0301\u0300\u0304\u030c]')
Regular Expression matching the Pinyin tone marks.

tonemarkMapReverse = {1: u'̄', 2: u'́', 3: u'̌', 4: u'̀'}

Inherited from RomanisationOperator: readingEntityRegex
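The two tone mark mappings above lend themselves to a short demonstration with Python's unicodedata module. The sketch below is not the library's code: split_tone and place_tone are hypothetical names, a missing tone mark is reported as None rather than mapped to a fifth tone, and the placement follows the commonly taught rule (a or e first, the o in ou, otherwise the last vowel).

```python
import unicodedata

# Combining diacritic -> tone index, as in TONEMARK_MAP above.
TONEMARK_MAP = {"\u0300": 4, "\u0301": 2, "\u0304": 1, "\u030c": 3}
TONEMARK_MAP_REVERSE = {tone: mark for mark, tone in TONEMARK_MAP.items()}


def split_tone(entity):
    """Return (plain entity, tone index); tone is None if no mark is found."""
    tone = None
    plain = []
    for char in unicodedata.normalize("NFD", entity):
        if char in TONEMARK_MAP:
            tone = TONEMARK_MAP[char]
        else:
            plain.append(char)
    # Recompose so that u + diaeresis becomes a single ü again.
    return unicodedata.normalize("NFC", "".join(plain)), tone


def place_tone(plain, tone):
    """Attach the diacritic for tones 1-4; other tones leave the entity as is."""
    if tone not in TONEMARK_MAP_REVERSE:
        return plain
    plain = unicodedata.normalize("NFC", plain)
    lowered = plain.lower()
    if "a" in lowered:
        index = lowered.index("a")
    elif "e" in lowered:
        index = lowered.index("e")
    elif "ou" in lowered:
        index = lowered.index("ou")
    else:
        positions = [i for i, c in enumerate(lowered) if c in "iou\u00fc"]
        if not positions:
            return plain
        index = positions[-1]
    marked = plain[:index + 1] + TONEMARK_MAP_REVERSE[tone] + plain[index + 1:]
    return unicodedata.normalize("NFC", marked)


print(split_tone("mā"))        # ('ma', 1)
print(place_tone("xiong", 2))  # xióng
print(place_tone("gou", 3))    # gǒu
```

Working on the NFD form means that a vowel carrying both a diaeresis and a tone mark, such as ǚ, is handled without a separate lookup table.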

Properties

Inherited from object: __class__
