Class PinyinOperator
Provides an operator for the Mandarin romanisation Hanyu Pinyin. It
can be configured to cope with different representations
("dialects") of Pinyin. For conversion between different
representations the PinyinDialectConverter can be used.
Features:
- tones marked by either diacritics or numbers,
- alternative representation of the ü-character,
- correct placement of apostrophes,
- guessing of input form (reading dialect),
- support for Erhua and
- splitting of syllables into onset and rhyme.
Apostrophes
Pinyin syllables need to be separated by an apostrophe whenever their
decomposition would otherwise be ambiguous. A famous example is the
city Xi'an, which if written xian would be read as one
syllable, meaning e.g. 'fresh'. Another example is
Chang'an, which without a delimiter could also be read as
chan'gan.
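The ambiguity can be made concrete with a small sketch that lists every way
a string can be cut into known syllables. It is written for Python 3 and is
independent of cjklib; the function name and the tiny hand-picked syllable
table are assumptions made for illustration only.

```python
def decompositions(text, syllables):
    """Return every way to split text into entries of the syllable table."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and recurse on the remainder.
            for rest in decompositions(text[end:], syllables):
                results.append([head] + rest)
    return results


# Hypothetical, deliberately incomplete syllable table.
SYLLABLES = {'xi', 'an', 'xian', 'chan', 'chang', 'gan'}

print(decompositions('xian', SYLLABLES))
print(decompositions('changan', SYLLABLES))
```

Both strings yield two candidate decompositions, which is exactly the case
an apostrophe is meant to resolve.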
Different rules exist on where to place apostrophes. A simple yet
sufficient rule is implemented in aeoApostropheRule(), which is used as
the default in this class: syllables starting with one of the three
vowels a, e, o will be separated. Remember that the vowels [i], [u],
[y] are represented as yi, wu, yu respectively,
thus making syllable boundaries clear. compose() will place apostrophes
where required when composing the reading string.
An alternative rule can be passed to the constructor as a function via
the option PinyinApostropheFunction. One possible function is a rule
separating all syllables by an apostrophe, thus simplifying the reading
process for beginners.
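A rule that separates all syllables might look like the following sketch.
The two-argument signature mirrors the rule shown in the Example of this
section; the function names, the test for "looks like a syllable" (starts
with a letter) and the small composing helper are assumptions for
illustration and are not part of cjklib.

```python
def alwaysApostropheRule(precedingEntity, followingEntity):
    """Hypothetical rule: separate every pair of adjacent syllables."""
    return bool(precedingEntity) and bool(followingEntity) \
        and precedingEntity[0].isalpha() \
        and followingEntity[0].isalpha()


def composeWithRule(entities, rule, apostrophe="'"):
    """Join entities, inserting an apostrophe wherever the rule asks for one."""
    parts = []
    preceding = None
    for entity in entities:
        if preceding is not None and rule(preceding, entity):
            parts.append(apostrophe)
        parts.append(entity)
        preceding = entity
    return ''.join(parts)


print(composeWithRule(['wo', 'men', 'hao'], alwaysApostropheRule))
```

Because every syllable boundary is marked, such a rule cannot produce an
ambiguous string, at the cost of a denser notation.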
On decomposition of strings it is important to check which of the
possibly several choices is the one actually meant. E.g. the string
xian given above should always be segmented into one syllable;
the solution xi'an is not an option in this case. Therefore an
alternative to aeoApostropheRule() should make sure it guarantees
proper decomposition, which is tested through isStrictDecomposition().
Last but not least, compose(decompose(string)) will only
be the identity if apostrophes are applied properly according to the
rule, as wrongly placed apostrophes will be kept when composing. Use
removeApostrophes() to remove separating apostrophes.
Example
>>> def noToneApostropheRule(precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'Numbers'},
...     targetOptions={'toneMarkType': 'Numbers',
...         'missingToneMark': 'fifth',
...         'PinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"
R-colouring
The phenomenon Erhua (兒化音/儿化音, Erhua yin), i.e. the r-colouring of
syllables, is found in the northern Chinese dialects and results from
merging the formerly independent sound er with the preceding
syllable. In written form a word is followed by the character 兒/儿, e.g.
頭兒/头儿.
In Pinyin the Erhua sound is quite often expressed by appending a
single r to the syllable of the character preceding 兒/儿, e.g.
tóur for 頭兒/头儿, to stress the monosyllabic nature and in
contrast to words like 兒子/儿子 ér'zi where 兒/儿 ér
constitutes a single syllable.
For decomposing syllables in Pinyin it is thus important to decide
whether the r marking r-colouring should be an entity of its own,
stressing its representation by a separate character in the character
string, or rather part of the syllable of the preceding character,
stressing the monosyllabic nature. This can be configured at
instantiation time.
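The two views can be sketched for a single toneless syllable as follows.
The setting names follow the 'Erhua' option described for the constructor;
the function itself and its simple "ends in r" test are assumptions for
illustration, not the library's implementation.

```python
def splitErhua(syllable, erhua):
    """Split a toneless syllable according to a hypothetical Erhua setting."""
    # 'er' and a lone 'r' are syllables in their own right, not r-coloured forms.
    hasSuffix = syllable.endswith('r') and syllable not in ('er', 'r')
    if not hasSuffix:
        return [syllable]
    if erhua == 'oneSyllable':
        return [syllable]
    if erhua == 'twoSyllables':
        return [syllable[:-1], 'r']
    if erhua == 'ignore':
        raise ValueError('r-coloured syllable %r is not supported' % syllable)
    raise ValueError('unknown Erhua setting %r' % erhua)


print(splitErhua('tour', 'twoSyllables'))
print(splitErhua('tour', 'oneSyllable'))
```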
Source
- Yǐn Bīnyōng (尹斌庸), Mary Felley (傅曼丽): Chinese romanization:
Pronunciation and Orthography (汉语拼音和正词法). Sinolingua, Beijing,
1990, ISBN 7-80052-148-6, ISBN 0-8351-1930-0.
To Do (Impl):
- ISO 7098 asks for conversion of 。、·「」 to .,-«». What about ,?《》:-?
Implement a method for conversion to be optionally used.
- Strict testing of tone mark placement. Currently it doesn't matter
where tones are placed. All combinations are recognised.
- Special marker for neutral tone: 'mȧ' (u'mȧ', reported by Ching-song
Gene Hsiao: A Manual of Transcription Systems For Chinese, 中文拼音手册. Far
Eastern Publications, Yale University, New Haven, Connecticut, 1985,
ISBN 0-88710-141-0.), and '·ma' (u'\xb7ma', check!: 现代汉语词典(第5版)[Xiàndài
Hànyǔ Cídiǎn 5. Edition]. 商务印书馆 [Shāngwù Yìnshūguǎn], Beijing, 2005,
ISBN 7-100-04385-9.)
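The punctuation conversion named in the first item could be sketched as a
plain character translation. This is only an illustration in Python 3: the
names below are hypothetical, the method does not exist in the class, and
only the five pairs listed above are covered.

```python
# Character-by-character mapping of the five pairs named in the To Do item.
ISO_7098_PUNCTUATION = str.maketrans('。、·「」', '.,-«»')


def convertPunctuation(text):
    """Replace the listed Chinese punctuation marks, leave the rest alone."""
    return text.translate(ISO_7098_PUNCTUATION)


print(convertPunctuation('「nǐ hǎo」。'))
```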
compose(self, readingEntities)
Composes the given list of basic entities to a string.
isStrictDecomposition(self, readingEntities)
Checks if the given decomposition follows the Pinyin format strictly
for unambiguous decomposition: syllables have to be preceded by an
apostrophe if the decomposition would be ambiguous otherwise.
Inherited from TonalRomanisationOperator: isPlainReadingEntity, isReadingEntity
Inherited from RomanisationOperator: decompose, getDecompositionTree,
getDecompositions, segment
Inherited from ReadingOperator: getOption
Inherited from object: __delattr__, __getattribute__, __hash__, __new__,
__reduce__, __reduce_ex__, __repr__, __setattr__, __str__
READING_NAME = 'Pinyin'
Unique name of reading
TONEMARK_VOWELS = [u'a', u'e', u'i', u'o', u'u', u'ü', u'n', u...
List of characters of the nucleus possibly carrying the tone mark.
TONEMARK_MAP = {u'\u0300': 4, u'\u0301': 2, u'\u0304': 1, u'\u030c': 3}
Mapping of Combining Diacritical Marks to their Pinyin tone
index.
PINYIN_SOUND_REGEX = re.compile(r'(?i)^([^aeiuo\xfc]*)([aeiuo\...
Regular Expression matching onset, nucleus and coda.
toneMarkRegex = re.compile(r'[\u0301\u0300\u0304\u030c]')
Regular Expression matching the Pinyin tone marks.
tonemarkMapReverse = {1: u'\u0304', 2: u'\u0301', 3: u'\u030c', 4: u'\u0300'}
Inherited from RomanisationOperator: readingEntityRegex
Inherited from object: __class__
|
Creates an instance of the PinyinOperator.
The class instance can be configured by different options given as
keywords.
- Parameters:
options - extra options
dbConnectInst - instance of a DatabaseConnector; if none is given, default
settings will be assumed.
strictSegmentation - if True, segmentation (using segment()) and thus
decomposition (using decompose()) will raise an exception if an
alphabetic string is parsed which cannot be segmented into
single reading entities. If False, the aforesaid
string will be returned unsegmented.
toneMarkType - if set to 'Diacritics', tones will be marked using
diacritic marks; if set to 'Numbers', appended
numbers from 1 to 5 will be used to mark tones; if set to
'None', no tone marks will be used and no tonal
information will be supplied at all.
missingToneMark - if set to 'fifth', no tone mark is set to indicate
the fifth tone (qingsheng, e.g. 'wo3men'
stands for 'wo3men5'); if set to
'noinfo', no tone information will be deduced when
no tone mark is found (takes on value None); if set
to 'ignore', this entity will not be valid and for
segmentation the behaviour defined by
'strictSegmentation' will take effect. This option
is only valid for the tone mark type 'Numbers'.
yVowel - a character (or string) that is taken as alternative for ü
which depicts (among others) the close front rounded vowel [y]
(IPA) in Pinyin and includes an umlaut. Changes forms of
syllables nü, nüe, lü, lüe. This option is not valid for
the tone mark type 'Diacritics'.
PinyinApostrophe - an alternate apostrophe that is taken instead of the
default one.
PinyinApostropheFunction - a function that indicates when a syllable
combination needs to be split by an apostrophe, see
aeoApostropheRule() for the default implementation.
Erhua - if set to 'ignore', no special support will be
provided for retroflex -r at syllable end (Erhua), i.e.
zher will raise an exception. If set to
'twoSyllables', syllables with an appended r are
given/will be segmented into two syllables, the -r suffix making
up one syllable itself as 'r'. If set to
'oneSyllable', syllables with an appended r are
given/will be segmented into one syllable only.
- Overrides:
object.__init__
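How the three 'missingToneMark' settings differ for the tone mark type
'Numbers' can be sketched on a single entity. The function below is a
hypothetical stand-alone illustration in Python 3 and not the operator's
code; in particular, raising ValueError for 'ignore' is an assumption
standing in for "this entity will not be valid".

```python
def readNumberedTone(entity, missingToneMark):
    """Return (plain entity, tone) for an entity with an optional tone number."""
    if entity and entity[-1] in '12345':
        return entity[:-1], int(entity[-1])
    # No trailing number: behaviour depends on the option.
    if missingToneMark == 'fifth':
        return entity, 5
    if missingToneMark == 'noinfo':
        return entity, None
    if missingToneMark == 'ignore':
        raise ValueError('entity %r carries no tone mark' % entity)
    raise ValueError('unknown missingToneMark setting %r' % missingToneMark)


print(readNumberedTone('wo3', 'fifth'))
print(readNumberedTone('men', 'fifth'))
print(readNumberedTone('men', 'noinfo'))
```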
|
Returns the reading operator's default options.
The default implementation returns an empty dictionary. The keyword
'dbConnectInst' is not regarded as a configuration option of the operator
and is thus not included in the dict returned.
- Returns: dict
- the reading operator's default options.
- Overrides:
ReadingOperator.getDefaultOptions
- (inherited documentation)
|
Gets a list of Pinyin vowels with diacritical marks for tones.
The alternative for vowel ü does not need diacritical forms as the
standard form doesn't allow changing the vowel.
- Returns: list of str
- list of Pinyin vowels with diacritical marks
|
guessReadingDialect(cls, string, includeToneless=False)
Class Method
Takes a string written in Pinyin and guesses the reading dialect.
The basic options 'toneMarkType', 'yVowel' and 'Erhua' are guessed.
Unless 'includeToneless' is set to True, only the tone
mark types 'Diacritics' and 'Numbers' are
considered, as the latter one can also represent the state of missing
tones. Strings tested for 'yVowel' are ü, v and u:. 'Erhua' is set to
'twoSyllables' by default and only tested when
'toneMarkType' is assumed to be set to 'Numbers'.
- Parameters:
string (str) - Pinyin string
- Returns: dict
- dictionary of basic keyword settings
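A heavily simplified sketch of such guessing, in Python 3, is given below.
It only covers 'toneMarkType' and 'yVowel', does not look at 'Erhua' or
'includeToneless', and uses plain substring tests where the real method may
well weigh its evidence differently; all names are assumptions for
illustration.

```python
import re
import unicodedata

# The four combining marks used for Pinyin tones 2, 4, 1 and 3.
TONE_MARK_REGEX = re.compile('[\u0301\u0300\u0304\u030c]')


def guessDialectSketch(string):
    """Guess tone mark type and ü representation from simple surface clues."""
    # Decompose so that precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize('NFD', string)
    if TONE_MARK_REGEX.search(decomposed):
        toneMarkType = 'Diacritics'
    else:
        toneMarkType = 'Numbers'
    lowered = string.lower()
    yVowel = 'ü'
    for candidate in ('u:', 'v'):
        if candidate in lowered:
            yVowel = candidate
            break
    return {'toneMarkType': toneMarkType, 'yVowel': yVowel}


print(guessDialectSketch('nv3hai2'))
```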
|
Returns a set of tones supported by the reading. These tones don't
necessarily reflect the tones of the underlying language but may differ
in order to reflect notational or other features.
The default implementation will raise a NotImplementedError.
- Returns: list
- list of supported tone marks.
- Overrides:
TonalFixedEntityOperator.getTones
- (inherited documentation)
|
Composes the given list of basic entities to a string. Applies an
apostrophe between syllables if needed using default implementation aeoApostropheRule().
- Parameters:
readingEntities (list of str) - list of basic syllables or other content
- Returns: str
- composed entities
- Overrides:
ReadingOperator.compose
|
Removes apostrophes between two syllables for a given
decomposition.
- Parameters:
readingEntities (list of str) - list of basic syllables or other content
- Returns: list of str
- the given entity list without separating apostrophes
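The idea can be sketched on a plain list of entities as follows. This is a
stand-alone Python 3 illustration with hypothetical names; treating any
entity that starts with a letter as a syllable is a simplifying assumption.

```python
def removeSeparatingApostrophes(readingEntities, apostrophe="'"):
    """Drop apostrophes that stand directly between two syllable-like entities."""
    def looksLikeSyllable(entity):
        return entity[:1].isalpha()

    result = []
    last = len(readingEntities) - 1
    for i, entity in enumerate(readingEntities):
        isSeparator = (entity == apostrophe and 0 < i < last
                       and looksLikeSyllable(readingEntities[i - 1])
                       and looksLikeSyllable(readingEntities[i + 1]))
        if not isSeparator:
            result.append(entity)
    return result


print(removeSeparatingApostrophes(['xi', "'", 'an']))
```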
|
aeoApostropheRule(self, precedingEntity, followingEntity)
Checks if the given entities need to be separated by an
apostrophe.
Returns true for syllables starting with one of the three vowels
a, e, o having a preceding syllable. Additionally
forms n and ng are separated from preceding syllables.
Furthermore the corner case e'r will be handled to distinguish it from
er.
This function serves as the default apostrophe rule.
- Parameters:
precedingEntity (str) - the preceding syllable or any other content
followingEntity (str) - the following syllable or any other content
- Returns: bool
- true if the syllables need to be separated, false otherwise
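A stand-alone approximation of this rule in Python 3 is sketched below. It
covers the a/e/o check and the forms n and ng but deliberately leaves out
the e'r corner case; the names and the way tone marks and numbers are
stripped are assumptions, not the method's actual code.

```python
import unicodedata


def plainForm(entity):
    """Lower-case the entity and strip diacritics and a trailing tone number."""
    decomposed = unicodedata.normalize('NFD', entity.lower())
    letters = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return letters.rstrip('12345')


def aeoRuleSketch(precedingEntity, followingEntity):
    """Approximate the a/e/o rule; the e'r corner case is not handled here."""
    if not precedingEntity or not followingEntity:
        return False
    # Assumption: only an entity starting with a letter counts as a syllable.
    if not precedingEntity[0].isalpha():
        return False
    following = plainForm(followingEntity)
    return following[:1] in ('a', 'e', 'o') or following in ('n', 'ng')


print(aeoRuleSketch('xi', 'an'), aeoRuleSketch('ni', 'hao'))
```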
|
isStrictDecomposition(self, readingEntities)
Checks if the given decomposition follows the Pinyin format strictly
for unambiguous decomposition: syllables have to be preceded by an
apostrophe if the decomposition would be ambiguous otherwise.
The function given as option 'PinyinApostropheFunction' is used to
check if an apostrophe should have been placed.
- Parameters:
readingEntities (list of str) - decomposed reading string
- Returns: bool
- true if decomposition is strict, false otherwise
- Overrides:
RomanisationOperator.isStrictDecomposition
|
Gets the entity with tone mark for the given plain entity and
tone.
The default implementation will raise a NotImplementedError.
- Parameters:
plainEntity - entity without tonal information
tone - tone
- Returns: str
- entity with appropriate tone
- Raises:
- Overrides:
TonalFixedEntityOperator.getTonalEntity
- (inherited documentation)
|
Places a tone mark on the given syllable nucleus according to the
rules of the Pinyin standard.
- Parameters:
nucleus (str) - syllable nucleus
tone (int) - tone index (starting with 1)
- Returns: str
- nucleus with appropriate tone
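The commonly taught placement rule (a or e takes the mark, in ou the o
takes it, otherwise the last vowel does) can be sketched in Python 3 as
below. The function name is hypothetical and the method's real behaviour may
differ in details; the tone-to-mark table matches tonemarkMapReverse as
listed for this class.

```python
import unicodedata

# Tone index to combining mark, as in tonemarkMapReverse.
TONE_MARKS = {1: '\u0304', 2: '\u0301', 3: '\u030c', 4: '\u0300'}


def placeToneMark(nucleus, tone):
    """Return the nucleus with the tone mark on the vowel chosen by the rule."""
    if tone not in TONE_MARKS:
        return nucleus  # e.g. fifth tone: no mark is written
    lower = nucleus.lower()
    for vowel in 'ae':
        if vowel in lower:
            index = lower.index(vowel)
            break
    else:
        if 'ou' in lower:
            index = lower.index('o')
        else:
            index = len(nucleus) - 1
    marked = nucleus[:index + 1] + TONE_MARKS[tone] + nucleus[index + 1:]
    # Compose base letter and mark into a single character where possible.
    return unicodedata.normalize('NFC', marked)


print(placeToneMark('ia', 1), placeToneMark('ou', 3), placeToneMark('ui', 4))
```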
|
Splits the entity into an entity without tone mark and the entity's
tone index.
The plain entity returned will always be in Unicode's Normalization
Form C (NFC, see http://www.unicode.org/reports/tr15/).
- Parameters:
entity (str) - entity with tonal information
- Returns: tuple
- plain entity without tone mark and entity's tone index (starting
with 1)
- Raises:
- Overrides:
TonalFixedEntityOperator.splitEntityTone
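For the tone mark type 'Diacritics', the split can be sketched with
Unicode normalisation: decompose the entity, pick out the combining tone
mark, and recompose the remainder to NFC. This is a Python 3 illustration
with a hypothetical function name; returning None when no mark is present
is an assumption and not necessarily what the method does.

```python
import unicodedata

# Combining mark to tone index, as in TONEMARK_MAP.
TONEMARK_MAP = {'\u0300': 4, '\u0301': 2, '\u0304': 1, '\u030c': 3}


def splitDiacriticTone(entity):
    """Return (plain entity in NFC, tone index or None)."""
    tone = None
    plain = []
    for char in unicodedata.normalize('NFD', entity):
        if char in TONEMARK_MAP:
            tone = TONEMARK_MAP[char]
        else:
            plain.append(char)
    # Recompose so that e.g. u + diaeresis becomes the single character ü.
    return unicodedata.normalize('NFC', ''.join(plain)), tone


print(splitDiacriticTone('hǎo'))
```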
|
Gets the list of plain entities supported by this reading. Unlike
those returned by getReadingEntities(), these entities carry no tone
mark.
Depending on the type of Erhua support either additional syllables
with an ending -r are added, or a single r is included. The user
specified character for vowel ü will be used.
- Returns: set of str
- set of supported syllables
- Overrides:
TonalFixedEntityOperator.getPlainReadingEntities
|
Gets a set of all entities supported by the reading.
The list is used in the segmentation process to find entity
boundaries.
- Returns: list of str
- list of supported syllables
- Overrides:
TonalFixedEntityOperator.getReadingEntities
- (inherited documentation)
|
Splits the given plain syllable into onset (initial) and rhyme
(final).
Pinyin can't be separated into onset and rhyme clearly within its own
system. There are syllables with the same final written differently (e.g.
wei and dui, both ending in a final that can be described by
uei) and reduction of vowels (same example: dui, which is
pronounced with vowels uei). This method will use three forms not
found as substrings in Pinyin (uei, uen and iou) and
substitutes the (pseudo) initials w and y with their vowel
equivalents.
Furthermore, final i will be distinguished in three forms given
by the following three examples: yi, zhi and zi, to
express the phonological difference.
- Parameters:
plainSyllable (str) - syllable without tone marks
- Returns: tuple of str
- tuple of entity onset and rhyme
- Raises:
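The substitutions described above can be approximated by the following
Python 3 sketch. It replaces the pseudo initials w and y, expands the
reduced finals to uei, uen and iou, and rewrites u after j, q, x as ü. It
does not distinguish the three forms of final i, and the function name and
tables are assumptions for illustration only.

```python
INITIALS = ('zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
            'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's')
VOWELS = 'aeiouü'
EXPANDED_FINALS = {'ui': 'uei', 'un': 'uen', 'iu': 'iou'}


def splitOnsetRhyme(plainSyllable):
    """Return (onset, rhyme) for a toneless syllable written with ü."""
    syllable = plainSyllable.lower()
    if syllable.startswith('yu'):
        onset, rhyme = '', 'ü' + syllable[2:]    # yu -> ü, yue -> üe
    elif syllable.startswith('yi'):
        onset, rhyme = '', syllable[1:]          # yi -> i, ying -> ing
    elif syllable.startswith('y'):
        onset, rhyme = '', 'i' + syllable[1:]    # you -> iou, ya -> ia
    elif syllable.startswith('wu'):
        onset, rhyme = '', syllable[1:]          # wu -> u
    elif syllable.startswith('w'):
        onset, rhyme = '', 'u' + syllable[1:]    # wei -> uei, wen -> uen
    else:
        onset, rhyme = '', syllable
        for initial in INITIALS:
            rest = syllable[len(initial):]
            # Only split if a vowel follows, so that n and ng stay whole.
            if syllable.startswith(initial) and rest and rest[0] in VOWELS:
                onset, rhyme = initial, rest
                break
    if onset in ('j', 'q', 'x') and rhyme.startswith('u'):
        rhyme = 'ü' + rhyme[1:]                  # ju -> j + ü
    return onset, EXPANDED_FINALS.get(rhyme, rhyme)


print(splitOnsetRhyme('dui'), splitOnsetRhyme('wei'))
```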
|
TONEMARK_VOWELS
List of characters of the nucleus possibly carrying the tone mark.
n is included in standalone syllables n and ng.
r is used for supporting Erhua in a two syllable form.
- Value:
[u'a', u'e', u'i', u'o', u'u', u'ü', u'n', u'm', u'r', u'ê']
|
|
TONEMARK_MAP
Mapping of Combining Diacritical Marks to their Pinyin tone
index.
See Also:
- The Unicode Consortium: The Unicode Standard, Version 5.0.0,
Chapter 7, European Alphabetic Scripts, 7.9 Combining Marks,
defined by: The Unicode Standard, Version 5.0 (Boston, MA,
Addison-Wesley, 2007. ISBN 0-321-48091-0), http://www.unicode.org/versions/Unicode5.0.0/
- Unicode: Combining Diacritical Marks, Range:
0300-036F: http://www.unicode.org/charts/PDF/U0300.pdf
- Unicode: FAQ - Characters and Combining Marks: http://unicode.org/faq/char_combmark.html
- Value:
{u'\u0300': 4, u'\u0301': 2, u'\u0304': 1, u'\u030c': 3}
|
|
PINYIN_SOUND_REGEX
Regular Expression matching onset, nucleus and coda. Syllables 'n',
'ng', 'r' (for Erhua) and 'ê' have to be handled separately.
- Value:
re.compile(r'(?i)^([^aeiuo\xfc]*)([aeiuo\xfc]*)([^aeiuo\xfc]*)$')
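Applied to a plain syllable, the expression yields the three groups onset,
nucleus and coda. The sketch below reuses the pattern as listed, under
Python 3; the helper name is hypothetical.

```python
import re

PINYIN_SOUND_REGEX = re.compile(
    r'(?i)^([^aeiuo\xfc]*)([aeiuo\xfc]*)([^aeiuo\xfc]*)$')


def splitSound(plainSyllable):
    """Return (onset, nucleus, coda), or None if the string does not match."""
    match = PINYIN_SOUND_REGEX.match(plainSyllable)
    if match is None:
        return None
    return match.groups()


print(splitSound('zhuang'))
print(splitSound('ng'))
```

The second call shows why 'ng' needs separate handling: the whole string
ends up in the onset group and the nucleus stays empty.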