Package cjklib :: Package reading :: Module operator :: Class RomanisationOperator

Class RomanisationOperator

Defines an abstract ReadingOperator for text written in a romanisation, i.e. text written in the Latin alphabet or in the Cyrillic alphabet.

In addition to decompose(), provided by the class ReadingOperator, this class offers a method getDecompositions() that returns all possible decompositions in an ambiguous case.

This class cannot be used directly; it has to be subclassed and extended.

Decomposition

Transcriptions into the Latin alphabet create the problem that syllable boundaries, or boundaries of entities belonging to single Chinese characters, are no longer clear once entities are grouped together.

Therefore it is important to have methods at hand to separate these strings and to split them into single entities. This cannot always be done in a clear and unambiguous way, as several different decompositions might be possible, which leads to the general case of ambiguous decompositions.

Many romanisations do provide a way to tackle this problem. Pinyin for example requires the use of an apostrophe (') when the reverse process of splitting the string into syllables gets ambiguous. The Wade-Giles romanisation in its strict implementation asks for a hyphen used between all syllables. The LSHK's Jyutping when written with tone marks will always be clearly decomposable.

The method isStrictDecomposition() can be implemented to check if one possible decomposition is the strict decomposition offered by the romanisation's protocol. This method should guarantee that under all circumstances only one decomposed version will be regarded as strict.

If no strict version is yielded and different decompositions exist, an unambiguous decomposition cannot be made. All possible decompositions can be accessed through the method getDecompositions(), even in cases where a strict decomposition exists.

To Do (Impl): Optimise decompose() so that it incorporates segment() and prunes the tree while it is created. It is unclear whether this would yield a significant improvement; it would at least be O(n).

Instance Methods

__init__(self, **options)
Creates an instance of the RomanisationOperator.

decompose(self, string)
Decomposes the given string into basic entities on a one-to-one mapping level to Chinese characters.
Returns: list of str

getDecompositionTree(self, string)
Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions and returns the possible decompositions as a lattice.
Returns: list

getDecompositions(self, string)
Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions.
Returns: list of list of str

segment(self, string)
Takes a string written in the romanisation and returns the possible segmentations as a list of syllables.
Returns: list of list of str

_recursiveSegmentation(self, string)
Takes a string written in the romanisation and returns the possible segmentations as a tree of syllables.
Returns: list of tuple

_hasMergeableSyllables(self, decomposition)
Checks if the given decomposition has two or more following syllables which together make up a new syllable.
Returns: bool

isStrictDecomposition(self, decomposition)
Checks if the given decomposition follows the romanisation format strictly to allow unambiguous decomposition.
Returns: bool

_hasSyllableSubstring(self, string)
Checks if the given string is a syllable supported by this romanisation or a substring of one.
Returns: bool

isReadingEntity(self, entity)
Returns true if the given entity is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.
Returns: bool

getReadingEntities(self)
Gets a set of all entities supported by the reading.
Returns: set of str

Inherited from ReadingOperator: compose, getOption

Inherited from object: __delattr__, __getattribute__, __hash__, __new__, __reduce__, __reduce_ex__, __repr__, __setattr__, __str__

Class Methods

getDefaultOptions(cls)
Returns the reading operator's default options.
Returns: dict

Static Methods

_crossProduct(singleLists)
Calculates the cross product (aka Cartesian product) of sets given as lists.
Returns: list of list

_treeToList(tupleTree)
Converts a tree to a list containing all full paths from root to leaf node.
Returns: list of list

Class Variables

readingEntityRegex = re.compile(r'([A-Za-z]+)')
Regular Expression for finding romanisation entities in input.

Inherited from ReadingOperator: READING_NAME
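
As a rough illustration of how a pattern like readingEntityRegex can separate romanisation entities from other characters, the following sketch applies the same regular expression with re.split(). The helper name split_entities is hypothetical and not part of cjklib; how the class itself uses the pattern is not shown in this documentation.

```python
import re

# Same pattern as the class variable readingEntityRegex.
reading_entity_regex = re.compile(r'([A-Za-z]+)')


def split_entities(string):
    """Split input into alphabetic runs and the text in between (sketch)."""
    # Because the pattern has a capturing group, split() keeps the matched
    # alphabetic runs in the result. Empty strings at the edges are dropped.
    return [part for part in reading_entity_regex.split(string) if part]


print(split_entities("Beijing, nihao!"))
# prints ['Beijing', ', ', 'nihao', '!']
```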

Properties

Inherited from object: __class__

Method Details

__init__(self, **options)
(Constructor)

Creates an instance of the RomanisationOperator.

Parameters:

options - extra options
dbConnectInst - instance of a DatabaseConnector; if none is given, default settings will be assumed.
strictSegmentation - if True, segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which cannot be segmented into single reading entities. If False, the aforesaid string will be returned unsegmented.
case - if set to 'lower' or 'upper', only lower or upper case will be supported, respectively; if set to 'both', both upper and lower case will be supported.

Overrides: object.__init__

getDefaultOptions(cls)
Class Method

Returns the reading operator's default options.

The default implementation returns an empty dictionary. The keyword 'dbConnectInst' is not regarded as a configuration option of the operator and is thus not included in the dict returned.

Returns: dict: the reading operator's default options.
Overrides: ReadingOperator.getDefaultOptions: (inherited documentation)

decompose(self, string)

Decomposes the given string into basic entities on a one-to-one mapping level to Chinese characters. Decomposing can be ambiguous, and two assumptions are made to resolve this. First, if two subsequent entities together make up a longer valid entity, the decomposition with the shorter entities can be disregarded. Second, it is assumed that the reading provides rules to mark entity borders and that these rules can be checked, so that the decomposition that abides by these rules will be preferred. This check is done by calling isStrictDecomposition().

The given input string can contain other characters not supported by the reading, e.g. punctuation marks. The returned list then contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.

Parameters:

string (str) - reading string

Returns: list of str

a list of basic entities of the input string

Raises:

AmbiguousDecompositonError - if decomposition is ambiguous.
DecompositionError - if the given string has a wrong format.

Overrides: ReadingOperator.decompose
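
The two assumptions described for decompose() can be sketched as a small selection routine. This is a standalone illustration, not the library's implementation: choose_decomposition, AmbiguousSketchError and the callables passed in are hypothetical names, and the candidate decompositions are assumed to have been computed already.

```python
class AmbiguousSketchError(Exception):
    """Stand-in for the library's ambiguity error (hypothetical name)."""


def choose_decomposition(candidates, has_mergeable, is_strict):
    """Pick a single decomposition from several candidates (sketch)."""
    if len(candidates) == 1:
        return candidates[0]
    # Assumption 1: drop candidates whose neighbouring entities could be
    # merged into a longer valid entity.
    remaining = [c for c in candidates if not has_mergeable(c)]
    if len(remaining) == 1:
        return remaining[0]
    # Assumption 2: prefer the candidate that follows the romanisation's
    # own rules for marking entity borders.
    strict = [c for c in remaining if is_strict(c)]
    if len(strict) == 1:
        return strict[0]
    raise AmbiguousSketchError("no unambiguous decomposition found")


# Toy syllable set, chosen only for this example.
syllables = {"ti", "an", "tian"}


def mergeable(decomposition):
    # True if any two neighbouring entries join into a known syllable.
    return any("".join(decomposition[i:i + 2]) in syllables
               for i in range(len(decomposition) - 1))


print(choose_decomposition([["ti", "an"], ["tian"]],
                           mergeable, lambda d: False))
# prints ['tian']
```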

getDecompositionTree(self, string)

Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions and returns the possible decompositions as a lattice.

Parameters:

string (str) - reading string

Returns: list

a list of all possible decompositions consisting of basic entities as a lattice construct.

Raises:

DecompositionError - if the given string has a wrong format.

getDecompositions(self, string)

Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions. This method is a more general version of decompose().

The returned list construction consists of two entity types: entities of the romanisation and other strings.

Parameters:

string (str) - reading string

Returns: list of list of str

a list of all possible decompositions consisting of basic entities.

Raises:

DecompositionError - if the given string has a wrong format.
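
To make the shape of the result more concrete, here is a minimal standalone sketch that produces all decompositions of a string containing both romanisation entities and other characters. It does not call cjklib; get_decompositions, segmentations and the toy syllable set are hypothetical, and the intermediate list of alternatives is only an assumed stand-in for the lattice mentioned above. Alphabetic runs that cannot be segmented are passed through unchanged here, which is a simplification.

```python
import itertools
import re

ENTITY_REGEX = re.compile(r"([A-Za-z]+)")


def segmentations(string, syllables):
    """Return every way to split string into known syllables."""
    if not string:
        return [[]]
    results = []
    for end in range(1, len(string) + 1):
        head = string[:end]
        if head in syllables:
            for tail in segmentations(string[end:], syllables):
                results.append([head] + tail)
    return results


def get_decompositions(string, syllables):
    """Combine per-part alternatives into full decompositions (sketch)."""
    alternatives_per_part = []
    for part in ENTITY_REGEX.split(string):
        if not part:
            continue
        if ENTITY_REGEX.fullmatch(part):
            # Alphabetic run: one alternative per possible segmentation,
            # or the run itself if it cannot be segmented.
            options = segmentations(part, syllables) or [[part]]
        else:
            # Other characters are kept as a single non-entity string.
            options = [[part]]
        alternatives_per_part.append(options)
    # The Cartesian product picks one alternative per part; sum() then
    # concatenates the chosen lists into one flat decomposition.
    return [sum(choice, [])
            for choice in itertools.product(*alternatives_per_part)]


print(get_decompositions("xian!", {"xi", "an", "xian"}))
# prints [['xi', 'an', '!'], ['xian', '!']]
```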

segment(self, string)

Takes a string written in the romanisation and returns the possible segmentations as a list of syllables.

In contrast to decompose(), this method merely segments continuous entities of the romanisation. Characters that are not part of the romanisation will not be dealt with; this is the task of the more general decompose() method.

Parameters:

string (str) - reading string

Returns: list of list of str

a list of possible segmentations (several if ambiguous) into single syllables

Raises:

DecompositionError - if the given string has an invalid format.
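
A minimal sketch of such a segmentation, including the behaviour described for the strictSegmentation option, might look as follows. The function, the exception class and the syllable set are hypothetical stand-ins written for this example; they are not the library's code.

```python
class SegmentationSketchError(Exception):
    """Stand-in for the library's DecompositionError (hypothetical name)."""


def segment(string, syllables, strict=True):
    """Return all segmentations of string into known syllables (sketch)."""

    def walk(rest):
        # An empty remainder means the preceding path is complete.
        if not rest:
            return [[]]
        found = []
        for end in range(1, len(rest) + 1):
            if rest[:end] in syllables:
                found.extend([rest[:end]] + tail for tail in walk(rest[end:]))
        return found

    result = walk(string)
    if result:
        return result
    if strict:
        raise SegmentationSketchError("cannot segment %r" % string)
    # Non-strict mode: hand the string back unsegmented.
    return [[string]]


print(segment("xian", {"xi", "an", "xian"}))
# prints [['xi', 'an'], ['xian']]
```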

_recursiveSegmentation(self, string)

Takes a string written in the romanisation and returns the possible segmentations as a tree of syllables.

The tree is represented by tuples (syllable, subtree).

Parameters:

string (str) - reading string

Returns: list of tuple

a tree of possible segmentations (if ambiguous) into single syllables
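
The (syllable, subtree) representation can be illustrated with a short recursive sketch. It is written independently of the library; the function name and the convention of using None as the subtree of a final syllable are assumptions made for this example.

```python
def recursive_segmentation(string, syllables):
    """Build a list of (syllable, subtree) tuples for string (sketch)."""
    tree = []
    for end in range(1, len(string) + 1):
        head, rest = string[:end], string[end:]
        if head not in syllables:
            continue
        if not rest:
            # The syllable ends the string, so it has no subtree.
            tree.append((head, None))
        else:
            subtree = recursive_segmentation(rest, syllables)
            # Keep this branch only if the remainder can be segmented.
            if subtree:
                tree.append((head, subtree))
    return tree


print(recursive_segmentation("xian", {"xi", "an", "xian"}))
# prints [('xi', [('an', None)]), ('xian', None)]
```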

_hasMergeableSyllables(self, decomposition)

Checks if the given decomposition has two or more following syllables which together make up a new syllable.

Segmentation can give several results with some possible syllables being even further subdivided (e.g. tian to ti'an in Pinyin). These segmentations are only secondary and the segmentation with the longer syllables will be the one to take.

Parameters:

decomposition (list of str) - decomposed reading string

Returns: bool

True if following syllables make up a syllable
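
One way to express this check is sketched below. It is a standalone approximation, assuming the set of valid syllables is passed in explicitly; the function name and the syllable set are hypothetical.

```python
def has_mergeable_syllables(decomposition, syllables):
    """True if two or more consecutive entries join into a syllable."""
    count = len(decomposition)
    for start in range(count - 1):
        # Try every run of at least two consecutive entries.
        for end in range(start + 2, count + 1):
            if "".join(decomposition[start:end]) in syllables:
                return True
    return False


syllables = {"ti", "an", "tian"}
print(has_mergeable_syllables(["ti", "an"], syllables))   # prints True
print(has_mergeable_syllables(["tian"], syllables))       # prints False
```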

isStrictDecomposition(self, decomposition)

Checks if the given decomposition follows the romanisation format strictly to allow unambiguous decomposition.

The romanisation should offer a way or protocol to make an unambiguous decomposition into its basic syllables possible, so that the process of appending syllables to a string is reversible. The test for compliance with this protocol has to be implemented here. Thus, for any string, this method may return True for one and only one possible decomposition.

Parameters:

decomposition (list of str) - decomposed reading string

Returns: bool

False, as this method needs to be implemented by the subclass
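
As an illustration of what a subclass might check, the sketch below encodes a simplified form of the Pinyin apostrophe rule: a syllable beginning with a, e or o must not directly follow another syllable. It is a standalone function rather than an actual override, the names and the syllable set are hypothetical, and it assumes the apostrophe appears as its own list entry.

```python
def is_strict_decomposition(decomposition, syllables):
    """Simplified apostrophe rule for a Pinyin-like romanisation."""
    previous = None
    for entity in decomposition:
        starts_with_vowel = entity in syllables and entity[0] in "aeo"
        # A vowel-initial syllable directly after another syllable lacks
        # the separating apostrophe, so the decomposition is not strict.
        if starts_with_vowel and previous in syllables:
            return False
        previous = entity
    return True


syllables = {"xi", "an", "xian"}
print(is_strict_decomposition(["xi", "'", "an"], syllables))  # prints True
print(is_strict_decomposition(["xi", "an"], syllables))       # prints False
print(is_strict_decomposition(["xian"], syllables))           # prints True
```

With this rule, exactly one of the two readings of "xian" (['xian'], not ['xi', 'an']) counts as strict, which is the property the method is meant to guarantee.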

_hasSyllableSubstring(self, string)

Checks if the given string is a syllable supported by this romanisation or a substring of one.

Parameters:

string (str) - romanisation syllable or substring

Returns: bool

True if this string is a syllable or a substring of a syllable, False otherwise
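
A direct, unoptimised way to express this test is sketched below. The function name is hypothetical, the syllable set is passed in explicitly, and lower-casing the input is an assumption of this sketch; the library may well use a precomputed lookup instead.

```python
def has_syllable_substring(string, syllables):
    """True if string is a syllable or occurs inside one (sketch)."""
    needle = string.lower()
    # Python's "in" on strings tests for a contiguous substring, and a
    # string is always a substring of itself.
    return any(needle in syllable for syllable in syllables)


print(has_syllable_substring("ia", {"xian", "ni"}))  # prints True
print(has_syllable_substring("q", {"xian", "ni"}))   # prints False
```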

isReadingEntity(self, entity)

Returns true if the given entity is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.

Reading entities are handled case-insensitively.

Parameters:

entity (str) - entity to check

Returns: bool

True if string is an entity of the reading, False otherwise.

Overrides: ReadingOperator.isReadingEntity
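
The following sketch shows one plausible reading of how case-insensitive matching could interact with the 'case' option of the constructor. It is an assumption-laden illustration, not the library's logic: the function name is hypothetical and the syllable set is assumed to be stored in lower case.

```python
def is_reading_entity(entity, syllables, case="both"):
    """Case-aware membership test against lower-case syllables (sketch)."""
    # Reject input in the wrong case when only one case is supported.
    if case == "lower" and entity != entity.lower():
        return False
    if case == "upper" and entity != entity.upper():
        return False
    # Otherwise compare without regard to case.
    return entity.lower() in syllables


print(is_reading_entity("Xian", {"xian"}))                # prints True
print(is_reading_entity("Xian", {"xian"}, case="lower"))  # prints False
```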

getReadingEntities(self)

Gets a set of all entities supported by the reading.

The set is used in the segmentation process to find entity boundaries. The default implementation raises a NotImplementedError.

Returns: set of str: set of supported syllables

_crossProduct(singleLists)
Static Method

Calculates the cross product (aka Cartesian product) of sets given as lists.

Example:

>>> RomanisationOperator._crossProduct([['A', 'B'], [1, 2, 3]])
[['A', 1], ['A', 2], ['A', 3], ['B', 1], ['B', 2], ['B', 3]]

Parameters:

singleLists (list of list) - a list of list entries containing various elements

Returns: list of list

the cross product of the given sets
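
For comparison, the same result can be obtained with the standard library's itertools.product. This is an equivalent formulation offered as a sketch; it is not claimed to be how _crossProduct is implemented, and the function name cross_product is hypothetical.

```python
import itertools


def cross_product(single_lists):
    """Cartesian product of the given lists, as a list of lists."""
    # product() yields tuples, so each one is converted to a list.
    return [list(combination)
            for combination in itertools.product(*single_lists)]


print(cross_product([['A', 'B'], [1, 2, 3]]))
# prints [['A', 1], ['A', 2], ['A', 3], ['B', 1], ['B', 2], ['B', 3]]
```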

_treeToList(tupleTree)
Static Method

Converts a tree to a list containing all full paths from root to leaf node.

The tree is given by tuples (node element, subtree), where the subtree of a leaf node is None.

Example:

>>> RomanisationOperator._treeToList(
...     ('A', [('B', None), ('C', [('D', None), ('E', None)])]))
[['A', 'B'], ['A', 'C', 'D'], ['A', 'C', 'E']]

Parameters:

tupleTree (tuple) - a tree realised through a tuple of a node and a subtree

Returns: list of list

a list of all paths contained by the given tree
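
The conversion can be sketched as a short recursive function that reproduces the example above. It is written from the description alone and is not the library's code; the name tree_to_list is hypothetical.

```python
def tree_to_list(tuple_tree):
    """Return all root-to-leaf paths of a (node, subtree) tree."""
    node, subtree = tuple_tree
    if not subtree:
        # A leaf: the only path is the node itself.
        return [[node]]
    paths = []
    for child in subtree:
        # Prefix every path of the child with the current node.
        for path in tree_to_list(child):
            paths.append([node] + path)
    return paths


print(tree_to_list(
    ('A', [('B', None), ('C', [('D', None), ('E', None)])])))
# prints [['A', 'B'], ['A', 'C', 'D'], ['A', 'C', 'E']]
```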
