Module xmlre

Support for regular expressions conformant to the XML Schema specification.

For the most part, XML regular expressions are similar to the POSIX ones, and can be handled by the Python re module. The exceptions are multi-character escapes (e.g., \w), category escapes (e.g., \p{N} or \p{IPAExtensions}), and character set subtraction. This module supports those by scanning the regular expression and replacing the category escapes with equivalent charset expressions. It also detects the subtraction syntax and modifies the charset expression to remove the unwanted code points.
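As an illustration of the subtraction step, the following minimal sketch keeps a class as a plain set of code points, removes the unwanted ones, and renders the result as a Python character class. The names to_charset and subtract are hypothetical helpers written for this example; they are not part of this module, which uses pyxb.utils.unicode.CodePointSet for the same purpose.

```python
import re

def to_charset(code_points):
    """Render a collection of code points as a Python character class."""
    ordered = sorted(set(code_points))
    if not ordered:
        raise ValueError("empty character class")
    parts = []
    start = prev = ordered[0]
    # The trailing None is a sentinel that flushes the final run.
    for cp in ordered[1:] + [None]:
        if cp is not None and cp == prev + 1:
            prev = cp  # still inside a contiguous run
            continue
        if start == prev:
            parts.append(re.escape(chr(start)))
        else:
            parts.append(re.escape(chr(start)) + "-" + re.escape(chr(prev)))
        if cp is not None:
            start = prev = cp
    return "[" + "".join(parts) + "]"

def subtract(base, removed):
    """Return the charset for `base` with the code points in `removed` taken out."""
    return to_charset(set(base) - set(removed))

# XML Schema's [a-z-[q]] expressed without the subtraction syntax.
charset = subtract(range(ord("a"), ord("z") + 1), [ord("q")])
print(charset)
```

The resulting expression can be passed to re directly, since it no longer relies on syntax that Python does not understand.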

The basic technique is to step through the characters of the regular expression, entering a recursive-descent parser when one of the translated constructs is encountered.
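The scanning loop can be sketched as follows. This is a simplified illustration, not the module's code: translate and maybe_match_subtraction are hypothetical names, and the handler is a deliberately tiny stand-in for a real sub-parser that only understands classes of the form [letters-[letters]] with no ranges or escapes.

```python
import re

# Toy stand-in for a sub-parser: plain lowercase letters on both sides only.
_SIMPLE_SUBTRACTION = re.compile(r"\[([a-z]+)-\[([a-z]+)\]\]")

def maybe_match_subtraction(text, position):
    """Return (replacement, next_position), or None if nothing starts here."""
    m = _SIMPLE_SUBTRACTION.match(text, position)
    if m is None:
        return None
    kept = [ch for ch in m.group(1) if ch not in m.group(2)]
    if not kept:
        raise ValueError("subtraction leaves an empty class")
    return "[" + "".join(kept) + "]", m.end()

def translate(pattern, handlers):
    """Walk the pattern, delegating to a handler wherever one applies."""
    out = []
    position = 0
    while position < len(pattern):
        for handler in handlers:
            result = handler(pattern, position)
            if result is not None:
                replacement, position = result
                out.append(replacement)
                break
        else:
            # No handler claimed this offset: copy the text through unchanged.
            # An escaped character is copied as a pair so "\[" never starts a class.
            width = 2 if pattern[position] == "\\" else 1
            out.append(pattern[position:position + width])
            position += width
    return "".join(out)

print(translate("[abcd-[bc]]x+", [maybe_match_subtraction]))
```

Each handler either declines by returning None or reports both its replacement text and the offset at which scanning should resume, which keeps the outer loop independent of the constructs being translated.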

There is a nice set of XML regular expressions at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xsd, with a sample document at http://www.xmlschemareference.com/examples/Ch14/regexpDemo.xml

Classes
  RegularExpressionError
Raised when a regular expression cannot be processed.
Functions
  _InitializeAllEsc()
Set the values in _AllEsc without introducing k and v into the module.
  _MatchCharClassEsc(text, position)
Parse a charClassEsc term.
  _MatchPosCharGroup(text, position)
Parse a posCharGroup term.
  _MatchCharClassExpr(text, position)
Parse a charClassExpr.
  MaybeMatchCharacterClass(text, position)
Attempt to match a character class expression.
  XMLToPython(pattern)
Convert the given pattern to the format required for Python regular expressions.
Variables
  _log = logging.getLogger(__name__)
  _AllEsc = {}
  _CharClassEsc_re = re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<char...
  __package__ = 'pyxb.utils'
Function Details

_MatchCharClassEsc(text, position)

Parse a charClassEsc term.

This is one of:

  • SingleCharEsc, an escaped single character such as \n
  • MultiCharEsc, an escape code that can match a range of characters, e.g. \s to match certain whitespace characters
  • catEsc, the \p{...} Unicode property escapes including categories and blocks
  • complEsc, the \P{...} inverted Unicode property escapes

If the parsing fails, throws a RegularExpressionError.
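To give a feel for what a category escape expands to, here is a rough sketch that collects the code points of a Unicode general category using the standard unicodedata module. It rests on several assumptions: the function names are hypothetical, the scan is limited to the Basic Multilingual Plane for speed, and the Unicode tables bundled with Python may differ from those this module uses. Block names such as IPAExtensions are not covered, because unicodedata does not expose block membership.

```python
import unicodedata

BMP_LIMIT = 0x10000  # assumption: Basic Multilingual Plane only, to keep the scan fast

def category_code_points(category, limit=BMP_LIMIT):
    """Return the code points below `limit` whose general category starts with `category`.

    "N" selects every number category (Nd, Nl, No); "Nd" selects decimal digits only.
    """
    return {
        cp for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith(category)
    }

def complement(code_points, limit=BMP_LIMIT):
    """Return the inverted set, as the capital-P form of the escape requires."""
    return set(range(limit)) - set(code_points)

decimal_digits = category_code_points("Nd")
print(ord("5") in decimal_digits, ord("a") in decimal_digits)
```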

Returns:
A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the character class, and p is the text offset immediately following the escape sequence.
Raises:
  • RegularExpressionError - if the escape sequence cannot be parsed.

_MatchPosCharGroup(text, position)

Parse a posCharGroup term.

Returns:
A tuple (cps, fs, p) where:
  • cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the group;
  • fs is a bool that is True if the next character is the - in a charClassSub and False if the group is not part of a charClassSub;
  • p is the text offset immediately following the closing bracket.
Raises:
  • RegularExpressionError - if the expression cannot be processed.

_MatchCharClassExpr(text, position)

Parse a charClassExpr.

These are XML regular expression classes such as [abc], [a-c], [^abc], or [a-z-[q]].
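The forms above can be handled by a small recursive-descent routine, sketched here under stated assumptions: parse_char_class is a hypothetical name, escapes inside the class are not handled, the universe used for negation is restricted to ASCII, and ValueError stands in for RegularExpressionError. The return value mirrors the (cps, p) convention, with a plain set in place of a CodePointSet.

```python
UNIVERSE = frozenset(range(0x80))  # assumption: ASCII only, to keep the sketch small

def parse_char_class(text, position):
    """Parse a class starting at text[position] == '['.

    Returns (code_points, next_position), where next_position is the offset
    immediately after the closing ']'.
    """
    if position >= len(text) or text[position] != "[":
        raise ValueError("expected '[' at offset %d" % position)
    position += 1
    negate = text.startswith("^", position)
    if negate:
        position += 1
    members = set()
    subtracted = set()
    while True:
        if position >= len(text):
            raise ValueError("missing closing ']'")
        ch = text[position]
        if ch == "]":
            position += 1
            break
        if ch == "-" and text.startswith("[", position + 1):
            # Subtraction: recurse into the nested class, then expect our own ']'.
            subtracted, position = parse_char_class(text, position + 1)
            if not text.startswith("]", position):
                raise ValueError("expected ']' at offset %d" % position)
            position += 1
            break
        is_range = (text.startswith("-", position + 1)
                    and position + 2 < len(text)
                    and text[position + 2] not in "[]")
        if is_range:
            low, high = ord(ch), ord(text[position + 2])
            if low > high:
                raise ValueError("reversed range at offset %d" % position)
            members.update(range(low, high + 1))
            position += 3
        else:
            members.add(ord(ch))
            position += 1
    if negate:
        members = set(UNIVERSE) - members
    return members - subtracted, position

cps, p = parse_char_class("[a-z-[q]]", 0)
print(sorted(chr(cp) for cp in cps), p)
```

Negation is applied before subtraction, so a class such as [^abc-[x]] yields everything except a, b, c, and x.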

Parameters:
  • text - The complete text of the regular expression being translated. The first character must be the [ starting a character class.
  • position - The offset of the start of the character group.
Returns:
A pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the character class, and p is the text offset immediately following the closing bracket.
Raises:
  • RegularExpressionError - if the expression cannot be processed.

MaybeMatchCharacterClass(text, position)

Attempt to match a character class expression.

Parameters:
  • text - The complete text of the regular expression being translated
  • position - The offset of the start of the potential expression.
Returns:
None if position does not begin a character class expression; otherwise a pair (cps, p) where cps is a pyxb.utils.unicode.CodePointSet containing the code points associated with the character class, and p is the text offset immediately following the closing bracket.

XMLToPython(pattern)

Convert the given pattern to the format required for Python regular expressions.

Parameters:
  • pattern - The XML Schema regular expression to convert.
Returns:
A Unicode string specifying a Python regular expression that matches the same language as pattern.

Variables Details

_CharClassEsc_re

Value:
re.compile(r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))')
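The named groups in this expression distinguish property escapes from all other escapes. The sketch below compiles the same pattern and shows which groups are populated for a few inputs; the classify helper and its result labels are hypothetical and only illustrate how the groups might be inspected.

```python
import re

CHAR_CLASS_ESC = re.compile(
    r'\\(?:(?P<cgProp>[pP]\{(?P<charProp>[-A-Za-z0-9]+)\})|(?P<cgClass>[^pP]))'
)

def classify(text, position=0):
    """Return (kind, name, next_position) for the escape at `position`, or None."""
    m = CHAR_CLASS_ESC.match(text, position)
    if m is None:
        return None
    if m.group('cgProp') is not None:
        # A leading capital P marks the inverted form of the escape.
        kind = 'complEsc' if m.group('cgProp')[0] == 'P' else 'catEsc'
        return (kind, m.group('charProp'), m.end())
    # Any other escaped character: a single- or multi-character escape.
    return ('classEsc', m.group('cgClass'), m.end())

for sample in (r'\p{IPAExtensions}', r'\P{N}', r'\s', r'\p'):
    print(sample, classify(sample))
```

An escape that starts with \p or \P but lacks a braced property name matches neither alternative, so the helper returns None for it.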