9. Unicode Compliance

jsre provides Level 1 Unicode support as defined in Unicode Technical Standard #18, Unicode Regular Expressions, version 1.7.

The module supports all of the following:

  • Binary Properties (e.g. \p{Alphabetic}).
  • General Category Properties (e.g. \pP).
  • Scripts and Script Extensions.
  • Line_Break properties (e.g. \p{line_break=hyphen}).
  • Numeric_Type properties (e.g. \p{numeric_type=decimal}).

Property specification within the regular expression pattern is flexible: case does not matter, ‘-’ and ‘_’ are interchangeable, and general categories and scripts may be referenced by value alone, without the property name (e.g. \p{greek} as well as \p{script=greek}).
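The loose matching rules above can be illustrated with a small normalisation routine. This is only a sketch of the idea, not jsre's actual implementation: the helper name normalize_property_spec is hypothetical, and folding to lower case with ‘_’ is an arbitrary choice of canonical form.

```python
from typing import Optional, Tuple


def normalize_property_spec(spec: str) -> Tuple[Optional[str], str]:
    """Fold a \\p{...} body into a canonical (name, value) pair.

    Hypothetical helper, for illustration only. Case is ignored and
    '-' is treated the same as '_'. A bare value such as 'Greek'
    yields (None, 'greek'), leaving the caller to decide whether it
    names a script, a general category or a binary property.
    """
    def fold(part: str) -> str:
        return part.strip().lower().replace("-", "_")

    name, sep, value = spec.partition("=")
    if sep:
        return fold(name), fold(value)
    return None, fold(name)


print(normalize_property_spec("Line-Break=Hyphen"))
print(normalize_property_spec("line_break=hyphen"))
print(normalize_property_spec("Greek"))
```

Under this scheme \p{Line-Break=Hyphen} and \p{line_break=hyphen} reduce to the same key, so a single table lookup can serve both spellings.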

Some special properties are also supported. Annex C of UTS #18 recommends a set of compatibility properties for use in regular expressions, which extend and combine the standard character classes. These are:

lower, upper, punct, digit, xdigit, alnum, space, blank, cntrl, graph, print, word

word is defined as in UTS #18 and includes digits; \w uses the same definition. The zero-width tests \b and \B also use this definition to determine word boundaries. (Note that the more extensive word-break algorithm given in Unicode Standard Annex #29 is not used.)
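As a rough illustration of how a word boundary follows from a per-character word test, the sketch below approximates the UTS #18 definition with Python's standard unicodedata module. It is an approximation under stated assumptions, not jsre code: the Alphabetic property is stood in for by the letter categories plus Nl, and the function names are hypothetical.

```python
import unicodedata

# Join_Control code points: ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER.
JOIN_CONTROLS = {"\u200c", "\u200d"}


def is_word_char(ch: str) -> bool:
    """Approximate the UTS #18 'word' property for one character.

    Assumption: letters (L*) and letter numbers (Nl) stand in for the
    Alphabetic property, which unicodedata does not expose directly.
    """
    cat = unicodedata.category(ch)
    return (
        cat[0] in ("L", "M")            # letters and combining marks
        or cat in ("Nd", "Nl", "Pc")    # digits, letter numbers, connectors
        or ch in JOIN_CONTROLS
    )


def is_word_boundary(text: str, i: int) -> bool:
    """True where exactly one side of position i is a word character."""
    before = i > 0 and is_word_char(text[i - 1])
    after = i < len(text) and is_word_char(text[i])
    return before != after


text = "ab 12"
print([i for i in range(len(text) + 1) if is_word_boundary(text, i)])
```

Because digits count as word characters, no boundary is reported between a letter and an adjacent digit.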

The \X test for Extended Grapheme Cluster boundaries implements the extended version of the specification given in Unicode Standard Annex #29.

Some additional properties defined in UTS #18 1.2.1 and 1.6 are also supported:

any, assigned, ascii

Note that any matches every code point, unlike ‘.’, which omits newline characters unless the DOTALL flag is set.

The property:

newline

is provided to specify the set of newline characters in UTS #18 1.6, i.e. the familiar \u000A, \u000B, etc., as well as Unicode characters such as \u2028.
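For reference, the line boundary characters listed in UTS #18 1.6 can be written out explicitly. The sketch below is plain Python rather than jsre code, and it assumes that the newline property covers exactly this set; the name split_lines is hypothetical. It treats the sequence \u000D\u000A as a single line break, as UTS #18 recommends.

```python
# Line boundary characters from UTS #18 1.6:
# LF, VT, FF, CR, NEL, LINE SEPARATOR, PARAGRAPH SEPARATOR.
UTS18_NEWLINES = frozenset("\u000a\u000b\u000c\u000d\u0085\u2028\u2029")


def split_lines(text: str) -> list:
    """Split text at UTS #18 newline characters, treating CR LF as one break."""
    lines = []
    start = 0
    i = 0
    while i < len(text):
        if text[i] in UTS18_NEWLINES:
            lines.append(text[start:i])
            # Consume the LF of a CR LF pair together with the CR.
            if text[i] == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            i += 1
            start = i
        else:
            i += 1
    lines.append(text[start:])
    return lines


print(split_lines("a\r\nb\u2028c\x0bd"))
```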

The actual support for Unicode properties depends on the encoding used for searching; for non-Unicode encodings (e.g. CP1250) properties are interpreted as the set of code points that can be represented under that encoding.
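The effect of a non-Unicode encoding on a property can be sketched by intersecting the property's code points with those the encoding can represent. The example uses only the Python standard library and is an illustration of the idea, not jsre's implementation; the function name representable and the choice of the Lu (uppercase letter) general category are assumptions made for the example.

```python
import unicodedata


def representable(code_points, encoding: str) -> set:
    """Return the subset of code_points that can be encoded in `encoding`."""
    result = set()
    for cp in code_points:
        try:
            chr(cp).encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable under this encoding
        result.add(cp)
    return result


# All code points whose general category is Lu (uppercase letter).
uppercase = {
    cp for cp in range(0x110000)
    if unicodedata.category(chr(cp)) == "Lu"
}

# The same property as seen through CP1250.
uppercase_cp1250 = representable(uppercase, "cp1250")

print(ord("Ž") in uppercase_cp1250)  # Latin capital Z with caron
print(ord("Ω") in uppercase_cp1250)  # Greek capital omega
```

Under CP1250 the restricted set still contains Central European letters such as Ž but drops Greek capitals, which that code page cannot encode.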