modgrammar – The Modular Grammar Engine

This module provides a full-featured pure-python framework for building tokenizing LR language parsers and interpreters for context-free grammars. (The modgrammar parsing engine is implemented as a recursive-descent parser with backtracking, using an object-oriented grammar model.)

The modgrammar parser is designed such that language grammars can be defined in python modules using standard python syntax. To create a new grammar, simply create a new class definition (or multiple class definitions) derived from the Grammar base class, and set its grammar attribute to a list of sub-grammars to match. (Such definitions can be combined together into full grammar trees.) Several basic pre-defined grammar constructs are also available in this module which larger grammars can be built up from.

Once a grammar is defined, the parser() method can be called on the toplevel grammar class to obtain a GrammarParser object, which can be used to parse text against the defined grammar.

Grammar Classes

class modgrammar.Grammar

This class is not intended to be instantiated directly. Instead, it is a base class to be used for defining your own custom grammars.

Subclasses of Grammar serve two purposes. The first is to actually define the grammar used for parsing. The second is to serve as a base type for result objects created as the result of parsing.

To define a new grammar, you should create a new class definition, descended from Grammar. In this class definition, you can override several class attributes and class methods to customize the behavior of the grammar.

Class Attributes

Grammar.grammar

A list of sub-grammars used to make up this grammar. This attribute almost always needs to be specified when defining a new grammar class, and is the way in which full grammars can be built up from smaller grammar elements. In order for this grammar to match an input text, it must completely match each of the sub-grammars listed in its grammar attribute, in order.

Grammar.grammar_tags

A list of “tags” to be associated with match result objects produced by this grammar. These tags can be used with the find() and find_all() methods to extract specific elements from a parse tree after a successful match.

Grammar.grammar_collapse

If set to True, indicates that this grammar element should be “collapsed” when constructing the final parse tree. Any place in the parse tree where an instance of this grammar would normally occur will instead be replaced by the list of its sub-elements. This can be used to make result parse trees simpler in cases where a grammar element has been defined for convenience of the grammar definition, but does not provide a lot of useful information in the resulting parse tree.

Grammar.grammar_name

A string used to identify this grammar. This is used in a variety of places, including repr(), str(), generate_ebnf(), etc. (Defaults to the name of the grammar class.)

Grammar.grammar_desc

A descriptive string for the grammar to be used in ParseError error messages. (Defaults to the same value as grammar_name.)

Grammar.grammar_noteworthy

If False, this grammar will not be mentioned as one of the possible matches when constructing ParseError error messages (this can be used for grammars which are so ubiquitous or implied that it’s not generally necessary to mention them to the user).

Grammar.grammar_error_override

If set, this grammar will report all match failures by its subgrammars as if it had failed itself. This effectively “hides” the subgrammars in any ParseError (which will use this grammar’s location and grammar_desc instead when constructing error messages, etc).

Grammar.grammar_whitespace_mode

Determines how whitespace is treated between sub-elements of this grammar. It can be set to one of the following values:

‘explicit’ (default)
Whitespace between sub-grammars will not be treated differently than any other syntactic element. Any whitespace which is allowed must be specified explicitly as part of the grammar definition (for example, using WHITESPACE), and any extra whitespace will be considered an error.
‘optional’
The grammar will automatically skip over any whitespace found between its sub-grammars (it will be “whitespace consuming”).
‘required’
Like ‘optional’, the grammar will skip over whitespace, but it will also require that there must be some amount of whitespace between each of its sub-grammars (if two sub-grammars occur in the input without any whitespace between them, that will be considered an error).

Note

In general, you will want to set this universally for your whole grammar. The best way to do this is to define a grammar_whitespace_mode module-level variable in the same module as your grammar classes are defined. If this is present, it will be used as the default for all grammar classes in that module.

Grammar.grammar_whitespace

In the case where you want a grammar to be “whitespace consuming” but want something other than the normal definition of “whitespace”, you can also set grammar_whitespace to a custom regular expression object to be used instead. This regular expression should attempt to match as much whitespace as possible, starting at the specified position in the string (the actual match result is not used, except that its length is used to determine how far to skip ahead in the string).

Modgrammar comes with two standard whitespace regexps which can be used “out of the box” for grammar_whitespace:

Name Meaning
WS_DEFAULT Any whitespace characters (default)
WS_NOEOL Any whitespace characters except EOL characters

(For more information on what constitutes whitespace and EOL characters, see Whitespace and Newline Handling)

Similar to grammar_whitespace_mode, you can also set grammar_whitespace as a module-level variable, in which case it will be used as the default for all grammar classes in that module.

There are also a few less-commonly-used class attributes which may be useful when inspecting grammars, or may be overridden in special cases:

Grammar.grammar_terminal

Determines whether this grammar is considered to be a “terminal” for the purposes of terminals(), tokens(), and generate_ebnf(). By default, any grammar which contains no sub-grammars (an empty grammar attribute) is considered to be a terminal, and any grammar which has sub-grammars is not.

Grammar.grammar_greedy

If set to True (default), indicates that in cases where this grammar could match multiple instances of a sub-text (i.e. for grammars that match repetitions), it should attempt to match the longest possible string first. By contrast, if set to False, the grammar will attempt to match the shortest repetition first.

Note

This attribute does not have any affect on most custom grammars (because most custom grammars are not themselves repetition grammars (instances of Repetition)). If you are looking to change this behavior in your own grammar definitions, you likely want to use the greedy parameter of REPEAT() (and related functions) instead. Changing this attribute is mainly useful if for some reason you want to make a custom subclass of Repetition, or if you are making a custom grammar element (with a custom grammar_parse() definition) for which this setting might be significant.

Grammar.grammar_collapse_skip

Specifies that, if an enclosing grammar is set to collapse, and this grammar is in its sub-grammar list, instances of this sub-grammar should also be left out of the resulting parse tree.

Note

There is usually no reason to set this attribute. (It is enabled by default for LITERAL() grammars, as it is often desirable to leave literal matches out when collapsing grammars since they usually provide no information which isn’t already known to the grammar designer.)

Overridable Class Methods

The following methods may be overridden in subclasses to change the default behavior of a grammar class:

classmethod Grammar.grammar_details(depth=-1, visited=None)

Returns a string containing a description of the contents of this grammar definition (such as used by repr()).

depth specifies a recursion depth to use when constructing the string description. If depth is nonzero, this method will recursively call grammar_details() for each of its sub-grammars in turn to construct the final description. If depth is zero, this method will just return this grammar’s name (grammar_name). If depth is negative, recursion depth is not limited.

visited is used for detecting circular references during recursion. It can contain a tuple of grammars which have already been seen and should not be descended into again.

classmethod Grammar.grammar_ebnf_lhs(opts)

Determines the string to be used for this grammar when it occurs in the left-hand-side (LHS) of an EBNF definition. This can be overridden to customize how this grammar is represented by generate_ebnf().

Returns a tuple (string, grammars), where string is the EBNF LHS string to use, and grammars is a list of other grammars on which this one depends (i.e. grammars whose names are referenced in string).

classmethod Grammar.grammar_ebnf_rhs(opts)

Determines the string to be used to describe this grammar when it occurs in the right-hand-side (RHS) of an EBNF definition. This can be overridden to customize how this grammar is represented by generate_ebnf().

Returns a tuple (string, grammars), where string is the EBNF RHS string to use, and grammars is a list of other grammars on which this one depends (i.e. grammars whose names are referenced in string).

classmethod Grammar.grammar_parse(text, index, session)

This method is called by the modgrammar parser system to actually attempt to match this grammar against a piece of text. This method is not intended to be called directly by an application (use the parser() method to obtain a GrammarParser object and use that). In advanced cases, this method can be overridden to provide custom parsing behaviors for a particular grammar type.

Note

Overriding this method is very complicated and currently beyond the scope of this documentation. This is not recommended for anyone who does not understand the modgrammar parser design very well. (Someday, with luck, there will be some more documentation written on this advanced topic.)

Useful Class Methods

The following methods are intended to be called on grammar classes by the application:

classmethod Grammar.parser(sessiondata=None, tabs=1, debug=False, debug_flags=None)

Return a GrammarParser associated with this grammar.

If provided, sessiondata can contain data which should be provided to the grammar_elem_init() method of each result object created during parsing.

The tabs parameter indicates the width of “tab stops” in the input (i.e. how far a “tab” character will advance the column position when encountered). This is only used to correctly report column numbers in ParseErrors. If you don’t care about that, or your input does not contain tabs, you can ignore this parameter.

The debug and debug_flags options control whether and how debugging information will be output while using this parser. For more information on grammar debugging, see the modgrammar.debugging module documentation.

classmethod Grammar.grammar_resolve_refs(refmap={}, recurse=True, follow=False, missing_ok=False, skip=None)

Resolve any REF() declarations within the grammar and replace them with the actual sub-grammars they refer to. The following optional arguments can be provided:

refmap
If provided, contains a dictionary of reference-name to grammar mappings to use. If a reference’s name is found in this dictionary, the dictionary value will be used to replace it, instead of using the standard name-lookup procedure.
recurse
If set to True, will perform a recursive search into each of this grammar’s sub-grammars, calling grammar_resolve_refs() on each with the same parameters.
follow
If set to True (and recurse is also True), will also call grammar_resolve_refs() on the result of each REF() after it is resolved.
missing_ok
If True, it is not considered an error if a REF() construct cannot be resolved at this time (it will simply be left as a REF() in the resulting grammar). If False, then all references must be resolvable or an UnresolvedReference exception will be raised.
skip
An optional list of grammars which should not be searched for REF() constructs (useful in conjunction with recurse to exclude certain parts of the grammar).

Result Objects

As match result objects are actually instances of the grammar class which produced the match, it is also possible, when defining a new grammar class, to override or add new instance methods which will affect the behavior of any associated result objects. Result objects also posess a number of attributes and methods which can be useful when examining parse results.

Overridable Instance Methods

Grammar.grammar_elem_init(sessiondata)

This method is called on each result object after it is fully initialized, before the resulting parse tree is returned to the caller. It can be overridden to perform any custom initialization desired (the default implementation does nothing).

Note: If grammar_collapse is True, this method is not called, and grammar_collapsed_elems() is called instead.

Grammar.grammar_collapsed_elems(sessiondata)

Return the list of elements to be used in place of this one when collapsing (this is only used if grammar_collapse is True).

Useful Instance Attributes

Grammar.string

Contains the portion of the text string which this match corresponds to.

Grammar.elements

Contains result objects for each of the sub-grammars that make up this grammar match. There is typically one entry in elements for each entry in grammar (though there may not be a direct correspondence if things like grammar_collapse are used)

Useful Instance Methods

Grammar.get(*tt_path)

Return the first immediate sub-element of the given type/tag (or by descending through multiple types/tags, in the same way as get_all()).

This is equivalent to .get_all(*tt_path)[0] except that it is more efficient, and will return None if there are no such objects (instead of raising IndexError).

Grammar.get_all(*tt_path)

Return all immediate sub-elements of the given type, or posessing the given tag.

If more than one type/tag parameter is provided, it will treat the types as a “path” to traverse: for all sub-elements matching the first type/tag, retrieve all sub-elements of those matching the second type/tag, and so on, until it reaches the last type/tag in the list. (It will thus return all elements of the parse tree which are of the final type/tag, which can be reached by traversing the previous types/tags in order.)

Note that types and tags can be intermingled, as well.

Grammar.find(*tt_path)

Return the first element anywhere in the parse tree matching the given type/tag (or by descending through multiple types/tags, in the same way as find_all()).

This is equivalent to .find_all(*tt_path)[0] except that it is more efficient, and will return None if there are no such objects (instead of raising IndexError).

Grammar.find_all(*tt_path)

Return all elements anywhere in the parse tree matching the given type/tag.

Similar to get_all(), if more than one type/tag parameter is provided, it will treat them as a “path” to traverse in order. The difference from get_all() is that, for each step in the path, the elements found do not have to be direct sub-elements, but can be anywhere in the sub-tree.

Grammar.terminals()

Return an ordered list of all result objects in the parse tree which are terminals (that is, where grammar_terminal is True).

Grammar.tokens()

Return the parsed string, broken down into its smallest grammatical components. (Another way of looking at this is that it returns the string values of all of the terminals().)

Parser Objects

class modgrammar.GrammarParser

Parser objects are the way in which an application can actually make use of a grammar definition. They provide the core interface to take input texts and attempt to match them against an associated grammar definition.

GrammarParser objects are not generally instantiated directly. Instead, to obtain one, call the parser() method on the appropriate grammar class.

Parser objects have the following useful attributes:

char

The number of characters we’ve successfully parsed since the beginning of parsing (or the last reset()).

line

The number of lines we’ve successfully parsed since the beginning of parsing (or the last reset()). This is measured based on the number of line-end sequences we’ve seen thus far.

col

The position within the current line we’re at.

Useful Methods

GrammarParser.parse_text(string, bol=None, eof=None, reset=False, multi=False, data=None, matchtype='first')

Add string to the parse buffer and attempt to match it against the associated grammar. If successful, returns a corresponding match object. If there is an incomplete match (or it is impossible to determine yet whether the match is complete or not), save the current text in the match buffer and return None to indicate more text is required. If the text does not match any valid grammar construction, raise ParseError.

Optional parameters:
reset
Call reset() before starting to parse the supplied text.
multi
Instead of returning a single match result, keep matching as many times as possible before returning, and return a list of matches, in sequence.
eof
Indicates that no more text will be coming, and the parser should return the best match it can with the supplied text instead of asking for more. (If eof is set, the parser will never return a None result, unless the buffer is completely empty.)
data
Use the provided data instead of the default sessiondata during this parse run.
matchtype
If a grammar could match multiple ways, determine how the best match is chosen:
‘first’ (default)
The first successful match the grammar comes up with (as determined by normal grammar test ordering).
‘last’
The last successful match.
‘longest’
The match which uses up the longest portion of the input text.
‘shortest’
The match which uses up the shortest portion of the input text.
‘complete’
The first match which exactly matches the input text (i.e. there is no remainder).
‘all’
Return all possible matches, in a list. Note that in this case the buffer position will not be automatically advanced. You must call skip() manually.
bol
Treat the input text as starting at the beginning of a line (for the purposes of matching the BOL grammar element). It is not usually necessary to specify this explicitly.
GrammarParser.parse_string(string, data=None)

Assume that string is a complete self-contained text and attempt to match it (exactly) against the associated grammar. This will discard any data previously in the buffer and will require that the grammar match the entire string (i.e. no remainder is allowed).

This is equivalent to .parse_text(string, reset=True, eof=True, matchtype='complete')

GrammarParser.parse_lines(lines, bol=False, eof=False, reset=False, data=None, matchtype='first')

(generator method)

Attempt to match a list (or actually any iterable) of strings against the associated grammar. This is effectively the same as calling parse_text() repeatedly for each string in the list to obtain all matches in sequence.

Return values, exceptions, and optional parameters are all exactly the same as for parse_text().

Note

Be careful using matchtype="all" with parse_lines/parse_file. You must manually call skip() after each yielded match, or you will end up with an infinite loop!

GrammarParser.parse_file(file, encoding=None, bol=False, eof=True, reset=False, data=None, matchtype='first')

(generator method)

Open and process the contents of a file using the associated grammar. This is basically the same as opening the specified file, and passing the resulting file object to parse_lines().

file and encoding are passed directly to open(). Return values, exceptions, and other optional parameters are all exactly the same as for parse_text().

Note

Be careful using matchtype="all" with parse_lines/parse_file. You must manually call skip() after each yielded match, or you will end up with an infinite loop!

GrammarParser.remainder()

Return the remaining contents of the parse buffer. After parsing, this will contain whatever portion of the original text was not used by the parser up to this point.

GrammarParser.clear_remainder()

Clear any un-matched text left in the buffer.

GrammarParser.reset()

Reset this parser back to its initial state.

This will clear any remainder in the buffer and reset all (line, column, etc) counters to zero.

GrammarParser.skip(count)

Skip forward the specified number of characters in the input buffer (discarding the text skipped over).

Built-In Grammars

The following basic grammar classes/factories are provided from which more complicated grammars can be constructed. For those that take arguments, in addition to the arguments listed, there are a number of standard keyword arguments which can also be provided to alter the default behaviors:

modgrammar.GRAMMAR(*subgrammars, **kwargs)

Allows the construction of “anonymous grammars”, that is, creating a grammar without explicitly defining a named class derived from the Grammar superclass. This can be useful for some simple grammars where a full class definition is not necessary.

subgrammars is a list of other grammars which the new grammar should be made up of, the same as would be given as the grammar attribute in a grammar class definition.

modgrammar.G(*subgrammars, **kwargs)

This is a synonym for GRAMMAR()

modgrammar.LITERAL(string, **kwargs)

Create a simple grammar that only matches the specified literal string. Literal matches are case-sensitive.

modgrammar.L(string, **kwargs)

This is a synonym for LITERAL()

modgrammar.WORD(startchars, restchars=None, fullmatch=False, escapes=False, **kwargs)

Match any text consisting of a sequence of the specified characters. If restchars is not provided, all characters in the sequence must be in the set specified by startchars. If restchars is provided, then startchars specifies the valid options for the first character of the sequence, and restchars specifies the valid options for all following characters.

startchars and restchars are each strings containing a sequence of individual characters, or character ranges, in the same format used by python regular expressions for character-range ([]) operations (i.e. "0123456789" or "A-Za-z"). If the first character of startchars or restchars is ^, the meaning is also inverted, just as in regular expressions, so "^A-Z" would match anything except an upper-case ascii alphabet character.

If fullmatch is true, then this construct will only match the longest possible sequence of matching characters in the input, and will not try to match any shorter sub-sequences (even if they would work better in the context of the larger grammar). Effectively, this means any match returned is guaranteed to be followed by a non-word character or token.

By default, only ^ (when at the beginning) and - (when not at the beginning or end) are interpreted specially. In particular, backslashes and brackets ([ and ]) have no special meaning when in startchars or restchars and will result in the grammar matching a backslash or a bracket in the input (this is different than when specifying character-ranges in regular expressions, where these characters have special meaning). However, if escapes is True, backslash-sequences in startchars and restchars will be treated the same as in Python regular expressions. In particular, this means you can use \d, \D, \s, \S, \w, and \W, just as in regular expressions (see the documentation for the re module for more information).

modgrammar.ANY_EXCEPT(charlist, **kwargs)

Match a string of any characters except those listed in charlist.

This is functionally equivalent to WORD("^"+charlist).

modgrammar.OR(*grammars, **kwargs)

An either-or grammar that will successfully match if any of its subgrammars matches. OR() grammars can also be created by combining other grammars in python expressions using the or operator (|).

Note

Each of the possible grammars are attempted in left-to-right order. This means that if more than one of the listed grammars could potentially match, the leftmost one will always match first.

modgrammar.EXCEPT(grammar, exc_grammar, **kwargs)

Match grammar, but only if it does not also match exception_grammar. (This is equivalent to the - (exception) operator in EBNF) EXCEPT() grammars can also be created by combining other grammars in python expressions using the except operator (-).

Note

In many cases there are more efficient ways to design a particular grammar than using this construct. It is provided mostly for full EBNF compatibility.

modgrammar.REPEAT(*grammar, **kwargs)

Match (by default) one-or-more repetitions of grammar, one right after another. If the min or max keyword parameters are provided, the number of matches can be restricted to a particular range.

modgrammar.OPTIONAL(*grammar, **kwargs)

Specify that grammar is optional. It will match if present, or it will match the empty string if grammar cannot be matched.

If grammar is present in the matched input text, this element of the parse tree will contain a single grammar match object (the same as it would have if GRAMMAR(*grammar) had matched). If grammar was not found, this element of the parse tree will contain None.

This construct is functionally equivalent to OR(grammar, EMPTY). (It is is also functionally similar to REPEAT(*grammar, min=0, max=1, collapse=True), except that in the case of REPEAT(), an empty-match produces no elements at all in the resulting parse tree (not even None).)

modgrammar.ZERO_OR_MORE(*grammar, **kwargs)

This is a synonym for REPEAT(*grammar, min=0)

modgrammar.ONE_OR_MORE(*grammar, **kwargs)

This is a synonym for REPEAT(*grammar, min=1)

modgrammar.LIST_OF(*grammar, sep=", ", **kwargs)

Match a list consisting of repetitions of grammar separated by sep. As with other repetition grammars, the min and max keywords can also be used to restrict the number of matches to a certain range.

Note

Although this is most commonly used with a literal separator (such as the default ","), actually any (arbitrarily-complex) subgrammar can be specified for sep if desired.

modgrammar.NOT_FOLLOWED_BY(*grammar, **kwargs)

Returns a successful match as long as the next text after this point does NOT match the specified grammar.

When successful (that is, the next text in the input does not match the specified grammar), this element of the parse tree will contain None, and no input text will be consumed. When unsuccessful (that is, the next text does match), a ParseError will be raised.

modgrammar.ANY

Match any single character.

modgrammar.EMPTY

Match the empty string.

Note

In most cases, None is also equivalent to EMPTY

modgrammar.BOL

Match the beginning of a line.

This grammar does not actually consume any of the input text, but can be used to ensure that the next token must occur at the beginning of a new line (i.e. either the beginning of the file, or following an EOL).

modgrammar.EOL

Match the end of a line.

This grammar will match any standard line-end character or character sequence. For more information on what sequences are considered an “end of line”, see Whitespace and Newline Handling. If you need something more specific, you may just want to use LITERAL() instead.

modgrammar.EOF

Match the end of the file.

Note

This grammar will only match if the parse function is called with eof=True to indicate the end-of-file has been encountered.

modgrammar.REST_OF_LINE

Match everything up to (but not including) the next EOL.

modgrammar.WHITESPACE

Match any string of whitespace. For more information on what is considered whitespace, see Whitespace and Newline Handling.

Note

This may not match as you expect if your grammar is whitespace-consuming (see the grammar_whitespace_mode attribute).

modgrammar.SPACE

Match any string of whitespace except EOL characters. For more information on what is considered whitespace and “end of line” characters, see Whitespace and Newline Handling.

Note

This may not match as you expect if your grammar is whitespace-consuming (see the grammar_whitespace_mode attribute).

The modgrammar.extras module also contains some additional built-in grammars which can be useful in some contexts.

References

In some cases, it is necessary to refer to a portion of your grammar before it has actually been defined (for example, for recursive grammar definitions). In this case, the REF() function can be used to refer to a grammar by name, which will be resolved to an actual grammar later. (This construct can also be used to define a grammar which includes some “user-defined” sub-grammar, which the calling application can then provide at runtime.)

modgrammar.REF(ref_name, module=DEFAULT, default=None)

Create a reference to a grammar named ref_name, to be resolved later.

This can either be resolved by calling grammar_resolve_refs() prior to parsing, or, alternately, modgrammar will automatically attempt to resolve any REF() whenever it is used in parsing, and will treat it the same as if it were actually an occurrence of the resolved grammar.

By default, resolving a reference involves searching for a grammar class with the same name in the same python module. The python module is determined based on the location where the REF() call occurred. If you wish to use a different module to look for the grammar this REF() refers to, it can be provided in the module parameter. If module is given as None, then no module will be searched.

If provided, default should contain a grammar which will be used if the given reference cannot be resolved.

Constants

The following default regular expressions are provided for use with the grammar_whitespace attribute:

modgrammar.WS_DEFAULT

Will match any series of one or more whitespace characters (default)

modgrammar.WS_NOEOL

Will match any series of whitespace characters except EOL characters

(For more information on what constitutes whitespace and EOL characters, see Whitespace and Newline Handling)

Exceptions

exception modgrammar.ParseError

Raised by the parser when the provided text cannot be matched against the grammar.

This exception has several useful attributes:
grammar

The top-level grammar the parser was attempting to match.

buffer

The contents of the text buffer the parser was attempting to match against the grammar.

pos

The position within the buffer at which the problem occurred.

char

The (total parsing) character position at which the problem occurred (similar to the GrammarParser.char attribute).

line

The line at which the problem occurred (similar to the GrammarParser.line attribute).

col

The column position within the line at which the problem occurred (similar to the GrammarParser.col attribute).

expected

A list of possible sub-grammars which the parser expected to find at this position (but didn’t).

message

The text message which would be printed if this exception were printed. (This is of the form “Expected ...: Found ...”)

exception modgrammar.GrammarDefError

This exception is raised when creating/defining new grammar classes if there is a problem with the definition which cannot be resolved.

exception modgrammar.ReferenceError

This is the base class for UnknownReferenceError and BadReferenceError. It can be used to easily catch either exception.

exception modgrammar.UnknownReferenceError

An attempt was made to resolve a REF() reference, but no grammar with the given name could be found, and no default was provided in the REF() declaration.

exception modgrammar.BadReferenceError

An attempt was made to resolve a REF() reference, and the reference name was resolved to an object, but the object is not a valid grammar object.

exception modgrammar.InternalError

This exception is raised by the parser if something happens which should never happen. It usually indicates that a grammar with a custom grammar_parse() definition has done something it shouldn’t.

Miscellaneous

modgrammar.generate_ebnf(grammar, **opts)

(generator function)

Take a given grammar and produce a description of the grammar in Extended Backus-Naur Form (EBNF). This generator produces fully-formatted output lines suitable for writing to a file, etc.

As there are a few different variants of EBNF in common use, as well as some aspects which could be considered a matter of preference when producing such descriptions, this function also accepts a variety of configurable options, specified as keyword parameters:

wrap (default 80)
Wrap the output text at wrap columns.
align (default True)
Align each entry so that all of the RHSes start on the same column.
indent (default True)
The number of characters that subsequent (wrapped) lines should be indented. Can be set to either a number or to True. If set to True, the indent will be auto-calculated to line up with the position of the RHS in the first line.
expand_terminals (default False)
If grammars have subgrammars, show their expansion even if grammar_terminal is true.
special_style (default “desc”)
How some grammars (which can’t be easily represented as EBNF) should be represented inside EBNF “special sequences”. Valid options are “desc” (use the (human-readable) grammar_desc text), “name” (just use the grammar’s name), or “python” (use a repr-like syntax similar to the python syntax used to create them).

Additional options may also be offered by certain individual grammars.

Whitespace and Newline Handling

Several grammar constructs, such as SPACE, and WHITESPACE (as well as the WS_DEFAULT and WS_NOEOL regular expressions made available for use with grammar_whitespace) make reference of “whitespace characters”. Modgrammar considers a “whitespace character” to be any character which is defined in Unicode to have the “whitespace” character property (Note: This is also consistent with the behavior of the “\s” regular expression escape sequence in Python). Specifically, this includes the following characters:

Character Other Names Description
\u0009 \t, ^I, HT horizontal tab
\u000A \n, ^J, LF line feed
\u000B \v, ^K, VT vertical tab
\u000C \f, ^L, FF form feed
\u000D \r, ^M, CR carriage return
\u0020   space
\u0085 NEL next line
\u00A0   no-break space
\u1680   ogham space mark
\u180E   mongolian vowel separator
\u2000   en quad
\u2001   em quad
\u2002   en space
\u2003   em space
\u2004   three-per-em space
\u2005   four-per-em space
\u2006   six-per-em space
\u2007   figure space
\u2008   punctuation space
\u2009   thin space
\u200A   hair space
\u2028   line separator
\u2029   paragraph separator
\u202F   narrow no-break space
\u205F   medium mathematical space
\u3000   ideographic space

Note that this is a fairly generous definition of whitespace, designed to be fully Unicode, cross-language, and cross-platform conformant. In some cases, however, this may not be desirable, as many standards specify a more limited set of characters which should be considered “whitespace”. In many cases, being more liberal on input (parsing) is not a problem, and can even be beneficial, but this should be evaluated on a case-by-case basis if you are writing a parser to match a particular standard definition. Note also that for some applications it may be useful to, for example, have the parser treat \u00A0 and \u202F (non-breaking spaces) as non-whitespace (so that they won’t be interpreted as token-separators), etc.

If you wish to have a different definition for “whitespace” in your grammar, you will need to change the value of grammar_whitespace to a custom regular expression which will match a string of whitespace characters according to whatever definition of “whitespace” you are using. It may also be necessary to redefine the WHITESPACE or SPACE tokens to match, if your grammar makes use of them (or just not use them).

It is important to note that if you change grammar_whitespace, this will not automatically affect any of the other whitespace-based grammars (such as WHITESPACE), which will still match the default definition of “whitespace”. If you wish those to change as well, you will need to define your own alternative versions and use those instead of the standard ones.

End-of-Line Characters

Additionally, for EOL, REST_OF_LINE, SPACE, and WS_NOEOL, Modgrammar considers certain sequences of whitespace to also be “end of line” indicators. An “end of line” sequence is any of the following:

Sequence Other Names Description
\u000D\u000A \r\n, CRLF carriage return + line feed
\u000A\u000D \n\r, LFCR line feed + carriage return
\u000A \n, ^J, LF line feed
\u000B \v, ^K, VT vertical tab
\u000C \f, ^L, FF form feed
\u000D \r, ^M, CR carriage return
\u0085 NEL next line
\u2028   line separator
\u2029   paragraph separator

Again, this is a fairly generous list of options, designed to support as wide a range of inputs as possible automatically and usually “do the right thing”. If, however, you need a different set of “end of line” sequences, you may need to define your own versions of some of the above constructs instead of using the default ones provided by Modgrammar.

Note that if you redefine EOL, etc, this will not change how the parser calculates line numbers when parsing input, so the line and col parser attributes (as well as line/column information in ParseErrors) will likely not match up with your language’s actual interpretation of where lines end (Unfortunately, there is currently no support for changing how the parser interprets line boundaries).