A Brief Tutorial

A Pratt parser revolves around objects called symbols. Each symbol has at least four components (see the sketch after this list):

  • An id to recognize what kind of construct or token it stands for.
  • A left binding power, short lbp, that basically describes its precedence in infix form (most notably, operator precedence).
  • A “null denotation”, short nud, which is a scary name for the action that should happen when the symbol is encountered as start of an expression (most notably, prefix operators).
  • A “left denotation”, short led, which is a scary name for the action that should happen when the symbol is encountered after the start of an expression (most notably, infix and postfix operators).
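
In code, you can picture a symbol as a small class along these lines (an illustrative sketch only, not the class this library actually generates; in particular, the exact signature of led is an assumption here):

# illustrative sketch of a symbol's shape, not the library's real class
class PlusSymbol:
    id = 'PLUS'   # which kind of token/construct it stands for
    lbp = 10      # left binding power: its precedence in infix position

    def nud(self):
        # null denotation: runs when '+' starts an expression
        # (a lone '+' can't, so this would raise a parse error)
        ...

    def led(self, left):
        # left denotation: runs when '+' appears after the start of an
        # expression; 'left' is the already-parsed first operand
        ...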

In the following examples, we will construct a parser for an increasingly complex language that has many features of modern programming languages. We will start with a very simple grammar and input which seems quite obvious to parse. But don’t be fooled - it cannot be parsed right away by some popular and useful parsing algorithms. Recursive descent parsers, for instance, would fall into infinite recursion trying to evaluate the leftmost option (expr) first; see the sketch below the grammar. Other parsers, such as LALR parsers, have no problem with left recursion and work really fast, but are impractical to write by hand and usually can’t give good error messages.

First Example

Here’s our initial grammar and input:

expr ::= expr '+' expr | INT
1 + 2 + 3 + 4
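
To see why recursive descent chokes on this grammar, consider a direct transcription of the rule into a parsing function (deliberately naive, purely to illustrate the infinite recursion mentioned above):

# naive recursive descent for: expr ::= expr '+' expr | INT
def parse_expr(tokens):
    # tries the leftmost alternative (expr) first: the function calls
    # itself without consuming a single token, so it never returns
    left = parse_expr(tokens)
    ...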

The Pratt parser, although it is for the most part a top-down parser and in places looks very much like a recursive descent parser, has the best of both worlds. It’s really simple to write by hand, and can therefore provide good error messages if written to do so. It doesn’t get caught up in left recursion, yet makes only a handful of function calls per token.

And best of all: you don’t need to repeatedly write the primitives (symbol management, the loop putting it all together in the end). You can simply use the helpers provided by this library and concentrate on the meat of the parser. You declare symbols and their binding power, write down semantic actions without the glue code that wires them in, and so on. You also get helpers for common kinds of actions, such as operators. The parser for the above grammar (just integers and addition) is as simple as this:

from prattparse import parser, symbol

# manages input stream and provides ways to advance it
# also has a table where we register all our symbols
p = parser.ParserSkeleton()
# provides helper functions to cut down boilerplate in action definitions
h = symbol.Helpers(p)

# define the binding power of tokens that should have a non-zero one
h.bp_spec('''
    PLUS 10
''')
# a literal is a symbol with a do-nothing .nud(), it simply returns itself
# and awaits to be picked up by higher-level constructs such as operators
h.literal('INT')
# a left-associative infix operator simply stores what's left of it (the
# first operand) in a member called .first, and then asks the parser to
# figure out its right (second) operand, which is an expression containing
# only infix symbols of a higher binding power.
h.infix('PLUS')

And that’s it. I don’t know about you, but I consider 8 lines of code (not counting comments and blank lines) pretty impressive for a parser. Using the parser and reporting errors is equally simple:

# yes, this is guaranteed to be safe in this particular case IF you
# import * only from this particular module
from prattparse.common import *

# ... parser definition, as above
# ... getting tokens, see next paragraph

p.input(ts)
try:
    print(p.expression())
except ParseError as e:
    print(e)

Note that Pratt parsers usually require a separate tokenizer/lexical analyzer/scanner to break the input string into tokens. Since the construction of lexers is outside the scope of this library, we’ll create the tokens by hand for the examples. (In a real project, you’ll naturally use a lexer generator such as PLY, or write a lexer by hand.) We’ll use this simple helper function to construct symbols directly without having an input string and a tokenizer:

def S(sym_id, val=None):
    s = h.sym_obj(sym_id) # create instance of the class with the given id
    if val is not None:
        s.first = val
    return s

Using this, we convert the above input to this:

[S('INT', 1), S('PLUS'), S('INT', 2), S('PLUS'), S('INT', 3), S('PLUS'), S('INT', 4)]

Feeding this to the parser, the string conversion of the standard symbol class makes the output look like this:

<Symbol PLUS <Symbol PLUS <Symbol PLUS <Symbol INT 1>, <Symbol INT 2>>, <Symbol INT 3>>, <Symbol INT 4>>

Note how the operator is correctly treated as left-associative. Now, removing the last symbol makes the input invalid. The parser recognizes this and prints an error message:
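
Concretely, with the S helper and the driver from above, the truncated input is run like this:

p.input([S('INT', 1), S('PLUS'), S('INT', 2), S('PLUS'), S('INT', 3), S('PLUS')])
try:
    print(p.expression())
except ParseError as e:
    print(e)  # prints the message shown below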

Parse error in '<unknown source>' at ?:?: Unexpected end of token stream

Likewise, omitting an INT between two PLUS tokens leads to this message:

Parse error in '<unknown source>' at ?:?: Unexpected PLUS

Granted, those aren’t the most useful messages in the world. But you can customize what gets printed instead of the symbol id, so the second variant can become quite useful in general. That’s why you are advised to:

  1. Add a custom end-of-input marker token, as that leads to a message like the second rather than the arcane first one.
  2. Provide a dictionary to give the user neater messages like “Unexpected end of file” instead of “Unexpected EOF”.

More on customizing the default error messages (and adding completely new error messages) later.
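
In the meantime, purely to illustrate the second point above: such a mapping is conceptually just an id-to-text dictionary. How it is registered with the parser is covered in that later section, so the name below is hypothetical:

# hypothetical name; the real registration hook is described later
friendly_names = {
    'EOF': 'end of file',
    'INT': 'integer literal',
    'PLUS': "operator '+'",
}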

More Levels Of Precedence

The first example works well, but that shouldn’t be a surprise, as the grammar was very simple. So simple, in fact, that one could have parsed it with a regular expression. Let’s move on to something bigger - we’ll add:

  • A multiplication operator, which should have higher precedence than addition.
  • A power/exponentiation operator with even higher precedence, which is also right-associative.
  • Parentheses to break out of the standard precedence rules and give a subexpression the highest precedence.

Doing this in a recursive descent parser would be cumbersome, as we’d have to rewrite the obvious grammar structure to something more complicated to avoid left recursion and handle precedence correctly. With a Pratt parser, it’s incredibly simple. We add binding power declarations for the new tokens, define their actions in the same fashion as with our existing operator, and are done:

h.bp_spec('''
    LPAREN 100
    POW 30
    MUL 20
    ADD 10
''')
# like infix, but for multiple operators at once
h.infixes('ADD MUL')
# like infix, only right-associative
h.infix_r('POW')
h.literal('INT')
# declare that it's there, but don't give it any actions so its presence
# is usually considered an error
p.symbol('RPAREN')

# attach the following function to the LPAREN symbol as .nud() method
# i.e. call this when encountering an '('
@h.nud_of('LPAREN')
def p_lparen(self):
    # get expression inside - implicit: stop at symbol with a binding power
    # of 0, e.g. RPAREN
    expr = p.expression()
    # check if the upcoming symbol is an RPAREN symbol
    # if so, just throw it away - otherwise, raise a parse error
    p.advance('RPAREN')
    # return inner expression without wrapping it into anything
    # (parens are ignored in the AST)
    return expr

Feeding it input like 5 * 2**3**(2 + 2) gets the precedence and associativity right:

# input: [S('INT', 5), S('MUL'), S('INT', 2), S('POW'), S('INT', 3), S('POW'), S('LPAREN'), S('INT', 2), S('ADD'), S('INT', 2), S('RPAREN')]
<Symbol MUL <Symbol INT 5>, <Symbol POW <Symbol INT 2>, <Symbol POW <Symbol INT 3>, <Symbol ADD <Symbol INT 2>, <Symbol INT 2>>>>>

This parser can probably be written from scratch, by hand, in less than a hundred lines without resorting to dirty tricks. With the tools of this library, we needed a laughable 18 lines, and we hardly had to think to make it respect precedence. The line ratio isn’t nearly as extreme when more complex constructs are added, as those need custom actions anyway. But at least you don’t have to write the repetitive helper functions.

Adding Statements

Many, many programming languages also have a notion of statements: constructs which encompass at least one logical line and don’t have a value. Most such statements are introduced by specific tokens that won’t appear at the start of expressions. Therefore, a wise man (Douglas Crockford) decided to add a third action to tokens and use that to identify statements. Since this is a good idea, we’ll do the same. We could try to do this with nud as well, but the separate action lets us reject many cases of statements used as expressions right away: when an expression is expected but the upcoming token indicates a statement, we get the appropriate “Unexpected (return/def/whatever)”.

As having statements at all is an extension of the basic algorithm that’s not always needed, support for it lives in its own module. This module defines the additional methods and exports its own ParserSkeleton and Helpers classes that combine them with the basic classes from prattparse.parser and prattparse.symbol. There’s also a new default symbol class which extends the basic one with the parts needed by statement parsers. Therefore, our imports change:

# old
from prattparse import parser, symbol
# ... use parser.ParserSkeleton and symbol.Helpers

# new
from prattparse import statement
# use statement.ParserSkeleton and statement.Helpers

There are only a few additions, and they are easily memorized:

  • Symbols must have a std member, either a zero-argument action (like nud) or NotImplemented to indicate that this token doesn’t start a statement.
  • The symbol returned from statement() must also have a boolean need_terminator member (since arbitrary expressions can be statements, you’ll need this on pretty much every symbol).
  • The parser’s constructor now requires a positional argument: the id of the symbol that serves as the statement terminator.
  • The parser has a statement() method, which parses a statement or an expression (implicitly followed by a statement terminator if result.need_terminator is true); see the sketch after this list.
  • The helper provides a std_of(id) decorator very similar to nud_of and led_of.
  • The helper also provides the no_terminator declaration, which takes a string of whitespace-separated ids and sets the associated classes’ need_terminator to false. This is mostly useful for statements containing blocks of statements, as blocks are usually delimited by their own tokens, not terminated like single statements.
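
Putting the pieces together, statement() conceptually works like the following sketch (an illustration of the behaviour described above, not the library’s actual implementation; next_token is an assumed name for “consume the current symbol”):

# conceptual sketch of statement(), not the library's real code
def statement(p):
    sym = p.current
    if sym.std is not NotImplemented:
        p.next_token()           # consume the introducing token (e.g. DEF)
        result = sym.std()       # run the statement action
    else:
        result = p.expression()  # an ordinary expression used as a statement
    if result.need_terminator:
        p.advance('NEWLINE')     # the terminator id the parser was given
    return result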

As an example, we’ll add two statements to the above language: def to define functions and return. For simplicity, we’ll assume all functions take a single argument. The action definitions go like this:

from prattparse import statement
from prattparse.common import *

p = statement.ParserSkeleton('NEWLINE')
h = statement.Helpers(p)

# ... all the definitions from above

h.single_expr_stmt('RETURN') # store single expression in self.first
h.no_terminator('DEF')
h.literal('IDENTIFIER')
# markers, required for structure but without their own actions
h.symbol_decls('BLOCK INDENT DEDENT')

# a "free" action, to be called from other actions but not by the parser itself
def p_block():
    self = h.sym_obj('BLOCK')
    self.first = []
    p.advance('INDENT')
    while p.current.id != 'DEDENT':
        self.first.append(p.statement())
    p.advance('DEDENT')
    return self

@h.std_of('DEF')
def p_def(self):
    # the DEF is already consumed at this point
    # the current token is the one following the DEF
    self.first = p.advance('IDENTIFIER') # function name
    p.advance('LPAREN')
    self.second = p.advance('IDENTIFIER') # parameter name
    p.advance('RPAREN')
    self.third = p_block() # function body
    return self

This input (note the INDENT/DEDENT pair around the function body and the NEWLINE terminating the return statement):

[S('DEF'), S('IDENTIFIER', "f"), S('LPAREN'), S('IDENTIFIER', "x"), S('RPAREN'), S('INDENT'), S('RETURN'), S('IDENTIFIER', "x"), S('ADD'), S('IDENTIFIER', "x"), S('NEWLINE'), S('DEDENT')]

leads to output like this:

<Symbol DEF <Symbol IDENTIFIER 'f'>, <Symbol IDENTIFIER 'x'>, <Symbol BLOCK [<Symbol RETURN <Symbol ADD <Symbol IDENTIFIER 'x'>, <Symbol IDENTIFIER 'x'>>>]>>
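
For completeness, driving the statement parser looks just like the expression case, except that the entry point is p.statement() (with tokens bound to the input list above):

p.input(tokens)
try:
    print(p.statement())
except ParseError as e:
    print(e)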
