================
A Brief Tutorial
================

A Pratt parser is concerned with many objects called *symbols*. Each symbol
has at least four components:

- An *id* to recognize what kind of construct or token it stands for.
- A left binding power, short *lbp*, that basically describes its precedence
  in infix position (most notably, operator precedence).
- A "null denotation", short *nud*, which is a scary name for the action
  that should happen when the symbol is encountered at the start of an
  expression (most notably, prefix operators).
- A "left denotation", short *led*, which is a scary name for the action
  that should happen when the symbol is encountered after the start of an
  expression (most notably, infix and postfix operators).

In the following examples, we will construct a parser for an increasingly
complex language that has many features of modern programming languages. We
will start with a very simple grammar and input which seems quite obvious to
parse. But don't be fooled - it cannot be parsed right away by some popular
and useful parsing algorithms. Recursive descent parsers, for instance,
would fall into infinite recursion trying to evaluate the leftmost option
(expr) first. Other parsers, such as LALR parsers, have no problem with left
recursion and work really fast, but are impractical to write by hand and
usually can't give good error messages.

First Example
=============

Here's our initial grammar and input::

    expr ::= expr '+' expr | INT

    1 + 2 + 3 + 4

The Pratt parser, while pretty much a top-down parser and in part looking
very much like a recursive descent parser, has the best of both worlds. It's
really simple to write by hand and can therefore provide good error messages
if written to do so. It doesn't get caught up in left recursion, and it only
makes a handful of function calls per token. And best of all: you don't need
to repeatedly write the primitives (symbol management, the loop putting it
all together in the end). You can simply use the helpers provided by this
library and concentrate on the meat of the parser. You *declare* symbols and
their binding power, write down semantic actions without the glue code that
adds them to the mixture, etc. You also get helpers that generate common
kinds of actions, such as operators.

The parser for the above grammar (just integers and addition) is as simple
as this::

    from prattparse import parser, symbol

    # manages the input stream and provides ways to advance it
    # also has a table where we register all our symbols
    p = parser.ParserSkeleton()
    # provides helper functions to cut down boilerplate in action definitions
    h = symbol.Helpers(p)

    # define the binding power of tokens that should have a non-zero one
    h.bp_spec('''
        PLUS 10
    ''')

    # a literal is a symbol with a do-nothing .nud(); it simply returns
    # itself and waits to be picked up by higher-level constructs such as
    # operators
    h.literal('INT')
    # a left-associative infix operator simply stores what's left of it (the
    # first operand) in a member called .first, and then asks the parser to
    # figure out its right (second) operand, which is an expression
    # containing only infix symbols of a higher binding power
    h.infix('PLUS')

**And that's it.** I don't know about you, but I consider 8 lines of code
(not counting comments and blank lines) pretty impressive for a parser.
Using the parser and reporting errors is equally simple::

    # yes, this is guaranteed to be safe in this particular case IF you
    # import * only from this particular module
    from prattparse.common import *

    # ... parser definition, as above
    # ... getting tokens, see next paragraph

    p.input(ts)
    try:
        print(p.expression())
    except ParseError as e:
        print(e)
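If you're wondering how those declarations come together at parse time: the
heart of every Pratt parser is one short loop that alternates between
``nud`` and ``led`` calls, guided by binding powers. The following
stand-alone sketch illustrates the idea in plain Python - it is *not*
prattparse's actual implementation, just the textbook algorithm in
miniature::

    # a minimal, self-contained Pratt loop (illustration only)
    class Sym:
        def __init__(self, id, lbp=0, value=None):
            self.id, self.lbp, self.value = id, lbp, value

        def nud(self):                    # start of an expression: a literal
            return self.value

        def led(self, left, expression):  # after the start: infix operator
            # parsing the right side with rbp == lbp makes it left-associative
            return (self.id, left, expression(self.lbp))

    def make_expression(tokens):
        it = iter(tokens)
        current = next(it)

        def expression(rbp=0):
            nonlocal current
            t, current = current, next(it)
            left = t.nud()
            # keep absorbing operators while they bind tighter than rbp
            while rbp < current.lbp:
                t, current = current, next(it)
                left = t.led(left, expression)
            return left

        return expression

    toks = [Sym('INT', value=1), Sym('PLUS', 10), Sym('INT', value=2),
            Sym('PLUS', 10), Sym('INT', value=3), Sym('END')]
    print(make_expression(toks)(0))  # ('PLUS', ('PLUS', 1, 2), 3)

Everything the library generates for you - literals, infix operators and so
on - ultimately plugs into a loop of this shape.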
Note that Pratt parsers usually require a separate tokenizer/lexical
analyzer/scanner to break the input string into tokens. Since the
construction of lexers is outside the scope of this library, we'll create
the tokens by hand for the examples. (In a real project, you'll naturally
use a lexer generator such as PLY, or write a lexer by hand.) We'll use this
simple helper function to construct symbols directly, without having an
input string and a tokenizer::

    def S(sym_id, val=None):
        s = h.sym_obj(sym_id)  # create instance of the class with the given id
        if val is not None:
            s.first = val
        return s

Using this, we convert the above input to this::

    [S('INT', 1), S('PLUS'), S('INT', 2), S('PLUS'),
     S('INT', 3), S('PLUS'), S('INT', 4)]

Running the parser over this first example input, the string conversion of
the standard symbol class makes the output look like this::

    <PLUS <PLUS <PLUS <INT 1>, <INT 2>>, <INT 3>>, <INT 4>>

Note how the operator is correctly treated as left-associative.

Now, removing the last symbol makes the input invalid. The parser recognizes
this, and prints an error message::

    Parse error in '' at ?:?: Unexpected end of token stream

Likewise, omitting an INT between two PLUS tokens leads to this message::

    Parse error in '' at ?:?: Unexpected PLUS

Granted, those aren't the most useful messages in the world. But you can
customize what gets printed instead of the symbol id, so the second variant
can become quite useful in general. That's why you are advised to:

1. Add a custom end-of-input marker token, as that leads to a message like
   the second rather than the arcane first one (see the sketch below).
2. Provide a dictionary to give the user neater messages like "Unexpected
   end of file" instead of "Unexpected EOF".

More on customizing the default error messages (and adding completely new
error messages) later.
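For instance, adding an end-of-input marker to our hand-built token stream
only takes a declaration and one extra token; ``EOF`` is merely an id we
pick here, any otherwise unused id works::

    # declare the marker, but give it no actions and no binding power, so
    # running into it inside an expression raises the usual parse error
    p.symbol('EOF')

    # the truncated, invalid input from above, now explicitly terminated
    p.input([S('INT', 1), S('PLUS'), S('EOF')])
    try:
        print(p.expression())
    except ParseError as e:
        print(e)  # should now complain "Unexpected EOF" instead of the
                  # arcane "Unexpected end of token stream"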
More Levels Of Precedence
=========================

The first example works well, but that shouldn't be a surprise, as the
grammar was *very* simple. So simple, in fact, that one could have parsed it
with a regular expression. Let's move on to something bigger - we'll add:

- A multiplication operator, which should have higher precedence than
  addition.
- A power/exponentiation operator with even higher precedence, which is
  also right-associative.
- Parentheses to break out of the standard precedence rules and give a
  subexpression highest precedence.

Doing this in a recursive descent parser would be cumbersome, as we'd have
to rewrite the obvious grammar structure into something more complicated to
avoid left recursion and handle precedence correctly. With a Pratt parser,
it's incredibly simple. We add binding power declarations for the new
tokens, define their actions in the same fashion as with our existing
operator, and are done::

    h.bp_spec('''
        LPAREN 100
        POW 30
        MUL 20
        PLUS 10
    ''')

    # like infix, but for multiple operators at once
    h.infixes('PLUS MUL')
    # like infix, only right-associative
    h.infix_r('POW')
    h.literal('INT')

    # declare that it's there, but don't give it any actions so its presence
    # is usually considered an error
    p.symbol('RPAREN')

    # attach the following function to the LPAREN symbol as .nud() method
    # i.e. call this when encountering an '('
    @h.nud_of('LPAREN')
    def p_lparen(self):
        # get expression inside - implicit: stop at symbol with a binding
        # power of 0, e.g. RPAREN
        expr = p.expression()
        # check if the upcoming symbol is an RPAREN symbol
        # if so, just throw it away - otherwise, raise a parse error
        p.advance('RPAREN')
        # return inner expression without wrapping it into anything
        # (parens are ignored in the AST)
        return expr

Feeding it input like ``5 * 2**3**(2 + 2)`` gets the precedence and
associativity right::

    # input:
    [S('INT', 5), S('MUL'), S('INT', 2), S('POW'), S('INT', 3), S('POW'),
     S('LPAREN'), S('INT', 2), S('PLUS'), S('INT', 2), S('RPAREN')]

    <MUL <INT 5>, <POW <INT 2>, <POW <INT 3>, <PLUS <INT 2>, <INT 2>>>>>

This parser could probably be written from scratch, by hand, in fewer than
a hundred lines without resorting to dirty tricks. With the tools of this
library, we needed a laughable 18 lines, and we barely had to think to make
it respect precedence. The line ratio isn't nearly as extreme when more
complex constructs are added, which need custom actions anyway. But at
least you don't have to write the repetitive helper functions.

Adding Statements
=================

Many, many programming languages also have a notion of *statements*,
constructs which encompass at least one logical line and don't have a value.
Most such statements are introduced by specific tokens that won't appear at
the start of expressions. Therefore, a wise man (Douglas Crockford) decided
to add a third action to tokens and use that to identify statements. Since
this is a good idea, we'll do the same. We *could* try to do this with
``nud`` as well, but this way we can reject many cases of statements used
as expressions right away. This way, when an expression is expected but the
upcoming token indicates a statement, we get the appropriate "Unexpected
(return/def/whatever)".

As having statements at all is an extension of the basic algorithm that's
not necessarily needed, support for it lives in its own module. This module
defines the additional methods needed for statement parsing. It also
exports its own ``ParserSkeleton`` and ``Helpers`` classes that combine the
new methods with the basic classes from ``prattparse.parser`` and
``prattparse.symbol``. There's also a new default symbol class which extends
the basic one with the parts needed by statement parsers. Therefore, our
imports change::

    # old
    from prattparse import parser, symbol
    # ... use parser.ParserSkeleton and symbol.Helpers

    # new
    from prattparse import statement
    # ... use statement.ParserSkeleton and statement.Helpers

There are only a few additions, and they are easily memorized (a sketch of
how they interact follows this list):

- Symbols must have a ``std`` member, either a zero-argument action (like
  ``nud``) or ``NotImplemented`` to indicate that this token doesn't start
  a statement.
- The symbol returned as result from ``statement`` must also have a boolean
  ``need_terminator`` (since arbitrary expressions can be statements,
  you'll need this on pretty much every symbol).
- The parser now requires a positional argument, the id of the symbol
  that's a statement terminator.
- The parser has a ``statement()`` method, which parses a statement or an
  expression (implicitly followed by a statement terminator if
  ``result.need_terminator`` is true).
- The helper provides a ``std_of(id)`` decorator very similar to ``nud_of``
  and ``led_of``.
- The helper also provides the ``no_terminator`` declaration, which takes a
  string of whitespace-separated ids and sets the associated classes'
  ``need_terminator`` to false. This is mostly useful for statements
  containing blocks of statements, as blocks usually are delimited by their
  own tokens, not terminated like single statements.
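Put together, ``statement()`` conceptually behaves like the sketch below.
This is *not* the module's actual code - it merely restates the rules above
in plain Python, pretending the terminator id passed to ``ParserSkeleton``
is available as ``terminator``::

    def statement():
        # a token with a statement action introduces a statement
        if p.current.std is not NotImplemented:
            sym = p.current
            p.advance(sym.id)        # consume the introducing token, so it
                                     # is already gone inside the action
            result = sym.std()
        else:
            # anything else is an expression used as a statement
            result = p.expression()
        if result.need_terminator:
            p.advance(terminator)    # e.g. a NEWLINE token
        return result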
As an example, we'll add two statements to the above language: ``def`` to
define functions and ``return``. For simplicity, we'll assume all functions
take a single argument. The action definitions go like this::

    from prattparse import statement
    from prattparse.common import *

    p = statement.ParserSkeleton('NEWLINE')
    h = statement.Helpers(p)

    # ... all the definitions from above

    h.single_expr_stmt('RETURN')  # store single expression in self.first
    h.no_terminator('DEF')
    h.literal('IDENTIFIER')
    # markers, required for structure but without their own actions
    h.symbol_decls('BLOCK INDENT DEDENT')

    # a "free" action, to be called from other actions but not by the
    # parser itself
    def p_block():
        self = h.sym_obj('BLOCK')
        self.first = []
        p.advance('INDENT')
        while p.current.id != 'DEDENT':
            self.first.append(p.statement())
        p.advance('DEDENT')
        return self

    @h.std_of('DEF')
    def p_def(self):
        # the DEF is already consumed at this point
        # the current token is the one following the DEF
        self.first = p.advance('IDENTIFIER')   # function name
        p.advance('LPAREN')
        self.second = p.advance('IDENTIFIER')  # parameter name
        p.advance('RPAREN')
        self.third = p_block()                 # function body
        return self

This input (note the INDENT/DEDENT pair around the function body and the
NEWLINE terminating the ``return`` statement)::

    [S('DEF'), S('IDENTIFIER', "f"), S('LPAREN'), S('IDENTIFIER', "x"),
     S('RPAREN'), S('INDENT'),
     S('RETURN'), S('IDENTIFIER', "x"), S('PLUS'), S('IDENTIFIER', "x"),
     S('NEWLINE'), S('DEDENT')]

leads to output like this::

    <DEF <IDENTIFIER f>, <IDENTIFIER x>, <BLOCK [<RETURN <PLUS <IDENTIFIER x>, <IDENTIFIER x>>>]>>
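Feeding it to the parser works just like before, except that we now call
``statement()`` instead of ``expression()``::

    p.input(ts)  # ts being the token list above
    try:
        print(p.statement())
    except ParseError as e:
        print(e)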