Tutorial I: A guided JSON parser:

The JSON spec (http://json.org) can easily be described using pyrser.

1- Starting block

To describe a file format you just need to write a class that inherits from pyrser.grammar.Grammar:

from pyrser import grammar

class JSON(grammar.Grammar):
    """Our future JSON parser"""
    pass

This empty class is just a container for the BNF description of our file format. Pyrser uses just two class variables to hold the BNF:

  • grammar: a docstring containing the BNF description.
  • entry: the name of the rule to use as entry point.

2- Translate BNF

In the JSON spec, the first rule, object, is described as:

object
    {}
    { members }

This describes a rule named object as members surrounded by braces; members may be absent. That BNF could be translated literally as:

from pyrser import grammar

class JSON(grammar.Grammar):
    """Our JSON parser"""
    entry = "object"
    grammar = """
        object = [
            '{' '}'
            | '{' members '}'
        ]
    """

But Pyrser’s BNF syntax provides the repeater ?, which lets you describe object more concisely:

grammar = """
    object = [ '{' members? '}' ]
"""

Using the syntax described in Writing a BNF grammar, we can translate the whole grammar.

3- BNF of list

In the JSON spec, a common pattern is used to describe lists of items. For example:

members
    pair
    pair , members

elements
    value
    value , elements

These rules use right recursion to build a list of items. Pyrser’s parsing engine is based on PEG (Parsing Expression Grammar), so it is better to describe a list with the repeaters + or *.
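
The difference between the two styles can be sketched in plain Python, independently of pyrser. The helper names below (parse_value, elements_recursive, elements_repeated) are hypothetical and only model a list of single-digit values separated by commas; they are not part of pyrser.

```python
def parse_value(text, pos):
    """Hypothetical stand-in for the 'value' rule: reads one digit."""
    if pos < len(text) and text[pos].isdigit():
        return int(text[pos]), pos + 1
    return None, pos


def elements_recursive(text, pos=0):
    """elements = value | value ',' elements (right recursion)."""
    value, pos = parse_value(text, pos)
    if value is None:
        return None, pos
    if pos < len(text) and text[pos] == ",":
        rest, end = elements_recursive(text, pos + 1)
        if rest is not None:
            return [value] + rest, end
    return [value], pos


def elements_repeated(text, pos=0):
    """elements = value [',' value]* (repeater, written as a loop)."""
    value, pos = parse_value(text, pos)
    if value is None:
        return None, pos
    items = [value]
    while pos < len(text) and text[pos] == ",":
        value, after = parse_value(text, pos + 1)
        if value is None:
            break  # leave the dangling ',' unconsumed
        items.append(value)
        pos = after
    return items, pos


print(elements_recursive("1,2,3"))
print(elements_repeated("1,2,3"))
```

Both functions return the same list, but the recursive one adds a call frame per item, while the loop collects every item in a single place. That single place is also where a hook can later be attached.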

4- Basic JSON Parser

With this advice, we can translate the whole BNF:

from pyrser import grammar
from pyrser.directives import ignore

class JSON(grammar.Grammar):
    """Our JSON parser"""
    entry = "json"
    grammar = """
        json = [ object eof ]

        object = [ '{' members? '}' ]

        members = [ pair [',' pair]* ]

        pair = [ string ':' value ]

        value =
        [
            string
            | number
            | object
            | array
            | "true"
            | "false"
            | "null"
        ]

        array = [ '[' elements? ']' ]

        elements = [ value [',' value]* ]

        number = [ @ignore("null") [int frac? exp?] ]

        int = [ '-'?
            [
                digit1_9s
                | digit
            ]
        ]

        frac = [ '.' digits ]

        exp = [ e digits ]

        digit = [ '0'..'9' ]

        digit1_9 = [ '1'..'9' ]

        digits = [ digit+ ]

        digit1_9s = [ digit1_9 digits ]

        e = [ ['e'|'E'] ['+'|'-']? ]
    """

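note 1: Notice the use of @ignore("null") in the number rule. This directive allows you to change the ignore convention, that is, what the parser skips between rule elements. Here "null" means that nothing is skipped, so the parts of a number must be contiguous.

See Setting Directives: Module directives for more information about directives.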
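
To see why the number rule changes the ignore convention, here is a rough model in plain Python that uses regular expressions rather than pyrser. It assumes that the default convention skips blanks between rule elements and that "null" skips nothing; the names LOOSE, STRICT and match_number are made up for this sketch.

```python
import re

# Parts of a JSON number, mirroring the int, frac and exp rules.
INT = r"-?(?:[1-9][0-9]+|[0-9])"
FRAC = r"\.[0-9]+"
EXP = r"[eE][+-]?[0-9]+"

# Blanks allowed between the parts (models a blank-skipping convention).
LOOSE = re.compile(rf"{INT}\s*(?:{FRAC})?\s*(?:{EXP})?")
# Nothing allowed between the parts (the effect assumed for @ignore("null")).
STRICT = re.compile(rf"{INT}(?:{FRAC})?(?:{EXP})?")


def match_number(pattern, text):
    """Return the text matched at the start of the input, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


print(repr(match_number(LOOSE, "12 .5")))   # the blank is swallowed
print(repr(match_number(STRICT, "12 .5")))  # stops after the integer part
```

With the loose pattern, "12 .5" is accepted as a single number, which is not what the JSON spec describes; the strict pattern stops at "12".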

note 2: We don’t define the string and eof rules because they are default rules inherited from the Grammar base class.

See Base of all parser for more information about what is provided by default and how composition works.

5- Building an AST

The aim of parsing is to translate a textual representation of information into a data structure, or AST (Abstract Syntax Tree): a tree built to represent all the abstractions provided by the syntax. Here we need to translate JSON into Python objects. To do this, we want to fetch data during parsing and create objects on the fly by calling chunks of Python code.

Pyrser provides two mechanisms for this:

  • hooks for event handling
  • nodes for data handling

Let’s focus on the number rule. We want to capture the number and convert it to a float.

nodes

To capture the result of a rule, just suffix it with ‘:’ and a name:

"""
...
    number = [ @ignore("null") [int frac? exp?]:n ]
...
"""

This will create a new node named n.

hooks

To do something with n, pass it to some Python code through a hook named is_num. Just call the hook after the number has been read:

"""
...
    number = [ @ignore("null") [int frac? exp?]:n #is_num(n) ]
...
"""

By default is_num is an unknown hook. Let’s declare it with the following syntax:

from pyrser import meta

@meta.hook(JSON)
def is_num(self, arg):
    print(self.value(arg))
    return True

note: A hook is just a function with a special decorator:

  • The function takes at least one parameter, self. This is the parser instance (here, your JSON instance).
  • arg is the captured node (an instance of pyrser.parsing.node.Node).

We can fetch the captured text (parsed by [int frac? exp?]) by calling self.value on arg.

note: A hook must return True for parsing to continue. You can stop parsing by returning False, which causes a parse error.

See Setting Hooks: Module hooks for more information about hooks.

See Building AST (Abstract Syntax Tree): Module node for more information about nodes.
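
The True/False contract of a hook can be illustrated with a toy model in plain Python. Everything below (ToyParser, hook, run_hook) is hypothetical and written only for this sketch; it is not how pyrser implements meta.hook, it just mimics the idea of registering a function under its name and letting its return value decide whether parsing goes on.

```python
class ToyParser:
    """Hypothetical stand-in for a grammar class; not pyrser's API."""
    hooks = {}


def hook(cls):
    """Register the decorated function as a named hook on cls."""
    def register(func):
        cls.hooks[func.__name__] = func
        return func
    return register


@hook(ToyParser)
def is_num(self, text):
    # Accept only text that float() can convert.
    try:
        float(text)
    except ValueError:
        return False  # False means: stop, this is a parse error
    return True       # True means: parsing may continue


def run_hook(parser, name, *args):
    """Call a hook by name and report whether parsing may continue."""
    return parser.hooks[name](parser, *args)


print(run_hook(ToyParser(), "is_num", "-2e+2"))
print(run_hook(ToyParser(), "is_num", "abc"))
```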

return values

So we can capture data from the input and do something with it. But how do we return our results to the caller? For this, we must use the special node named _, which is bound to the node resulting from the rule. So we must patch our number rule and the is_num hook like this:

...
"""
    ...
        number = [ @ignore("null") [int frac? exp?]:n #is_num(_, n) ]
    ...
"""
...

_ is received by the is_num function as a parameter. You can’t modify it directly. To return something with it, you must create an arbitrary attribute on it to carry the output:

from pyrser import meta

@meta.hook(JSON)
def is_num(self, ast, arg):
    # the attribute name 'node' is arbitrary
    ast.node = float(self.value(arg))
    return True

note: The float constructor directly interprets the text returned by self.value(arg), such as 1.0 or -2e+2, to create a float object.
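
A quick check of that conversion in plain Python, with a few sample texts chosen for this sketch:

```python
# The same conversion is_num performs on the captured text.
samples = ["1.0", "-2e+2", "12", "0.5E-1"]
converted = [float(text) for text in samples]
print(converted)  # [1.0, -200.0, 12.0, 0.05]
```

Note that an integer such as 12 also becomes a float (12.0), since is_num converts every number the same way.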

We can proceed like this for all trivial values.

Sometimes we only want to pass the result of a subrule on as the result of the current rule. For this, just use the bind operator :>, which connects the output to an existing node:

...
"""
    ...
        value =
        [
            [number | object | array]:>_
            ...
        ]
    ...
"""
...

handling arrays

Let’s focus on a more complex case, the array rule:

array = [ '[' elements? ']' ]

elements = [ value [',' value]* ]

This kind of rule is not well suited to a PEG parser. It’s better to have the resulting node (array) and the list of items (the values) in the same rule. We can merge these two rules into one:

array = [ '[' [value [',' value]* ]? ']' ]

In this form, it’s easier to see where to put a hook that creates a Python list, and where to put a hook that adds an item to this list:

array = [ '[' #is_array(_) [value:v #add_item(_, v) [',' value:v #add_item(_, v) ]* ]? ']' ]

With the following hooks:

@meta.hook(JSON)
def is_array(self, ast):
    ast.node = []
    return True

@meta.hook(JSON)
def add_item(self, ast, item):
    ast.node.append(item.node)
    return True

We can proceed in the same way for the object rule.
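
To make the order of the hook calls concrete, here is a plain-Python simulation of what the array rule triggers for the input [1, 2, 3]. The Node class and make_value helper are hypothetical stand-ins written for this sketch, and the hooks are reduced to ordinary functions without self.

```python
class Node:
    """Hypothetical minimal node: an object accepting arbitrary attributes."""


def is_array(ast):
    ast.node = []
    return True


def add_item(ast, item):
    ast.node.append(item.node)
    return True


def make_value(python_value):
    """Build the node that a 'value' rule would return for one item."""
    node = Node()
    node.node = python_value
    return node


# Hook calls in the order the array rule triggers them for "[1, 2, 3]".
result = Node()                      # plays the role of _
is_array(result)                     # right after '['
for v in (1.0, 2.0, 3.0):
    add_item(result, make_value(v))  # after each value:v
print(result.node)  # [1.0, 2.0, 3.0]
```

Because is_array is called right after '[' and outside the optional group, an empty array still yields an empty list.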

6- Final JSON parser

A complete grammar for a JSON parser looks like this:

from pyrser import grammar, meta
from pyrser.directives import ignore

class JSON(grammar.Grammar):
    """Pyrser JSON parser"""
    entry = "json"
    grammar = """
    json = [ object:>_ eof ]

    object =
    [
        '{' #is_dict(_) [pair:p #add_kv(_, p) [',' pair:p #add_kv(_, p) ]*]? '}'
    ]

    pair = [ string:s ':' value:v #is_pair(_, s, v) ]

    value =
    [

        [number | object | array]:>_
        | [
            string:s #is_str(_, s)
            | "true":t #is_bool(_, t)
            | "false":f #is_bool(_, f)
            | "null" #is_none(_)
        ]

    ]

    array =
    [
        '[' #is_array(_) [value:v #add_item(_, v) [',' value:v #add_item(_, v)]*]? ']'
    ]

    number = [ @ignore("null") [int frac? exp?]:n #is_num(_, n) ]

    int =
    [
        '-'?
        [
            digit1_9s
            | digit
        ]
    ]

    frac = [ '.' digits ]

    exp = [ e digits ]

    digit = [ '0'..'9' ]

    digit1_9 = [ '1'..'9' ]

    digits = [ digit+ ]

    digit1_9s = [ digit1_9 digits ]

    e = [ ['e'|'E'] ['+'|'-']? ]

    """

@meta.hook(JSON)
def is_num(self, ast, n):
    ast.node = float(self.value(n))
    return True

@meta.hook(JSON)
def is_str(self, ast, s):
    ast.node = self.value(s).strip('"')
    return True

@meta.hook(JSON)
def is_bool(self, ast, b):
    bval = self.value(b)
    if bval == "true":
        ast.node = True
    if bval == "false":
        ast.node = False
    return True

@meta.hook(JSON)
def is_none(self, ast):
    ast.node = None
    return True

@meta.hook(JSON)
def is_pair(self, ast, s, v):
    ast.node = (self.value(s).strip('"'), v.node)
    return True

@meta.hook(JSON)
def is_array(self, ast):
    ast.node = []
    return True

@meta.hook(JSON)
def add_item(self, ast, item):
    ast.node.append(item.node)
    return True

@meta.hook(JSON)
def is_dict(self, ast):
    ast.node = {}
    return True

@meta.hook(JSON)
def add_kv(self, ast, item):
    ast.node[item.node[0]] = item.node[1]
    return True
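
One detail of is_str and is_pair is worth a quick check: str.strip('"') removes every leading and trailing quote character, not just one pair. The sketch below assumes that the captured text includes the surrounding quotes (as the strip call suggests) and that the string rule accepts escaped quotes; the sample text is made up for the illustration.

```python
# Captured text of a JSON string that ends with an escaped quote.
captured = r'"say \"hi\""'

stripped = captured.strip('"')  # drops all quotes at both ends
sliced = captured[1:-1]         # drops exactly one character per end

print(stripped)
print(sliced)
```

Slicing keeps the final escaped quote, while strip loses it. Neither form decodes escape sequences; that would need extra code in the hook.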

7- Parser in action

Using the JSON class is really easy.

Instantiate it and use the parse method (or parse_file) to parse some content:

json = JSON()
res = json.parse("""
    {
        "test" : 12,
        "puf" : [1, 2, 3]
    }
""")
if res.node['puf'][1] == 2:
    print("OK")
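
To check what res.node should contain, we could compare it with the standard library’s json module on the same text. This is only a cross-check, assuming the hooks behave as written above; parse_int=float is used because is_num converts every number with float().

```python
import json

text = """
    {
        "test" : 12,
        "puf" : [1, 2, 3]
    }
"""
# parse_int=float mirrors is_num, which turns every number into a float.
expected = json.loads(text, parse_int=float)
print(expected)  # {'test': 12.0, 'puf': [1.0, 2.0, 3.0]}
```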

You can also put your whole grammar into a BNF file (here json.bnf) and use the from_file function to create the JSON class:

from pyrser import grammar
JSON = grammar.from_file("json.bnf")

See Grammar Base Class: Module grammar for more information about the ways of creating a grammar.

See Error handling: Module error if something goes wrong in your grammar.