Fro 101¶
Parser
objects are the workhorses of the fro module, providing all of the module’s core parsing functionality.
However, there are a few subtleties to how they operate, which will be discussed in this section.
Parsing via chomping¶
Conceptually, when a Parser
object parses a string, it first attempts to “consume” an initial portion of the string,
and from it produce a value. We will refer to this process (consuming an initial portion of the string and producing a
value) as the parser “chomping” the string. The terminology evokes a useful mental image, and may also be part of an
elaborate scheme to make a Noam Chomsky pun.
If the parser is not successful in chomping the string, then it fails to parse it. Otherwise, the parser is successful
in parsing the string if (and only if) it consumes the entire string during chomping. This conceptual model will be
useful for understanding what happens when Parser
objects are combined into new parsers.
As an example, fro.intp
is a parser which consumes non-empty sequences of digits (and other things
you might expect to see in an integer, like a leading minus sign) and produces the corresponding int
values. When parsing the string "2358"
it will consume the initial portion "2358"
(which happens to be the entire string), and from it produces
the int
value 2358
. Since the parser consumes the entire string during chomping, the parse is successful.:
fro.intp.parse_str("2358") # chomps "2358" producing 2358, a successful parse!
When parsing the string "123abc"
, it will consume the initial portion "123"
, producing the int
value
123
. However, the parse will be unsuccessful since the remaining "abc"
was not consumed.:
fro.intp.parse_str("123abc") # chomps "123" producing 123, an unsuccessful parse
Finally, when parsing the string "abc123"
, it will not be able to consume any part of the string, since there is no
initial portion that contains only digits. Since fro.intp
only consumes non-empty sequences of digits, it cannot
consume an empty initial portion (which parsers are in general allowed to do).:
fro.intp.parse_str("abc123") # cannot chomp anything, an unsuccessful parse
Combining parsers: composition¶
With this model in place, it is much easier to make sense of what happens when you combine parsers together.
As a case study, let’s consider parser composition. Given two parsers p1
and p2
, the composition of p1
and
p2
is a new parser, separate from but dependent on p1
and p2
. When chomping a string, it first has p1
attempt to chomp. If p1
successfully chomps, then p2
chomps on the remaining, unconsumed portion of the string.
If either p1
or p2
is unable to chomp, the composition does not successfully chomp. If both p1
and p2
are able to chomp and produce values v1
and v2
respectively, then the composition consumes both of the portions
that p1
and p2
consumed and produces the tuple (v1, v2)
.
As an example, consider the following:
a_to_z = fro.rgx(r"[a-z]*") # consumes, and then produces, strings that match regex
composition_parser = fro.comp([fro.intp, az_parser]) # composition of fro.intp and a_to_z
composition_parser.parse_str("2357primes")
When composition_parser
tries to chomp "2357primes"
, first fro.intp
will consume "2357"
(of off
"2357primes"
) and produce 2357
, then az_parser
will consume "primes"
(of off "primes"
, which is
what remains after fro.intp
‘s chomp) and produces "primes"
. Therefore, composition_parser
consumes "2357primes"
and produces the tuple (2357, "primes")
. Since composition_parser
chomps the entire
string, it parses successfully.
As another example:
twoints_parser = fro.comp([fro.intp, fro.intp])
twoints_parser.parse_str("149") # what will happen??
When twoints_parser
tries to chomp "149"
, the first fro.intp
will consume "149"
and produce 149
.
However, there will nothing left for the second fro.intp
to consume, so it will not successfully chomp anything.
Since the second fro.intp
cannot chomp, the composition fails to chomp, and thus fails to parse.
This example highlights an important property of fro parsers: parsers chomp myopically. If the first fro.intp
in the
above example had known that it was followed by another fro.intp
, it perhaps could have only chomped the first two digits,
leaving the "9"
for the second fro.intp
, and the composition could have produced the tuple (14, 9)
. However,
fro.intp
is based on the regular expression [0-9]+
, which matches as many digits as possible, so it consumes
as many digits as possible while chomping.
Finally, you can compose more than two parsers together. Consider the following:
composition = fro.comp([fro.intp, fro.rgx(r"@"), fro.intp, fro.rgx(r"@"), fro.intp])
composition.parse_str("123@45@6") # returns the tuple (123, "@", 45, "@", 6)
When composition
attempts to chomp "123@45@6"
, the first fro.intp
consumes "123"
and produces 123
.
Then, the remaining, unconsumed "@45@6"
is given to the first r"@"
parser to chomp, which consumes and produces
"@"
. After this, the remaining, unconsumed "45@6"
is given to the second fro.intp
, so it continues for each
of the composition’s children parsers. The children parsers produces the values 123
, "@"
, 45
, "@"
, and
6
respectively, so the composition produces the tuple (123, "@", 45, "@", 6)
.
Parser significance¶
Sometimes, when we compose multiple parsers together, some of the values produced by the children parsers are not important. In the above
example, we might only care about the three int values, and not about the delimiting "@"
values.
To exclude some produced values from the resulting tuple of a composition, we can mark some of it’s children as
insignificant:
composition = fro.comp([fro.intp, ~fro.rgx(r"@"), fro.intp, ~fro.rgx(r"@"), fro.intp])
composition.parse("123@45@6") # returns the tuple (123, 45, 6)
Here, the ~fro.rgx(r"@")
evaluates to an insignificant version of the parser fro.rgx(r"@")
; for more on the syntax of marking parsers as
significant or insignificant, see Fro Interface. What is important to notice is that the "@"
value are not present in the tuple
value produced by composition
.
Parsers are significant by default. If a parser is insignificant, that only means that the values it produces will be ignored when it appears inside a
composition. An insignificant parser will still produce a value if you directly call parse(..)
on it, or if it
appears in something other than a composition.
Chunks¶
When parsing a large text file, it’s preferable to not have to read the entire file into memory. Instead, iterating through the file one piece at a time leads to much better use of memory.
To support efficient use of memory, Fro breaks the text it is parsing into “chunks”. When a parser chomps, it chomps on one chunk at a time, and only moves to the next chunk once it has completely consumed the current one. Regular expression parsers, which are used to construct practically every other type of parser, can only operate inside a single chunk.
Let’s consider an example:
composition = fro.comp([r"abc", r"def"])
# the first argument to parse(..) is a collection of chunks to parse
composition.parse(["abcd", "ef"])
# This parse will fail. The first regex parser will chomp "abc" off of the first
# chunk "abcd", leaving "d" behind. The second regex parser will try to chomp off
# the remainder of the first chunk, but fail. Since regular expressions cannot
# "wrap" around to the next chunk if the current chunk is not fully consumed, it
# does not matter that "ef" are waiting for us in the second chunk.
By default, the lines of a text file serve as the chunks of a file. If you want to
split your input text into chunks some other way, you can pass any iterable collection of
strings into a parser’s parse(..)
method, and the parser will treat each element
as an individual chunk.
Since a parser can only move onto the next chunk after it has completely consumed the current chunk, it is important that a parser can unambiguously decide how to parse a chunk before moving onto the next one. To make this more concrete, let’s consider another example:
a_then_b = fro.comp([r"a", r"b"])
a_then_c = fro.comp([r"a", r"c"])
# fro.alt(..) constructs an alternation parser, see the docs for more info
ab_or_ac = fro.alt([a_then_b, a_then_c])
ab_or_ac.parse(["a", "c"])
# This parse will fail. The ab_or_ac parser will first try to parse with a_then_b.
# The r"a" regex will chomp the entire first chunk ("a"). When the r"b" regex tries
# to chomp the second chunk ("c"), it will fail. At this point, the parser has
# already advanced to the second chunk, so it has no way of returning to the first
# chunk to try chomping with a_then_c, so it fails immediately.
In the above example, the parser can’t know how to interpret the "a"
in the first chunk without
looking at the second chunk. That is, the parser doesn’t know if the "a"
is part of something that
a_then_b
will recognize, or part of something that a_then_c
will recognize. In this case,
it blindly chooses a_then_b
, and fails.
What about all of the above examples where we just called parser.parse_str
,
and didn’t worry about chunks? When called with the parse_str
method, a parser
treats the entire string as a single chunk.