pyparsing module - Classes and methods to define and execute parsing grammars
The pyparsing module is an alternative approach to creating and executing simple grammars, vs. the traditional lex/yacc approach, or the use of regular expressions. With pyparsing, you don't need to learn a new syntax for defining grammars or matching expressions - the parsing module provides a library of classes that you use to construct the grammar directly in Python.
Here is a program to parse "Hello, World!" (or any greeting of the form "<salutation>, <addressee>!"), built up using Word, Literal, and And elements ('+' operator gives And expressions, strings are auto-converted to Literal expressions):

    from pyparsing import Word, alphas

    # define grammar of a greeting
    greet = Word(alphas) + "," + Word(alphas) + "!"

    hello = "Hello, World!"
    print(hello, "->", greet.parseString(hello))

The program outputs the following:
Hello, World! -> ['Hello', ',', 'World', '!']
The Python representation of the grammar is quite readable, owing to the self-explanatory class names, and the use of '+', '|' and '^' operators.
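A minimal sketch of how the three operators differ, assuming that '|' builds a MatchFirst and '^' builds an Or (as the class summaries on this page describe them). The variable names are illustrative only.

```python
from pyparsing import Literal

# '+' builds an And: both pieces must appear, in the given order
pair = Literal("a") + Literal("b")

# '|' builds a MatchFirst: alternatives are tried left to right
first_match = Literal("a") | Literal("ab")

# '^' builds an Or: every alternative is tried and the longest match wins
longest_match = Literal("a") ^ Literal("ab")

pair_tokens = pair.parseString("a b").asList()
first_tokens = first_match.parseString("ab").asList()
longest_tokens = longest_match.parseString("ab").asList()
print(pair_tokens, first_tokens, longest_tokens)
```

With '|', the order of alternatives matters, so longer alternatives are usually listed first; '^' trades some speed for order independence.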
The ParseResults object returned from ParserElement.parseString can be accessed as a nested list, a dictionary, or an object with named attributes.
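The three access styles can be sketched as follows. The results names ("name", "version", "major", "minor") and the input string are made up for this illustration.

```python
from pyparsing import Group, Word, alphas, nums

# results names below are labels chosen for this sketch
version = Group(Word(nums)("major") + "." + Word(nums)("minor"))
release = Word(alphas)("name") + version("version")

result = release.parseString("pyparsing 2.2")

as_list = result.asList()        # nested list view
by_key = result["name"]          # dictionary-style access
by_attr = result.version.minor   # attribute-style access
print(as_list, by_key, by_attr)
```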
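The pyparsing module handles some of the problems that are typically vexing when writing text parsers:
 - extra or missing whitespace (the above program will also handle "Hello,World!", "Hello  ,  World  !", etc.)
 - quoted strings
 - embedded comments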
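As a small illustration of the whitespace handling, the greeting grammar from the introduction can be run against a few spacing variants. The variant strings are chosen for this sketch.

```python
from pyparsing import Word, alphas

greet = Word(alphas) + "," + Word(alphas) + "!"

# the same greeting with normal, missing, and extra whitespace
variants = ["Hello, World!", "Hello,World!", "Hello  ,  World  !"]
parsed = [greet.parseString(text).asList() for text in variants]

for text, tokens in zip(variants, parsed):
    print(repr(text), "->", tokens)
```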
Version: 2.2.0
Author: Paul McGuire <ptmcg@users.sourceforge.net>
Classes | |
ParseBaseException | Base exception class for all parsing runtime exceptions. |
ParseException | Exception thrown when a parse expression does not match the input; supported attributes by name are lineno (line number of the exception text), col (column number of the exception text), and line (the line containing the exception text). |
ParseFatalException | User-throwable exception thrown when inconsistent parse content is found; stops all parsing immediately. |
ParseSyntaxException | Just like ParseFatalException, but thrown internally when an ErrorStop ('-' operator) indicates that parsing is to stop immediately because an unbacktrackable syntax error has been found. |
RecursiveGrammarException | Exception thrown by ParserElement.validate if the grammar could be improperly recursive. |
ParseResults | Structured parse results, providing multiple means of access to the parsed data: as a list (len(results)), by list index (results[0], results[1], etc.), or by attribute (results.<resultsName>). |
ParserElement | Abstract base level parser element class. |
Token | Abstract ParserElement subclass, for defining atomic matching patterns. |
Empty | An empty token, will always match. |
NoMatch | A token that will never match. |
Literal | Token to exactly match a specified string. |
Keyword | Token to exactly match a specified string as a keyword, that is, it must be immediately followed by a non-keyword character. |
CaselessLiteral | Token to match a specified string, ignoring case of letters. |
CaselessKeyword | Caseless version of Keyword. |
CloseMatch | A variation on Literal which matches "close" matches, that is, strings with at most 'n' mismatching characters. |
Word | Token for matching words composed of allowed character sets. |
Regex | Token for matching strings that match a given regular expression. |
QuotedString | Token for matching strings that are delimited by quoting characters. |
CharsNotIn | Token for matching words composed of characters not in a given set (will include whitespace in matched characters if not listed in the provided exclusion set). |
White | Special matching class for matching whitespace. |
GoToColumn | Token to advance to a specific column of input text; useful for tabular report scraping. |
LineStart | Matches if current position is at the beginning of a line within the parse string. |
LineEnd | Matches if current position is at the end of a line within the parse string. |
StringStart | Matches if current position is at the beginning of the parse string. |
StringEnd | Matches if current position is at the end of the parse string. |
WordStart | Matches if the current position is at the beginning of a Word, and is not preceded by any character in a given set of wordChars (default=printables). |
WordEnd | Matches if the current position is at the end of a Word, and is not followed by any character in a given set of wordChars (default=printables). |
ParseExpression | Abstract subclass of ParserElement, for combining and post-processing parsed tokens. |
And | Requires all given ParseExpressions to be found in the given order. |
Or | Requires that at least one ParseExpression is found. If two expressions match, the expression that matches the longest string is used. |
MatchFirst | Requires that at least one ParseExpression is found. If two expressions match, the first one listed is the one that matches. |
Each | Requires all given ParseExpressions to be found, but in any order. |
ParseElementEnhance | Abstract subclass of ParserElement, for combining and post-processing parsed tokens. |
FollowedBy | Lookahead matching of the given parse expression. |
NotAny | Lookahead to disallow matching with the given parse expression. |
OneOrMore | Repetition of one or more of the given expression. |
ZeroOrMore | Optional repetition of zero or more of the given expression. |
Optional | Optional matching of the given expression. |
SkipTo | Token for skipping over all undefined text until the matched expression is found. |
Forward | Forward declaration of an expression to be defined later - used for recursive grammars, such as algebraic infix notation. |
TokenConverter | Abstract subclass of ParseElementEnhance, for converting parsed results. |
Combine | Converter to concatenate all matching tokens to a single string. |
Group | Converter to return the matched tokens as a list - useful for returning tokens of ZeroOrMore and OneOrMore expressions. |
Dict | Converter to return a repetitive expression as a list, but also as a dictionary. |
Suppress | Converter for ignoring the results of a parsed expression. |
OnlyOnce | Wrapper for parse actions, to ensure they are only called once. |
pyparsing_common | A collection of common low-level expressions that may be useful in jump-starting parser development. |
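To show how several of these classes combine, here is a hedged sketch of a recursive grammar for bracketed lists of integers, using Forward, Group, Suppress and ZeroOrMore. The grammar and the names LBRACK, RBRACK, item and nested_list are invented for this example.

```python
from pyparsing import Forward, Group, Suppress, Word, ZeroOrMore, nums

# Suppress keeps the brackets out of the results
LBRACK, RBRACK = map(Suppress, "[]")

# Forward declares 'item' before its definition, allowing recursion
item = Forward()

# Group returns each bracketed list as its own sub-list
nested_list = Group(LBRACK + ZeroOrMore(item) + RBRACK)
item <<= Word(nums) | nested_list

nested = nested_list.parseString("[1 [2 3] [[4]]]").asList()
print(nested)
```

Without Group, all the integers would come back as one flat list; without Suppress, the bracket characters would appear among the tokens.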
Variables | |
alphas = | |
nums = | |
hexnums = | |
alphanums = | |
printables = | |
empty = empty | |
lineStart = lineStart | |
lineEnd = lineEnd | |
stringStart = stringStart | |
stringEnd = stringEnd | |
opAssoc = _Constants() | |
dblQuotedString = string enclosed in double quotes | |
sglQuotedString = string enclosed in single quotes | |
quotedString = quotedString using single or double quotes | |
unicodeString = unicode string literal | |
alphas8bit = | |
punc8bit = | |
commonHTMLEntity = common HTML entity | |
cStyleComment = C style comment | Comment of the form /* ... */ |
htmlComment = HTML comment | Comment of the form <!-- ... --> |
restOfLine = rest of line | |
dblSlashComment = // comment | Comment of the form // ... (to end of line) |
cppStyleComment = C++ style comment | Comment of either form cStyleComment or dblSlashComment |
javaStyleComment = C++ style comment | Same as cppStyleComment |
pythonStyleComment = Python style comment | Comment of the form # ... (to end of line) |
commaSeparatedList = commaSeparatedList | (Deprecated) Predefined expression of 1 or more printable words or quoted strings, separated by commas. |
anyCloseTag = </any tag> | |
anyOpenTag = <any tag> | |
Function Details |
col
Returns current column within a string, counting newlines as line separators. The first column is number 1.
Note: the default parsing behavior is to expand tabs in the input string before starting the parsing process. See ParserElement.parseString for more information on parsing strings containing tab characters.
lineno
Returns current line number within a string, counting newlines as line separators. The first line is number 1.
Note: the default parsing behavior is to expand tabs in the input string before starting the parsing process. See ParserElement.parseString for more information on parsing strings containing tab characters.
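A minimal sketch of the two location helpers described above, assuming they are called as col(loc, text) and lineno(loc, text) with a zero-based character offset. The sample text is made up.

```python
from pyparsing import col, lineno

text = "first line\nsecond line"

# zero-based offset of the word "line" on the second row
loc = text.index("line", text.index("second"))

line_number = lineno(loc, text)   # rows are counted from 1
column = col(loc, text)           # columns are counted from 1
print(line_number, column)
```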
traceParseAction
Decorator for debugging parse actions. When the parse action is called, this decorator prints a ">>entering" line showing the method name, the current source line, the parse location, and the matched tokens. When the parse action completes, it prints a "<<leaving" line showing the returned value, or any exception that the parse action raised.
Example:

    wd = Word(alphas)

    @traceParseAction
    def remove_duplicate_chars(tokens):
        return ''.join(sorted(set(''.join(tokens))))

    wds = OneOrMore(wd).setParseAction(remove_duplicate_chars)
    print(wds.parseString("slkdjs sld sldd sdlf sdljf"))

prints:

    >>entering remove_duplicate_chars(line: 'slkdjs sld sldd sdlf sdljf', 0, (['slkdjs', 'sld', 'sldd', 'sdlf', 'sdljf'], {}))
    <<leaving remove_duplicate_chars (ret: 'dfjkls')
    ['dfjkls']

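delimitedList
Helper to define a delimited list of expressions - the delimiter defaults to ','. By default, the list elements and delimiters can have intervening whitespace, and comments, but this can be overridden by passing combine=True. If combine is set to True, the matching tokens are returned as a single token string, with the delimiters included; otherwise, the matching tokens are returned as a list of tokens, with the delimiters suppressed.
Example:

    delimitedList(Word(alphas)).parseString("aa,bb,cc") # -> ['aa', 'bb', 'cc']
    delimitedList(Word(hexnums), delim=':', combine=True).parseString("AA:BB:CC:DD:EE") # -> ['AA:BB:CC:DD:EE']
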
countedArray
Helper to define a counted list of expressions. This helper defines a pattern of the form:

    integer expr expr expr...

where the leading integer tells how many expr expressions follow. The matched tokens are returned as a list of the expr tokens - the leading count token is suppressed.
If intExpr is specified, it should be a pyparsing expression that produces an integer value.
Example:

    countedArray(Word(alphas)).parseString('2 ab cd ef')  # -> ['ab', 'cd']

    # in this parser, the leading integer value is given in binary,
    # '10' indicating that 2 values are in the array
    binaryConstant = Word('01').setParseAction(lambda t: int(t[0], 2))
    countedArray(Word(alphas), intExpr=binaryConstant).parseString('10 ab cd ef')  # -> ['ab', 'cd']

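matchPreviousLiteral
Helper to define an expression that is indirectly defined from the tokens matched in a previous expression, that is, it looks for a 'repeat' of a previous expression. For example:

    first = Word(nums)
    second = matchPreviousLiteral(first)
    matchExpr = first + ":" + second

will match "1:1", but not "1:2". Because this matches a previous literal, it will also match the leading "1:1" in "1:10". If this is not desired, use matchPreviousExpr. Do not use with packrat parsing enabled.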
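matchPreviousExpr
Helper to define an expression that is indirectly defined from the tokens matched in a previous expression, that is, it looks for a 'repeat' of a previous expression. For example:

    first = Word(nums)
    second = matchPreviousExpr(first)
    matchExpr = first + ":" + second

will match "1:1", but not "1:2". Because this matches by expressions, it will not match the leading "1:1" in "1:10"; the expressions are evaluated first, and then compared, so "1" is compared with "10". Do not use with packrat parsing enabled.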
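The difference between the two helpers shows up on an input such as "1:10". The sketch below probes both with a small helper function, accepts, which is hypothetical and defined only for this illustration.

```python
from pyparsing import (ParseException, Word, matchPreviousExpr,
                       matchPreviousLiteral, nums)

def accepts(expr, text):
    """Return True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except ParseException:
        return False

# separate 'first' expressions, since each helper attaches a parse action to it
first_lit = Word(nums)
by_literal = first_lit + ":" + matchPreviousLiteral(first_lit)

first_expr = Word(nums)
by_expr = first_expr + ":" + matchPreviousExpr(first_expr)

samples = ("1:1", "1:2", "1:10")
literal_results = [accepts(by_literal, s) for s in samples]
expr_results = [accepts(by_expr, s) for s in samples]
print(literal_results)
print(expr_results)
```

The literal form accepts "1:10" because it only looks for the text "1" after the colon, while the expression form re-parses a whole Word(nums) and compares "10" against "1".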
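oneOf
Helper to quickly define a set of alternative Literals. It makes sure to do longest-first testing when there is a conflict, regardless of the input order, but returns a MatchFirst for best performance.
Parameters:
 - strs - a string of space-delimited literals, or a collection of string literals
 - caseless - (default=False) - treat all literals as caseless
 - useRegex - (default=True) - as an optimization, will generate a Regex object; otherwise, will generate a MatchFirst object (if caseless=True, or if creating a Regex raises an exception)
Example:

    comp_oper = oneOf("< = > <= >= !=")
    var = Word(alphas)
    number = Word(nums)
    term = var | number
    comparison_expr = term + comp_oper + term
    print(comparison_expr.searchString("B = 12  AA=23 B<=AA AA>12"))

prints:

    [['B', '=', '12'], ['AA', '=', '23'], ['B', '<=', 'AA'], ['AA', '>', '12']]
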
dictOf
Helper to easily and clearly define a dictionary by specifying the respective patterns for the key and value. Takes care of defining the Dict, ZeroOrMore, and Group tokens in the proper order. The key pattern can include delimiting markers or punctuation, as long as they are suppressed, thereby leaving the significant key text. The value pattern can include named results, so that the Dict results can include named token fields.
Example:

    text = "shape: SQUARE posn: upper left color: light blue texture: burlap"
    # definitions of data_word and label, as used in the Dict example
    data_word = Word(alphas)
    label = data_word + FollowedBy(':')
    attr_expr = (label + Suppress(':') + OneOrMore(data_word, stopOn=label).setParseAction(' '.join))
    print(OneOrMore(attr_expr).parseString(text).dump())

    attr_label = label
    attr_value = Suppress(':') + OneOrMore(data_word, stopOn=label).setParseAction(' '.join)

    # similar to Dict, but simpler call format
    result = dictOf(attr_label, attr_value).parseString(text)
    print(result.dump())
    print(result['shape'])
    print(result.shape)  # object attribute access works too
    print(result.asDict())

prints (for the dictOf result):

    [['shape', 'SQUARE'], ['posn', 'upper left'], ['color', 'light blue'], ['texture', 'burlap']]
    - color: light blue
    - posn: upper left
    - shape: SQUARE
    - texture: burlap
    SQUARE
    SQUARE
    {'color': 'light blue', 'shape': 'SQUARE', 'posn': 'upper left', 'texture': 'burlap'}

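originalTextFor
Helper to return the original, untokenized text for a given expression. Useful to restore the parsed fields of an HTML start tag into the raw tag text itself, or to revert separate tokens with intervening whitespace back to the original matching input text. By default, returns a string containing the original parsed text.
If the optional asString argument is passed as False, then the return value is a ParseResults containing any results names that were originally matched, and a single token containing the original matched text from the input string. So if the expression passed to originalTextFor contains expressions with defined results names, you must set asString to False if you want to preserve those results name values.
Example:

    src = "this is test <b> bold <i>text</i> </b> normal text "
    for tag in ("b","i"):
        opener,closer = makeHTMLTags(tag)
        patt = originalTextFor(opener + SkipTo(closer) + closer)
        print(patt.searchString(src)[0])

prints:

    ['<b> bold <i>text</i> </b>']
    ['<i>text</i>']
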
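locatedExpr
Helper to decorate a returned token with its starting and ending locations in the input string. This helper adds the following results names:
 - locn_start = location where matched expression begins
 - locn_end = location where matched expression ends
 - value = the actual parsed results
Be careful if the input text contains tab characters; you may want to call ParserElement.parseWithTabs.
Example:

    wd = Word(alphas)
    for match in locatedExpr(wd).searchString("ljsdf123lksdjjf123lkkjj1222"):
        print(match)

prints:

    [[0, 'ljsdf', 5]]
    [[8, 'lksdjjf', 15]]
    [[18, 'lkkjj', 23]]
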
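srange
Helper to easily define string ranges for use in Word construction. Borrows syntax from regexp '[]' string range definitions:

    srange("[0-9]")   -> "0123456789"
    srange("[a-z]")   -> "abcdefghijklmnopqrstuvwxyz"
    srange("[a-z$_]") -> "abcdefghijklmnopqrstuvwxyz$_"

The input string must be enclosed in []'s, and the returned string is the expanded character set joined into a single string. The values enclosed in the []'s may be:
 - a single character
 - an escaped character with a leading backslash (such as \- or \])
 - an escaped hex character with a leading '\x' (\x21, which is a '!' character)
 - an escaped octal character with a leading '\0' (\041, which is a '!' character)
 - a range of any of the above, separated by a dash ('a-z', etc.)
 - any combination of the above ('aeiouy', 'a-zA-Z0-9_$', etc.)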
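A short sketch of srange feeding a Word definition. The identifier grammar and the variable names are assumptions made for this example, not part of the module.

```python
from pyparsing import Word, srange

digits = srange("[0-9]")

# first character: letter or underscore; remaining characters may include digits
ident_start = srange("[a-zA-Z_]")
ident_body = srange("[a-zA-Z0-9_]")
identifier = Word(ident_start, ident_body)

tokens = identifier.parseString("max_value2 = 10").asList()
print(digits, tokens)
```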
replaceWith
Helper method for common parse actions that simply return a literal value. Especially useful when used with ParserElement.transformString.
Example:

    import math

    num = Word(nums).setParseAction(lambda toks: int(toks[0]))
    na = oneOf("N/A NA").setParseAction(replaceWith(math.nan))
    term = na | num

    OneOrMore(term).parseString("324 234 N/A 234") # -> [324, 234, nan, 234]

removeQuotes
Helper parse action for removing quotation marks from parsed quoted strings.
Example:

    # by default, quotation marks are included in parsed results
    quotedString.parseString("'Now is the Winter of our Discontent'") # -> ["'Now is the Winter of our Discontent'"]

    # use removeQuotes to strip quotation marks from parsed results
    quotedString.setParseAction(removeQuotes)
    quotedString.parseString("'Now is the Winter of our Discontent'") # -> ["Now is the Winter of our Discontent"]

tokenMap
Helper to define a parse action by mapping a function to all elements of a ParseResults list. If any additional args are passed, they are forwarded to the given function as additional arguments after the token, as in hex_ints = OneOrMore(Word(hexnums)).setParseAction(tokenMap(int, 16)), which will convert the parsed data to an integer using base 16.
Example (compare the last example to the one in ParserElement.transformString):

    hex_ints = OneOrMore(Word(hexnums)).setParseAction(tokenMap(int, 16))
    hex_ints.runTests('''
        00 11 22 aa FF 0a 0d 1a
        ''')

    upperword = Word(alphas).setParseAction(tokenMap(str.upper))
    OneOrMore(upperword).runTests('''
        my kingdom for a horse
        ''')

    wd = Word(alphas).setParseAction(tokenMap(str.title))
    OneOrMore(wd).setParseAction(' '.join).runTests('''
        now is the winter of our discontent made glorious summer by this sun of york
        ''')

prints:

    00 11 22 aa FF 0a 0d 1a
    [0, 17, 34, 170, 255, 10, 13, 26]

    my kingdom for a horse
    ['MY', 'KINGDOM', 'FOR', 'A', 'HORSE']

    now is the winter of our discontent made glorious summer by this sun of york
    ['Now Is The Winter Of Our Discontent Made Glorious Summer By This Sun Of York']

upcaseTokens
(Deprecated) Helper parse action to convert tokens to upper case. Deprecated in favor of pyparsing_common.upcaseTokens.
downcaseTokens
(Deprecated) Helper parse action to convert tokens to lower case. Deprecated in favor of pyparsing_common.downcaseTokens.
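A hedged sketch of the suggested replacements, attaching pyparsing_common.upcaseTokens and pyparsing_common.downcaseTokens as parse actions. The sample words are arbitrary.

```python
from pyparsing import OneOrMore, Word, alphas, pyparsing_common

# each word is converted as soon as it is matched
upper_word = Word(alphas).setParseAction(pyparsing_common.upcaseTokens)
lower_word = Word(alphas).setParseAction(pyparsing_common.downcaseTokens)

shouted = OneOrMore(upper_word).parseString("my kingdom").asList()
whispered = OneOrMore(lower_word).parseString("For A Horse").asList()
print(shouted, whispered)
```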
makeHTMLTags
Helper to construct opening and closing tag expressions for HTML, given a tag name. Matches tags in either upper or lower case, attributes with namespaces and with quoted or unquoted values.
Example:

    text = '<td>More info at the <a href="http://pyparsing.wikispaces.com">pyparsing</a> wiki page</td>'
    # makeHTMLTags returns pyparsing expressions for the opening and closing tags as a 2-tuple
    a,a_end = makeHTMLTags("A")
    link_expr = a + SkipTo(a_end)("link_text") + a_end

    for link in link_expr.searchString(text):
        # attributes in the <A> tag (like "href" shown here) are also accessible as named results
        print(link.link_text, '->', link.href)

prints:

    pyparsing -> http://pyparsing.wikispaces.com

makeXMLTags
Helper to construct opening and closing tag expressions for XML, given a tag name. Matches tags only in the given upper/lower case.
Example: similar to makeHTMLTags
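Since the entry above only points to makeHTMLTags, here is a minimal sketch of the XML variant. The tag name "item", the attribute "sku", and the sample text are invented for this illustration; the upper-case tag is included to show the case-sensitive matching.

```python
from pyparsing import SkipTo, makeXMLTags

text = '<item sku="42">widget</item> <ITEM sku="7">ignored</ITEM>'

# makeXMLTags returns the opening and closing tag expressions as a 2-tuple
item, item_end = makeXMLTags("item")
item_expr = item + SkipTo(item_end)("body") + item_end

# attributes of the opening tag are available as named results
found = [(match["sku"], match["body"]) for match in item_expr.searchString(text)]
print(found)
```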
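withAttribute
Helper to create a validating parse action to be used with start tags created with makeXMLTags or makeHTMLTags. Use withAttribute to qualify a starting tag with a required attribute value, to avoid false matches on common tags such as <TD> or <DIV>.
Call withAttribute with a series of attribute names and values. Specify the list of filter attribute names and values as:
 - keyword arguments, as in (align="right"), or
 - as an explicit dict with the ** operator, when an attribute name is also a Python reserved word, as in **{"class":"Customer", "align":"right"}
 - a list of name-value tuples, as in ( ("ns1:class", "Customer"), ("ns2:align","right") )
For attribute names with a namespace prefix, you must use the second form. Attribute names are matched insensitive to upper/lower case.
If just testing for class (with or without a namespace), use withClass.
To verify that the attribute exists, but without specifying a value, pass withAttribute.ANY_VALUE as the value.
Example:

    html = '''
        <div>
        Some text
        <div type="grid">1 4 0 1 0</div>
        <div type="graph">1,3 2,3 1,1</div>
        <div>this has no type</div>
        </div>
    '''
    div,div_end = makeHTMLTags("div")

    # only match div tag having a type attribute with value "grid"
    div_grid = div().setParseAction(withAttribute(type="grid"))
    grid_expr = div_grid + SkipTo(div | div_end)("body")
    for grid_header in grid_expr.searchString(html):
        print(grid_header.body)

    # construct a match with any div tag having a type attribute, regardless of the value
    div_any_type = div().setParseAction(withAttribute(type=withAttribute.ANY_VALUE))
    div_expr = div_any_type + SkipTo(div | div_end)("body")
    for div_header in div_expr.searchString(html):
        print(div_header.body)

prints:

    1 4 0 1 0

    1 4 0 1 0
    1,3 2,3 1,1
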
withClass
Simplified version of withAttribute when matching on a div class - made difficult because class is a reserved word in Python.
Example:

    html = '''
        <div>
        Some text
        <div class="grid">1 4 0 1 0</div>
        <div class="graph">1,3 2,3 1,1</div>
        <div>this <div> has no class</div>
        </div>
    '''
    div,div_end = makeHTMLTags("div")
    div_grid = div().setParseAction(withClass("grid"))

    grid_expr = div_grid + SkipTo(div | div_end)("body")
    for grid_header in grid_expr.searchString(html):
        print(grid_header.body)

    div_any_type = div().setParseAction(withClass(withAttribute.ANY_VALUE))
    div_expr = div_any_type + SkipTo(div | div_end)("body")
    for div_header in div_expr.searchString(html):
        print(div_header.body)

prints:

    1 4 0 1 0

    1 4 0 1 0
    1,3 2,3 1,1

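infixNotation
Helper method for constructing grammars of expressions made up of operators working in a precedence hierarchy. Operators may be unary or binary, left- or right-associative. Parse actions can also be attached to operator expressions. The generated parser will also recognize the use of parentheses to override operator precedences (see example below).
Note: if you define a deep operator list, you may see performance issues when using infixNotation. See ParserElement.enablePackrat for a mechanism to potentially improve your parser performance.
Parameters:
 - baseExpr - expression representing the most basic element for the nested expression
 - opList - list of tuples, one for each operator precedence level in the expression grammar; each tuple is of the form (opExpr, numTerms, rightLeftAssoc, parseAction), where:
    - opExpr is the pyparsing expression for the operator; may also be a string, which will be converted to a Literal; if numTerms is 3, opExpr is a tuple of two expressions, for the two operators separating the 3 terms
    - numTerms is the number of terms for this operator (must be 1, 2, or 3)
    - rightLeftAssoc is the indicator whether the operator is right or left associative, using the pyparsing-defined constants opAssoc.RIGHT and opAssoc.LEFT
    - parseAction is the parse action to be associated with expressions matching this operator expression (the parse action tuple member may be omitted)
 - lpar - expression for matching left-parentheses (default=Suppress('('))
 - rpar - expression for matching right-parentheses (default=Suppress(')'))
Example:

    # simple example of four-function arithmetic with ints and variable names
    integer = pyparsing_common.signed_integer
    varname = pyparsing_common.identifier

    arith_expr = infixNotation(integer | varname,
        [
        ('-', 1, opAssoc.RIGHT),
        (oneOf('* /'), 2, opAssoc.LEFT),
        (oneOf('+ -'), 2, opAssoc.LEFT),
        ])

    arith_expr.runTests('''
        5+3*6
        (5+3)*6
        -2--11
        ''', fullDump=False)

prints:

    5+3*6
    [[5, '+', [3, '*', 6]]]

    (5+3)*6
    [[[5, '+', 3], '*', 6]]

    -2--11
    [[['-', 2], '-', ['-', 11]]]
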
operatorPrecedence
(Deprecated) Former name of infixNotation; will be dropped in a future release.
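nestedExpr
Helper method for defining nested lists enclosed in opening and closing delimiters ("(" and ")" are the default).
Parameters:
 - opener - opening character for a nested list (default="("); can also be a pyparsing expression
 - closer - closing character for a nested list (default=")"); can also be a pyparsing expression
 - content - expression for items within the nested lists (default=None)
 - ignoreExpr - expression for ignoring opening and closing delimiters (default=quotedString)
If an expression is not provided for the content argument, the nested expression will capture all whitespace-delimited content between delimiters as a list of separate values.
Use the ignoreExpr argument to define expressions that may contain opening or closing characters that should not be treated as opening or closing characters for nesting, such as quotedString or a comment expression. Specify multiple expressions using an Or or MatchFirst. The default is quotedString, but if no expressions are to be ignored, then pass None for this argument.
Example:

    data_type = oneOf("void int short long char float double")
    decl_data_type = Combine(data_type + Optional(Word('*')))
    ident = Word(alphas+'_', alphanums+'_')
    number = pyparsing_common.number
    arg = Group(decl_data_type + ident)
    LPAR,RPAR = map(Suppress, "()")

    code_body = nestedExpr('{', '}', ignoreExpr=(quotedString | cStyleComment))

    c_function = (decl_data_type("type")
                  + ident("name")
                  + LPAR + Optional(delimitedList(arg), [])("args") + RPAR
                  + code_body("body"))
    c_function.ignore(cStyleComment)

    source_code = '''
        int is_odd(int x) {
            return (x%2);
        }

        int dec_to_hex(char hchar) {
            if (hchar >= '0' && hchar <= '9') {
                return (ord(hchar)-ord('0'));
            } else {
                return (10+ord(hchar)-ord('A'));
            }
        }
    '''
    for func in c_function.searchString(source_code):
        print("%(name)s (%(type)s) args: %(args)s" % func)

prints:

    is_odd (int) args: [['int', 'x']]
    dec_to_hex (int) args: [['char', 'hchar']]
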
indentedBlock
Helper method for defining space-delimited indentation blocks, such as those used to define block statements in Python source code.
Parameters:
 - blockStatementExpr - expression defining syntax of statement that is repeated within the indented block
 - indentStack - list created by caller to manage indentation stack (multiple statementWithIndentedBlock expressions within a single grammar should share a common indentStack)
 - indent - boolean indicating whether block must be indented beyond the current level; set to False for block of left-most statements (default=True)
A valid block must contain at least one blockStatement.
Example:

    data = '''
    def A(z):
      A1
      B = 100
      G = A2
      A2
      A3
    B
    def BB(a,b,c):
      BB1
      def BBA():
        bba1
        bba2
        bba3
    C
    D
    def spam(x,y):
         def eggs(z):
             pass
    '''

    indentStack = [1]
    stmt = Forward()

    identifier = Word(alphas, alphanums)
    funcDecl = ("def" + identifier + Group( "(" + Optional( delimitedList(identifier) ) + ")" ) + ":")
    func_body = indentedBlock(stmt, indentStack)
    funcDef = Group( funcDecl + func_body )

    rvalue = Forward()
    funcCall = Group(identifier + "(" + Optional(delimitedList(rvalue)) + ")")
    rvalue << (funcCall | identifier | Word(nums))
    assignment = Group(identifier + "=" + rvalue)
    stmt << ( funcDef | assignment | identifier )

    module_body = OneOrMore(stmt)

    parseTree = module_body.parseString(data)
    parseTree.pprint()

prints:

    [['def',
      'A',
      ['(', 'z', ')'],
      ':',
      [['A1'], [['B', '=', '100']], [['G', '=', 'A2']], ['A2'], ['A3']]],
     'B',
     ['def',
      'BB',
      ['(', 'a', 'b', 'c', ')'],
      ':',
      [['BB1'], [['def', 'BBA', ['(', ')'], ':', [['bba1'], ['bba2'], ['bba3']]]]]],
     'C',
     'D',
     ['def',
      'spam',
      ['(', 'x', 'y', ')'],
      ':',
      [[['def', 'eggs', ['(', 'z', ')'], ':', [['pass']]]]]]]

Variables Details |
alphanums
printables
alphas8bit
commaSeparatedList
(Deprecated) Predefined expression of 1 or more printable words or quoted strings, separated by commas. This expression is deprecated in favor of pyparsing_common.comma_separated_list.
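A minimal sketch of the recommended replacement, pyparsing_common.comma_separated_list, on a simple input without embedded spaces or quotes. The sample string is arbitrary.

```python
from pyparsing import pyparsing_common

# each comma-delimited field comes back as a separate string token
fields = pyparsing_common.comma_separated_list.parseString("alpha,beta,100").asList()
print(fields)
```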
Generated by Epydoc 3.0.1 on Sun Mar 05 20:19:55 2017 | http://epydoc.sourceforge.net |