Syntax Reference

This document describes Spidy language syntax, statements, and operators.

Spidy Commands

get

class spidy.language.get_node.GetNode(context)

Loads the document specified by a URL string, either from the Web or from the local file system. The document becomes the main data source for the whole execution context and for the &, skip and traverse operators and statements:

get 'www.someresource.com/document.html'

To specify the expected document format, use the as operator:

get 'www.someresource.com/index' as html

To specify headers, end the statement with : and add the headers as an indented block:

get 'www.someresource.com/index' as html:
    User-agent: 'Firefox/25.0'

When loading from a Web resource, the default headers are:

Accept          : 'text/html,application/xhtml+xml,application/xml',
Accept-Language : 'en-US,en;q=0.5',
Connection      : 'keep-alive',
Cache-Control   : 'max-age=0'
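
The Python sketch below illustrates one way the default headers and a custom header block could be combined. It assumes that custom headers are added on top of the defaults and replace a default with the same name; the text above does not state this, and build_headers is a hypothetical helper, not part of Spidy:

```python
# Default headers as listed above.
DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
}


def build_headers(custom=None):
    """Return the defaults with custom headers layered on top (assumed behaviour)."""
    headers = dict(DEFAULT_HEADERS)   # copy, so the defaults stay untouched
    headers.update(custom or {})      # assumption: custom values win on a name clash
    return headers


headers = build_headers({"User-agent": "Firefox/25.0"})
```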

traverse

class spidy.language.traverse_node.TraverseNode(context)

Traverses the document tree from a specified element or from the document root; the sub-branch is specified by the path operator.

Note

Internally, positioning on the sub-branch of a given element is done by the skip command. As a result, when it fails you may see SkipNode messages in the log.

Sets the loop variable to the element’s absolute (document-wide) path string, which can be used in the path operator &, in skip, or in another traverse statement.

Optionally, traverse method can be specified:

  • breadthfirst (default)
  • depthfirst

If method is specified, depth of traversing can be set as well. Default is 1.

Example, default form:

traverse div in &'//div[@data_container]':
    if &div == '':
        break

Or, visit every element of the document to find pictures:

images = []
traverse element in & depthfirst 1000000:
    if 'img' in element:
        images << &(element + '@src')    
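
To make the difference between the two traverse methods concrete, here is a minimal Python sketch of breadth-first and depth-first walks with a depth limit. It is not Spidy's implementation: the traverse and child_paths functions, the path format with per-tag indexes, and the reading of depth as "levels below the starting element" are all assumptions made for illustration:

```python
from collections import deque
import xml.etree.ElementTree as ET


def child_paths(node, path):
    """Pair each child with a path string, indexing siblings that share a tag."""
    counts = {}
    pairs = []
    for child in node:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        pairs.append((child, f"{path}/{child.tag}[{counts[child.tag]}]"))
    return pairs


def traverse(root, mode="breadthfirst", depth=1):
    """Yield path strings of elements below root, at most `depth` levels down."""
    pending = deque([(root, "/" + root.tag, 0)])
    while pending:
        # A queue gives breadth-first order, a stack gives depth-first order.
        if mode == "breadthfirst":
            node, path, level = pending.popleft()
        else:
            node, path, level = pending.pop()
        if level > 0:                      # the starting element itself is not visited
            yield path
        if level < depth:
            kids = [(c, p, level + 1) for c, p in child_paths(node, path)]
            # Reverse for the stack so children still come out in document order.
            pending.extend(kids if mode == "breadthfirst" else reversed(kids))


doc = ET.fromstring("<html><body><div><p/></div><div/></body></html>")
breadth = list(traverse(doc, "breadthfirst", 3))
deep = list(traverse(doc, "depthfirst", 3))
```

With this sample document, the breadth-first walk visits both div elements before the p element, while the depth-first walk visits p right after the first div.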

for...in

class spidy.language.forin_node.ForInNode(context)

Typical for...in statement, iterates through collection of items.

For example:

for item in items:
    lst << item

while

class spidy.language.while_node.WhileNode(context)

Typical while loop. Loops while the expression is True. If the expression evaluates to a list, loops while the list has items (as in Python).

Example:

host = 'www.cars.com/search'
while lst:
    lst >> next
    get (host + next) as json                        
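
For comparison, the same loop shape in Python, where a non-empty list is truthy and an empty one is falsy. The list contents are made-up sample values, and the use of list.pop() (which takes the last item) is an assumption about which end >> pops from:

```python
host = "www.cars.com/search"
lst = ["?page=1", "?page=2"]   # hypothetical sample items
visited = []

while lst:                      # stops as soon as the list is empty
    next_item = lst.pop()       # counterpart of: lst >> next
    visited.append(host + next_item)
```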

break

class spidy.language.break_node.BreakNode(context)

Typical break loop statement. Breaks for, while and traverse statements.

continue

class spidy.language.continue_node.ContinueNode(context)

Typical continue loop statement. Skips to the next iteration of for, while and traverse statements.

if...else

class spidy.language.ifelse_node.IfElseNode(context)

Typical if...else statement, implements execution flow control. else is optional.

Example:

if loaded:
    str = str + $count
    result = str
else:
    result = 'failed :('            

merge

class spidy.language.merge_node.MergeNode(context)

Merges the current execution context (all variables defined at that step) with the specified template and stores the result in a variable. Defines the variable if it does not exist yet.

The statement uses standard Python string template syntax, e.g.:

Hello, ${name}!

With the variable name set to ‘Alex’, the result is:

Hello, Alex!

Example:

merge 'master_page.html' as page
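
The template syntax mentioned above is that of Python's string.Template. A minimal sketch of the substitution step follows; the context dictionary stands in for the Spidy execution context, and how Spidy treats placeholders with no matching variable is not stated here:

```python
from string import Template

# Stand-in for the execution context: variable names mapped to values.
context = {"name": "Alex"}

# ${name} is replaced by the value of the variable called name.
page = Template("Hello, ${name}!").substitute(context)
```

Note that Template.substitute raises KeyError for a placeholder missing from the mapping, whereas Template.safe_substitute leaves it in place.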

log

class spidy.language.log_node.LogNode(context)

Logs string to file or stdout, ignores empty strings.

Example:

log 'loading next page'

Note

The statement logs messages as INFO.
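
The behaviour described above (INFO level, empty strings ignored) can be sketched with Python's logging module. The log function and the logger name are hypothetical and only mirror the description; they are not Spidy's code:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("spidy_sketch")   # hypothetical logger name


def log(message):
    """Log message at INFO level; return False when the message is skipped."""
    if message == "":
        return False          # empty strings are ignored
    logger.info(message)
    return True


logged = log("loading next page")
skipped = log("")
```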

return

class spidy.language.return_node.ReturnNode(context)

Terminates script execution and returns immediately. If expression is specified, evaluates it and returns result.

Example:

return

Or:

return (2 + 2) == 4

Spidy Operators

Spidy syntax supports all basic arithmetic, logical and comparison operators as well as some custom ones, e.g. XPath & or regex %. The table below describes operator precedence (a higher number binds tighter):

Operator                  Precedence
x[index]                  13
>>, <<, &                 12
-x, +x, $, #, %           11
/                         10
*                         9
-                         8
+                         7
==, !=, <, >, <=, >=      6
in                        5
not                       4
and                       3
or                        2
=                         1
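
To see how these numbers group operands, the Python sketch below applies precedence climbing to the four arithmetic operators only. It assumes left associativity and whitespace-separated tokens, neither of which is stated in this reference, and the group function is hypothetical:

```python
# Precedence values copied from the table above (arithmetic operators only).
PRECEDENCE = {"/": 10, "*": 9, "-": 8, "+": 7}


def parse(tokens, min_prec=0):
    """Precedence climbing: consume tokens and return a fully parenthesised string."""
    left = tokens.pop(0)
    while tokens and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # The right operand may only absorb operators that bind tighter than op
        # (this is what makes equal-precedence operators left-associative).
        right = parse(tokens, PRECEDENCE[op] + 1)
        left = f"({left} {op} {right})"
    return left


def group(expression):
    return parse(expression.split())


a = group("1 + 2 - 3")
b = group("2 * 6 / 3")
c = group("6 / 3 * 2")
```

Here a is "(1 + (2 - 3))", b is "(2 * (6 / 3))" and c is "((6 / 3) * 2)": subtraction and division are grouped before addition and multiplication, respectively.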

Containment Test

The in operator applies to lists and strings.

Note

When used with strings, in always compares strings in lower case.
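
A minimal Python sketch of this rule follows. The contains function is hypothetical, and the plain membership test used for lists is an assumption, since only the string case is described above:

```python
def contains(item, container):
    """Mimic the in operator: case-insensitive for strings, plain membership for lists."""
    if isinstance(container, str):
        # Both operands are compared in lower case.
        return item.lower() in container.lower()
    return item in container   # assumption: lists use ordinary equality


in_string = contains("IMG", "/html/body/img[1]")
in_list = contains(2, [1, 2, 3])
```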

Lists

Instantiating:

list = []
list = [1,2,3]

Indexer:

list = [1,2,3]
second = list[1]

Pop >>:

class spidy.language.pop_node.PopNode(context)

Pops an item from a list. Optionally, pops the item into the specified variable. If a list index is specified, removes the item at that index from the list.

Example, pop from list:

lst >>

Pop to variable:

lst >> current

Remove first item from list:

lst[0] >> first    

Push <<:

class spidy.language.push_node.PushNode(context)

Appends item to list. Optionally, inserts item at specified position, if list index is specified.

Example, push to list:

lst << 5

Insert at certain index:

lst[5] << 5
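
For readers who know Python lists, the push and pop forms map roughly onto the list methods below. That a bare >> takes the last item, as list.pop() does, is an assumption; this reference does not say which end is popped:

```python
lst = [1, 2, 3]

lst.append(5)          # lst << 5       -> [1, 2, 3, 5]
lst.insert(1, 9)       # lst[1] << 9    -> [1, 9, 2, 3, 5]

current = lst.pop()    # lst >> current (assumed to take the last item)
first = lst.pop(0)     # lst[0] >> first
```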

Type Conversion

To string $:

number = 1010101
string = $number

To number #:

string = '1010101'
number = #string
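
The two conversions behave roughly like the Python helpers below. Which numeric type # produces is not specified here, so trying an integer first and falling back to a float is an assumption; to_string and to_number are hypothetical names:

```python
def to_string(value):
    """Counterpart of $value."""
    return str(value)


def to_number(text):
    """Counterpart of #text: integer if possible, otherwise float (assumed)."""
    try:
        return int(text)
    except ValueError:
        return float(text)


string = to_string(1010101)
number = to_number("1010101")
```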

XPath

class spidy.language.path_node.PathNode(context)

Interprets the input string as an XPath expression and evaluates it against the current document. If an index selector is specified in the XPath expression, returns a single value; otherwise returns a list.

Note

Document should be loaded using get command before using & operator.

If used without a path (a bare &), returns the raw contents of the document.

If XPath expression can’t be resolved, returns either empty string or list.

Example:

get 'http://myprofile/main.html'
name = &'//span[@class=namefield][1]'

Supports the following selectors:

  • children selector: /
  • self plus all descendants selector: //
  • implicit self plus all descendants selector, if path starts from word character, e.g.: span[1]
  • name selector, ‘any’ and ‘current’ wildcards: /div or /* or .
  • index selector: /div[2]
  • attribute and/or its value selector: /div[@class] or /div[@class=header]
  • attribute getter: /div@class
  • alternative paths: /div[2] | /span[1]
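
The return-value rules above (a list without an index, a single value with one, an empty result when nothing matches) can be sketched in Python with xml.etree.ElementTree. Its XPath subset is not Spidy's dialect: attribute values must be quoted and descendant paths start with a dot. The select function and the sample document are made up for illustration:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body>"
    "<div class='header'><span class='namefield'>Alex</span></div>"
    "<div class='body'><span>one</span><span>two</span></div>"
    "</body></html>"
)


def select(path, index=None):
    """Return a list of texts, or a single text when a 1-based index is given."""
    matches = doc.findall(path)
    if index is None:
        return [element.text for element in matches]
    # Unresolved paths give an empty string instead of raising.
    return matches[index - 1].text if len(matches) >= index else ""


all_spans = select(".//span")                            # list of values
name = select(".//span[@class='namefield']", 1)          # single value
missing = select(".//table", 1)                          # empty string
classes = [div.get("class") for div in doc.findall(".//div")]   # attribute getter
```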

Regex

class spidy.language.regex_node.RegexNode(context)

The regex operator extracts substrings from strings using capturing groups. If two or more capturing groups are specified, the operator returns a list of captured values.

Note: if the left operand is a list, the regex is matched against every list item the same way it is matched against a single value. Ultimately, both the left and right operands must be strings; otherwise an evaluation exception is raised.

The regex operator finds all matches in the input string rather than stopping at the first one. If two or more matches are found, the operator returns a list of matches. The size of the resulting list is the number of matches multiplied by the number of capturing groups.

If capturing groups are omitted, the operator returns the whole match, the same way a standard regex does.

Example:

r = 'hello, John!' % 'hello, ([a-zA-Z]+)!'

will result in: 'John'.

Or for list of strings:

r = ['hello, John!', 'hello, Kate!'] % 'hello, ([a-zA-Z]+)!'

returns the following list: ['John', 'Kate']
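
The rules above can be approximated in Python with the re module. This is a sketch of the described behaviour, not Spidy's implementation: regex_op is a hypothetical name, and returning an empty list when nothing matches is an assumption:

```python
import re


def regex_op(value, pattern):
    """Approximate the % operator for a string or a list of strings."""
    if isinstance(value, list):
        results = []
        for item in value:                 # each item is matched like a single value
            found = regex_op(item, pattern)
            results.extend(found if isinstance(found, list) else [found])
        return results
    if not isinstance(value, str) or not isinstance(pattern, str):
        raise TypeError("both operands must be strings")
    captured = []
    for match in re.finditer(pattern, value):   # all matches, not just the first
        groups = match.groups()
        # Without capturing groups, the whole match is the result.
        captured.extend(groups if groups else [match.group(0)])
    return captured[0] if len(captured) == 1 else captured


single = regex_op("hello, John!", "hello, ([a-zA-Z]+)!")
many = regex_op(["hello, John!", "hello, Kate!"], "hello, ([a-zA-Z]+)!")
pairs = regex_op("a=1, b=2", r"(\w)=(\d)")   # 2 matches x 2 groups = 4 items
```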

Comments

//:

// initialize
log 'warming up...'
host = 'http://somehost.org'

Grammar Definition

Spidy grammar in slightly modified Extended Backus–Naur Form. Primitive types, e.g.: string, bool are omitted:

script           =  statement*

statement        = for_stmnt
                 | traverse_stmnt
                 | while_stmnt
                 | break_stmnt
                 | continue_stmnt
                 | if_stmnt
                 | else_stmnt
                 | get_stmnt
                 | skip_stmnt
                 | return_stmnt
                 | merge_stmnt
                 | log_stmnt
                 | expression

for_stmnt        =  "for" identity "in" list ":"

break_stmnt      =  "break"

continue_stmnt   =  "continue"

while_stmnt      =  "while" bool ":"

traverse_stmnt   =  "traverse" identity "in" path [mode depth] ":"

path             =  & | &string

mode             =  breadthfirst | depthfirst

depth            =  number

if_stmnt         =  "if" bool ":"

else_stmnt       =  "else:"

get_stmnt        =  "get" string[":"]
                       [header_stmnt]

header_stmnt     =  string ":" string

skip_stmnt       =  "skip" path [direction]

direction        =  forward | reverse

return_stmnt     =  "return" [expression]

merge_stmnt      =  "merge" template_string "as" identity

log_stmnt        =  "log" string

expression       =  value
                 | "(" ")" | "(" e ")" |   "-"   e |   "+"  e
                 | e "+" e | e "-"  e  | e "*"   e | e "/"  e
                 | e "<" e | e "<=" e  | e "=="  e | e "!=" e
                 | e ">" e | e ">=" e
                 | "not" e | e "or" e  | e "and" e | e "in" e
                 |  path   |  "#" e    | "$" e     | e "%"  e
                 | indexer
                 |(indexer | identity) = e
                 |(indexer | list) << e
                 |(indexer | list) >> (indexer | identity)

e                =  expression

value            =  none | number | string | bool | list

list             =  "[" [e ("," e)*] "]"

indexer          =  e "[" e "]"