Syntax Reference¶

This document describes the Spidy language syntax, statements and operators.

Spidy Commands¶

get¶

class spidy.language.get_node.GetNode(context)¶

Loads the document specified by a URL string, either from the Web or from the local file system. The document becomes the main data source for the whole execution context, and for the & operator and the skip and traverse statements:

get 'www.someresource.com/document.html'

To specify the expected document format, use the as operator:

get 'www.someresource.com/index' as html

To specify headers, end the statement with : and add the headers as an indented block:

get 'www.someresource.com/index' as html:
    User-agent: 'Firefox/25.0'

When loading from a Web resource, the default headers are:

Accept          : 'text/html,application/xhtml+xml,application/xml',
Accept-Language : 'en-US,en;q=0.5',
Connection      : 'keep-alive',
Cache-Control   : 'max-age=0'
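
How headers given in the indented block might combine with the defaults can be sketched in Python. This is only an illustration: the build_headers helper is hypothetical, and the behaviour that custom headers are added to the defaults and replace entries with the same name is an assumption, not something stated above.

```python
# Default headers as listed above.
DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
}


def build_headers(custom=None):
    """Return the defaults combined with custom headers (hypothetical helper).

    Assumption: custom headers are added to the defaults and win on conflict.
    """
    headers = dict(DEFAULT_HEADERS)
    headers.update(custom or {})
    return headers


# Mirrors the header block of the get example above.
headers = build_headers({"User-agent": "Firefox/25.0"})
```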

skip¶

class spidy.language.skip_node.SkipNode(context)¶

Moves the document path pointer to the specified element in the document tree. The target element is specified using an XPath expression.

Optionally, skip direction can be specified:

forward (default)
reverse

Note

All selectors are still applied when in reverse mode.

Note

skip command’s XPath expression always evaluates to single element even if index selector is not specified.

Example:

skip &'//div[@id=data_container]'

traverse¶

class spidy.language.traverse_node.TraverseNode(context)¶

Traverses document tree from specified element or from document root - sub-branch is specified by path operator.

Note

Internally, setting the sub-branch to a certain element is done by the skip command. Therefore, when it fails, you may see SkipNode messages in the log.

Sets the loop variable to the element's absolute (document-wide) path string, which can be used in the path operator &, in skip, or in other traverse statements.

Optionally, traverse method can be specified:

breadthfirst (default)
depthfirst

If method is specified, depth of traversing can be set as well. Default is 1.

Example, default form:

traverse div in &'//div[@data_container]':
    if &div == '':
        break

Or, visit every element of the document to find pictures:

images = []
traverse element in & depthfirst 1000000:
    if 'img' in element:
        images << &(element + '@src')    
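
The difference between the two traverse methods and the effect of the depth limit can be sketched in Python over a standard-library element tree. This is a minimal sketch, not Spidy's implementation: the traverse function below is hypothetical, and it assumes that depth 1 means direct children only and that the starting element itself is not visited.

```python
from collections import deque
import xml.etree.ElementTree as ET


def traverse(root, method="breadthfirst", depth=1):
    """Yield descendants of root, at most `depth` levels below it.

    Assumptions: depth 1 = direct children; root itself is not yielded.
    """
    if method == "breadthfirst":
        # Queue: all elements of one level are visited before the next level.
        queue = deque((child, 1) for child in root)
        while queue:
            node, level = queue.popleft()
            yield node
            if level < depth:
                queue.extend((child, level + 1) for child in node)
    else:
        # Stack: each branch is followed to the depth limit before siblings.
        stack = [(child, 1) for child in reversed(list(root))]
        while stack:
            node, level = stack.pop()
            yield node
            if level < depth:
                stack.extend((child, level + 1) for child in reversed(list(node)))


# Hypothetical document used only for illustration.
tree = ET.fromstring("<a><b><d/></b><c><e/></c></a>")
default_order = [n.tag for n in traverse(tree)]
breadth_order = [n.tag for n in traverse(tree, "breadthfirst", 2)]
depth_order = [n.tag for n in traverse(tree, "depthfirst", 2)]
```

With this tree, the default form visits only b and c, breadthfirst with depth 2 visits b, c, d, e, and depthfirst with depth 2 visits b, d, c, e.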

for...in¶

class spidy.language.forin_node.ForInNode(context)¶

Typical for...in statement, iterates through collection of items.

For example:

for item in items:
    lst << item

while¶

class spidy.language.while_node.WhileNode(context)¶

Typical while loop. Loops while expression is True. If expression evaluates to a list, loops while the list has items (like in Python).

Example:

host = 'www.cars.com/search'
while lst:
    lst >> next
    get (host + next) as json                        

break¶

class spidy.language.break_node.BreakNode(context)¶

Typical break loop statement. Breaks for, while and traverse statements.

continue¶

class spidy.language.continue_node.ContinueNode(context)¶

Typical continue loop statement. Skips the rest of the current iteration of for, while and traverse statements.

if...else¶

class spidy.language.ifelse_node.IfElseNode(context)¶

Typical if...else statement, implements execution flow control. else is optional.

Example:

if loaded:
    str = str + $count
    result = str
else:
    result = 'failed :('            

merge¶

class spidy.language.merge_node.MergeNode(context)¶

Merges the current execution context (all variables defined at that step) with the specified template and stores the result in a variable. Defines the variable if it does not exist.

The statement uses standard Python string template syntax, e.g.:

Hello, ${name}!

Having variable name set to ‘Alex’ results in:

Hello, Alex!

Example:

merge 'master_page.html' as page
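
The substitution performed by merge can be sketched with Python's string.Template, whose syntax the statement uses. This is a minimal sketch: the context dictionary and the inline template text are stand-ins for illustration, and whether Spidy tolerates placeholders that have no matching variable is an assumption (the sketch uses safe_substitute, which leaves them untouched).

```python
from string import Template

# Stand-in for the execution context: all variables defined at this step.
context = {"name": "Alex", "count": 3}

# Stand-in for the contents of a template file such as 'master_page.html'.
template = Template("Hello, ${name}!")

# safe_substitute leaves unknown placeholders as-is instead of raising;
# choosing it over substitute is an assumption of this sketch.
page = template.safe_substitute(context)
```

Here page holds 'Hello, Alex!'; variables that the template does not mention, such as count, are ignored.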

log¶

class spidy.language.log_node.LogNode(context)¶

Logs string to file or stdout, ignores empty strings.

Example:

log 'loading next page'

Note

The statement logs messages as INFO.

return¶

class spidy.language.return_node.ReturnNode(context)¶

Terminates script execution and returns immediately. If expression is specified, evaluates it and returns result.

Example:

return

Or:

return (2 + 2) == 4

Spidy Operators¶

Spidy syntax supports all basic arithmetic, logical and comparison operators as well as some custom ones, e.g. XPath & or regex %. The table below describes operator precedence; operators with a higher number bind more tightly:

Operator               Precedence
x[index]               13
>>, <<, &              12
-x, +x, $, #, %        11
/                      10
*                      9
-                      8
+                      7
==, !=, <, >, <=, >=   6
in                     5
not                    4
and                    3
or                     2
=                      1
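
The grouping implied by the table can be made visible with a small precedence-climbing sketch in Python that fully parenthesizes an expression. It is an illustration only, not Spidy's parser: it covers just the binary operators from the table, the tokenize and parenthesize helpers are hypothetical, and left-to-right grouping of operators on the same level is an assumption.

```python
import re

# Binary operator levels copied from the table (higher binds more tightly).
PRECEDENCE = {
    "/": 10, "*": 9, "-": 8, "+": 7,
    "==": 6, "!=": 6, "<": 6, ">": 6, "<=": 6, ">=": 6,
    "in": 5, "and": 3, "or": 2,
}

TOKEN = re.compile(r"\s*(==|!=|<=|>=|[-+*/<>()]|\w+)")


def tokenize(text):
    """Split an expression into operators, parentheses and words."""
    return TOKEN.findall(text)


def parenthesize(tokens, min_level=0):
    """Consume tokens and return a fully parenthesized expression string."""
    left = tokens.pop(0)
    if left == "(":
        left = parenthesize(tokens)
        tokens.pop(0)  # drop the closing ")"
    # Take operators only while they bind more tightly than the caller's.
    while tokens and PRECEDENCE.get(tokens[0], 0) > min_level:
        op = tokens.pop(0)
        right = parenthesize(tokens, PRECEDENCE[op])
        left = "(%s %s %s)" % (left, op, right)
    return left


grouped_sum = parenthesize(tokenize("a + b - c"))
grouped_bool = parenthesize(tokenize("a or b and c == d"))
```

Because - is listed above +, a + b - c groups as (a + (b - c)), and a or b and c == d groups as (a or (b and (c == d))).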

Containment Test¶

in operator is applicable to lists or strings.

Note

When used with strings, in always compares strings in lower case.
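
The case-insensitive behaviour for strings can be sketched in Python as follows. The contains helper is hypothetical; that both operands are lowercased before the test, and that list membership uses plain equality, are assumptions of this sketch.

```python
def contains(item, container):
    """Rough Python analog of Spidy's `item in container` (hypothetical)."""
    if isinstance(item, str) and isinstance(container, str):
        # Assumption: both sides are lowercased before the substring test.
        return item.lower() in container.lower()
    # Assumption: lists use ordinary element equality.
    return item in container


found_in_string = contains("IMG", "/html/body/img[1]")
found_in_list = contains(2, [1, 2, 3])
```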

Lists¶

Instantiating:

list = []
list = [1,2,3]

Indexer:

list = [1,2,3]
second = list[1]

Pop >>:

class spidy.language.pop_node.PopNode(context)¶

Pops item from list. Optionally, pops item to specified variable. If list index is specified, removes the item at that index from list.

Example, pop from list:

lst >>

Pop to variable:

lst >> current

Remove first item from list:

lst[0] >> first    

Push <<:

class spidy.language.push_node.PushNode(context)¶

Appends item to list. Optionally, inserts item at specified position, if list index is specified.

Example, push to list:

lst << 5

Insert at certain index:

lst[5] << 5
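
For readers who know Python, the list operators map roughly onto Python list methods as sketched below. The mapping is an approximation: which end of the list a plain pop takes from is not stated above, so taking the last item is an assumption, and the index used for the insert is chosen only for illustration.

```python
lst = [1, 2, 3, 4]

lst.pop()            # lst >>           (assumed to take the last item)
current = lst.pop()  # lst >> current   (pop to variable, same assumption)
first = lst.pop(0)   # lst[0] >> first  (remove the item at index 0)
lst.append(5)        # lst << 5         (append to list)
lst.insert(0, 7)     # lst[0] << 7      (insert at index 0)
```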

Type Conversion¶

To string $:

number = 1010101
string = $number

To number #:

string = '1010101'
number = #string

XPath¶

class spidy.language.path_node.PathNode(context)¶

Interprets input string as XPath expression and evaluates it against current document. If index selector is specified in XPath expression, returns single value, otherwise returns list.

Note

Document should be loaded using get command before using & operator.

If used without path - &, returns raw document’s contents.

If XPath expression can’t be resolved, returns either an empty string or an empty list.

Example:

get 'http://myprofile/main.html'
name = &'//span[@class=namefield][1]'

Supports the following selectors:

children selector: /
self plus all descendants selector: //
implicit self plus all descendants selector, if path starts from word character, e.g.: span[1]
name selector, ‘any’ and ‘current’ wildcards: /div or /* or .
index selector: /div[2]
attribute and/or its value selector: /div[@class] or /div[@class=header]
attribute getter: /div@class
alternative paths: /div[2] | /span[1]
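
Several of these selectors have close counterparts in the XPath subset of Python's standard xml.etree.ElementTree, which can help when reasoning about what a path selects. This is an analogy only: ElementTree's syntax differs from Spidy's (attribute values need quotes, and attributes are read with get rather than with @ in the path), and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
doc = ET.fromstring(
    "<html><body>"
    "<div class='header'>Title</div>"
    "<div>Body</div>"
    "<span class='namefield'>Alex</span>"
    "</body></html>"
)
body = doc.find("body")

# Index selector, compare /div[2]: a single element.
second_div = body.find("div[2]").text

# Attribute selector, compare /div[@class]: every div that has a class.
with_class = [d.text for d in body.findall("div[@class]")]

# Attribute value selector plus getter, compare /div[@class=header] and /div@class.
header_class = body.find("div[@class='header']").get("class")

# Descendants selector, compare //span: a list of all matching elements.
names = [s.text for s in doc.findall(".//span")]
```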

Regex¶

class spidy.language.regex_node.RegexNode(context)¶

The regex operator extracts sub-strings from strings using capturing groups. If two or more capturing groups are specified, the operator returns a list of captured values.

Note: if the regex is matched against a list, it is matched against every list item in the same way as against a single value. In either case, the left and right operands must ultimately be strings; otherwise an evaluation exception is raised.

The regex operator finds all matches in the input string rather than stopping at the first one. If two or more matches are found, the operator returns a list of matches. The size of the resulting list is the number of matches multiplied by the number of capturing groups.

If capturing groups are omitted, operator returns the whole match - same way standard regex does.

Example:

r = 'hello, John!' % 'hello, ([a-zA-Z]+)!'

will result in: 'John'.

Or for list of strings:

r = ['hello, John!', 'hello, Kate!'] % 'hello, ([a-zA-Z]+)!'

returns the following list: ['John', 'Kate']
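
The rules above (all matches, capturing groups flattened into one list, lists matched item by item) can be sketched with Python's re module. This is a minimal sketch rather than Spidy's implementation: the regex_op function is hypothetical, and returning an empty list when nothing matches is an assumption, since that case is not described here.

```python
import re


def regex_op(value, pattern):
    """Rough Python analog of Spidy's `value % pattern` (hypothetical)."""
    if isinstance(value, list):
        # A list is matched item by item; results are collected in one list.
        results = []
        for item in value:
            found = regex_op(item, pattern)
            results.extend(found if isinstance(found, list) else [found])
        return results
    flat = []
    for match in re.finditer(pattern, value):
        groups = match.groups()
        # Without capturing groups the whole match is used.
        flat.extend(groups if groups else [match.group(0)])
    # A single captured value is returned as a plain string.
    # Assumption: no match at all yields an empty list.
    return flat[0] if len(flat) == 1 else flat


single = regex_op("hello, John!", "hello, ([a-zA-Z]+)!")
from_list = regex_op(["hello, John!", "hello, Kate!"], "hello, ([a-zA-Z]+)!")
pairs = regex_op("a=1, b=2", r"(\w)=(\d)")
```

In the last call there are two matches with two capturing groups each, so the result has four items: 'a', '1', 'b', '2'.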

Comments¶

//:

// initialize
log 'warming up...'
host = 'http://somehost.org'

Grammar Definition¶

Spidy grammar in slightly modified Extended Backus–Naur Form. Primitive types, e.g.: string, bool are omitted:

script           =  statement*

statement        = for_stmnt
                 | traverse_stmnt
                 | while_stmnt
                 | break_stmnt
                 | continue_stmnt
                 | if_stmnt
                 | else_stmnt
                 | get_stmnt
                 | skip_stmnt
                 | return_stmnt
                 | merge_stmnt
                 | log_stmnt
                 | expression

for_stmnt        =  "for" identity "in" list ":"

break_stmnt      =  "break"

continue_stmnt   =  "continue"

while_stmnt      =  "while" bool ":"

traverse_stmnt   =  "traverse" identity "in" path [mode [depth]] ":"

path             =  & | &string

mode             =  breadthfirst | depthfirst

depth            =  number

if_stmnt         =  "if" bool ":"

else_stmnt       =  "else:"

get_stmnt        =  "get" string[":"]
                       [header_stmnt]

header_stmnt     =  string ":" string

skip_stmnt       =  "skip" path [direction]

direction        =  forward | reverse

return_stmnt     =  "return" [expression]

merge_stmnt      =  "merge" template_string "as" identity

log_stmnt        =  "log" string

expression       =  value
                 | "(" ")" | "(" e ")" |   "-"   e |   "+"  e
                 | e "+" e | e "-"  e  | e "*"   e | e "/"  e
                 | e "<" e | e "<=" e  | e "=="  e | e "!=" e
                 | e ">" e | e ">=" e
                 | "not" e | e "or" e  | e "and" e | e "in" e
                 |  path   |  "#" e    | "$" e     | e "%"  e
                 | indexer
                 |(indexer | identity) = e
                 |(indexer | list) << e
                 |(indexer | list) >> (indexer | identity)

e                =  expression

value            =  none | number | string | bool | list

list             =  "[" [e ("," e)*] "]"

indexer          =  e "[" e "]"
