This document describes Spidy language syntax, statements, and operators.
The get statement loads the document specified by a URL string, either from the Web or from the local file system. The loaded document becomes the main data source for the whole execution context and for the &, skip, and traverse operators and statements:
get 'www.someresource.com/document.html'
To specify the expected document format, use the as operator:
get 'www.someresource.com/index' as html
To specify headers, end the statement with : and add the headers as an indented block:
get 'www.someresource.com/index' as html:
    User-agent: 'Firefox/25.0'
When loading from a Web resource, the default headers are:
Accept : 'text/html,application/xhtml+xml,application/xml',
Accept-Language : 'en-US,en;q=0.5',
Connection : 'keep-alive',
Cache-Control : 'max-age=0'
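How custom headers might combine with these defaults can be pictured with a short Python sketch. It assumes, without confirmation from Spidy itself, that a header given in the indented block replaces the default of the same name and that the remaining defaults are kept. `DEFAULT_HEADERS` and `build_headers` are hypothetical names used only for illustration.

```python
# Default headers listed above for Web resources.
DEFAULT_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml",
    "Accept-Language": "en-US,en;q=0.5",
    "Connection": "keep-alive",
    "Cache-Control": "max-age=0",
}


def build_headers(custom=None):
    """Return the defaults overlaid with custom headers (assumed behavior)."""
    headers = dict(DEFAULT_HEADERS)
    for name, value in (custom or {}).items():
        # HTTP header names are case-insensitive, so drop any default
        # that differs from the custom name only by letter case.
        for existing in list(headers):
            if existing.lower() == name.lower():
                del headers[existing]
        headers[name] = value
    return headers


headers = build_headers({"User-agent": "Firefox/25.0"})
```

Here `headers` holds the four defaults plus the custom `User-agent` entry.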
The skip statement moves the document path pointer to the specified element in the document tree. The target element is specified using an XPath expression.
Optionally, the skip direction can be specified as forward (the default) or reverse.
Note
All selectors are still applied when in reverse mode.
Note
The skip command's XPath expression always evaluates to a single element, even if no index selector is specified.
Example:
skip &'//div[@id=data_container]'
The traverse statement walks the document tree from the specified element or from the document root; the sub-branch is specified by the path operator.
Note
Internally, setting the sub-branch to a certain element is done by the skip command. Therefore, when it fails, you may sometimes see SkipNode messages in the log.
On each iteration, the loop variable is set to the element's absolute (document-wide) path string, which can be used in the path operator &, in skip, or in another traverse statement.
Optionally, a traverse method can be specified: breadthfirst or depthfirst.
If a method is specified, the traversal depth can be set as well. The default depth is 1.
Example, default form:
traverse div in &'//div[@data_container]':
    if &div == '':
        break
Or, visit every element of the document to find pictures:
images = []
traverse element in & depthfirst 1000000:
    if 'img' in element:
        images << &(element + '@src')
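The difference between the two traverse methods and the effect of the depth limit can be illustrated in Python. This is a minimal sketch built on the standard `xml.etree.ElementTree` module, not Spidy's implementation. The `traverse` function, the path format (`/html/body[1]/div[2]`), and the reading of depth 1 as "direct children only" are assumptions made for the example.

```python
from collections import deque
import xml.etree.ElementTree as ET


def traverse(root, mode="breadthfirst", depth=1):
    """Yield (path, element) for descendants of root, at most depth levels down."""
    # Each entry is (element, absolute path string, level below root).
    pending = deque([(root, "/" + root.tag, 0)])
    while pending:
        # Breadth-first takes the oldest entry, depth-first the newest.
        if mode == "breadthfirst":
            element, path, level = pending.popleft()
        else:
            element, path, level = pending.pop()
        if level > 0:
            yield path, element
        if level == depth:
            continue
        children = []
        counts = {}
        for child in element:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            child_path = f"{path}/{child.tag}[{counts[child.tag]}]"
            children.append((child, child_path, level + 1))
        # Reverse for depth-first so the first child is popped first.
        pending.extend(children if mode == "breadthfirst" else reversed(children))


doc = ET.fromstring(
    "<html><body><div><img src='a.png'/></div><div><p>text</p></div></body></html>"
)
breadth = [path for path, _ in traverse(doc, "breadthfirst", depth=3)]
depth_first = [path for path, _ in traverse(doc, "depthfirst", depth=3)]
images = [el.get("src") for _, el in traverse(doc, "depthfirst", 1000000) if el.tag == "img"]
```

Breadth-first visits both `div` elements before the `img`, while depth-first reaches the `img` inside the first `div` before moving on to the second `div`.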
The for statement is a typical for...in loop that iterates through a collection of items.
For example:
for item in items:
    lst << item
The while statement is a typical while loop. It loops as long as the expression is True. If the expression evaluates to a list, it loops as long as the list has items (as in Python).
Example:
host = 'www.cars.com/search'
while lst:
    lst >> next
    get (host + next) as json
The break statement exits the enclosing for, while, or traverse loop.
The continue statement skips to the next iteration of the enclosing for, while, or traverse loop.
The if statement is a typical if...else construct that controls execution flow. The else branch is optional.
Example:
if loaded:
    str = str + $count
    result = str
else:
    result = 'failed :('
The merge statement merges the current execution context (all variables defined at that step) with the specified template and stores the result in a variable. The variable is defined if it does not already exist.
The statement uses standard Python string template syntax, e.g.:
Hello, ${name}!
With the variable name set to 'Alex', this results in:
Hello, Alex!
Example:
merge 'master_page.html' as page
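Since the template syntax is Python's own, the substitution step can be reproduced with the standard `string.Template` class. The sketch below uses an inline template string and a plain dictionary in place of a template file and the Spidy execution context; both are stand-ins chosen for the example.

```python
from string import Template

# Stand-in for the variables defined in the execution context.
context = {"name": "Alex", "count": 3}

# Placeholders use the ${identifier} form; unused variables are ignored.
template = Template("Hello, ${name}!")
page = template.substitute(context)
print(page)  # Hello, Alex!
```

`substitute` raises `KeyError` for a placeholder with no matching variable, whereas `safe_substitute` leaves it in place; which of the two behaviors Spidy follows is not stated here.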
Spidy syntax supports all basic arithmetic, logical, and comparison operators, as well as some custom ones such as the XPath operator & and the regex operator %. The table below describes operator precedence, where a higher number means the operator binds more tightly:
Operator | Precedence |
---|---|
x[index] | 13 |
>>, <<, & | 12 |
-x, +x, $, #, % | 11 |
/ | 10 |
* | 9 |
- | 8 |
+ | 7 |
==, !=, <, >, <=, >= | 6 |
in | 5 |
not | 4 |
and | 3 |
or | 2 |
= | 1 |
The in operator is applicable to lists and strings.
Note
When used with strings, in always compares them in lower case, so the test is case-insensitive.
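The lower-case rule can be expressed in a few lines of Python. This is a sketch under assumptions: `spidy_in` is a hypothetical helper, and the treatment of lists (ordinary equality, with no case folding) is a guess, since only the string case is described above.

```python
def spidy_in(needle, haystack):
    """Approximate the in operator for string and list operands."""
    if isinstance(needle, str) and isinstance(haystack, str):
        # Both strings are lower-cased before the substring test.
        return needle.lower() in haystack.lower()
    # Assumption: list membership uses ordinary equality.
    return needle in haystack


found = spidy_in("IMG", "/html/body/img[1]")
```

With this rule, `'IMG' in '/html/body/img[1]'` is true even though the letter case differs.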
Instantiating:
list = []
list = [1,2,3]
Indexer:
list = [1,2,3]
second = list[1]
Pop >>:
Pops an item from the list. Optionally, pops the item into the specified variable. If a list index is specified, removes the item at that index from the list.
Example, pop from list:
lst >>
Pop to variable:
lst >> current
Remove first item from list:
lst[0] >> first
Push <<:
Appends an item to the list. If a list index is specified, inserts the item at that position instead.
Example, push to list:
lst << 5
Insert at certain index:
lst[5] << 5
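For readers who know Python, the list operators map roughly onto the built-in list methods shown below. The mapping is an approximation: in particular, it assumes that a plain pop takes the last item, which the description above does not state.

```python
lst = [1, 2, 3]

# lst >>            pop an item and discard it (assumed: the last one)
lst.pop()

# lst >> current    pop an item into a variable
current = lst.pop()

# lst[0] >> first   remove the item at index 0 and keep it
first = lst.pop(0)

# lst << 5          append to the end of the list
lst.append(5)

# lst[0] << 7       insert at the given index
lst.insert(0, 7)
```

After these steps `current` is 2, `first` is 1, and `lst` is `[7, 5]`.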
To string $:
number = 1010101
string = $number
To number #:
string = '1010101'
number = #string
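In Python terms the two conversion operators behave much like `str` and a numeric constructor. The helpers below are hypothetical, and the choice to fall back to a float when the text is not an integer is an assumption; how Spidy handles non-numeric input is not described here.

```python
def to_string(value):
    """Rough equivalent of the $ operator."""
    return str(value)


def to_number(text):
    """Rough equivalent of the # operator: an int if possible, else a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


string = to_string(1010101)
number = to_number("1010101")
```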
The path operator & interprets the input string as an XPath expression and evaluates it against the current document. If an index selector is specified in the XPath expression, it returns a single value; otherwise, it returns a list.
Note
The document must be loaded using the get command before the & operator is used.
If used without a path (a bare &), the operator returns the raw contents of the document.
If the XPath expression can't be resolved, the operator returns either an empty string or an empty list.
Example:
get 'http://myprofile/main.html'
name = &'//span[@class=namefield][1]'
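The single-value versus list rule can be mimicked with Python's standard `xml.etree.ElementTree` module. This is only a sketch: `evaluate` is a hypothetical helper, it handles `//`-style paths only, it treats a trailing `[n]` as a document-wide position, and ElementTree requires quoted attribute values, unlike the Spidy example above.

```python
import re
import xml.etree.ElementTree as ET


def evaluate(root, path):
    """Return one text value if path ends in [n], otherwise a list of values."""
    indexed = re.fullmatch(r"(.*)\[(\d+)\]", path)
    if indexed:
        path, position = indexed.group(1), int(indexed.group(2))
    # ElementTree needs a relative path, so "//x" becomes ".//x".
    matches = root.findall("." + path)
    values = [element.text or "" for element in matches]
    if not indexed:
        return values
    # XPath positions are 1-based; an unresolved path yields an empty string.
    return values[position - 1] if 0 < position <= len(values) else ""


doc = ET.fromstring(
    "<html><body>"
    "<span class='namefield'>Alex</span>"
    "<span class='namefield'>Kate</span>"
    "</body></html>"
)
name = evaluate(doc, "//span[@class='namefield'][1]")
names = evaluate(doc, "//span[@class='namefield']")
missing = evaluate(doc, "//span[@class='other']")
```

Here `name` is a single string, `names` is a list with one entry per matching element, and `missing` is an empty list.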
Supports the following selectors:
The regex operator % extracts substrings from strings using capturing groups. If two or more capturing groups are specified, the operator returns a list of captured values.
Note that if the regex is matched against a list, it is matched against every list item the same way it is matched against a single value. Ultimately, both the left and right operands must be strings; otherwise, an evaluation exception is raised.
The regex operator finds all matches in the input string rather than stopping at the first one. If two or more matches are found, the operator returns a list of matches. The size of the resulting list is the number of matches multiplied by the number of capturing groups.
If capturing groups are omitted, the operator returns the whole match, the same way a standard regex does.
Example:
r = 'hello, John!' % 'hello, ([a-zA-Z]+)!'
will result in: 'John'.
Or for list of strings:
r = ['hello, John!', 'hello, Kate!'] % 'hello, ([a-zA-Z]+)!'
returns the following list: ['John', 'Kate']
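The rules above can be approximated with Python's `re` module. The sketch below is an interpretation rather than Spidy's code: `captures` and `regex_op` are hypothetical names, and returning an empty list when nothing matches is an assumption.

```python
import re


def captures(text, pattern):
    """Return every captured value from every match of pattern in text."""
    if not isinstance(text, str) or not isinstance(pattern, str):
        raise TypeError("regex operands must be strings")
    values = []
    for match in re.finditer(pattern, text):
        # Without capturing groups, the whole match is the value.
        values.extend(match.groups() if match.re.groups else [match.group(0)])
    return values


def regex_op(subject, pattern):
    """Sketch of the % operator for a string or a list of strings."""
    if isinstance(subject, list):
        # Each item is matched like a single value; results are concatenated.
        return [value for item in subject for value in captures(item, pattern)]
    values = captures(subject, pattern)
    # Assumption: exactly one value is returned bare rather than as a list.
    return values[0] if len(values) == 1 else values


single = regex_op("hello, John!", "hello, ([a-zA-Z]+)!")
several = regex_op(["hello, John!", "hello, Kate!"], "hello, ([a-zA-Z]+)!")
pairs = regex_op("a1 b2", r"([a-z])(\d)")
```

In the last call, two matches with two capturing groups each give a flat list of four values, in line with the size rule stated above.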
The Spidy grammar is given below in a slightly modified Extended Backus–Naur Form. Primitive types such as string and bool are omitted:
script = statement*
statement = for_stmnt
          | traverse_stmnt
          | while_stmnt
          | break_stmnt
          | continue_stmnt
          | if_stmnt
          | else_stmnt
          | get_stmnt
          | skip_stmnt
          | return_stmnt
          | merge_stmnt
          | log_stmnt
          | expression
for_stmnt = "for" identity "in" list ":"
break_stmnt = "break"
continue_stmnt = "continue"
while_stmnt = "while" bool ":"
traverse_stmnt = "traverse" identity "in" path [mode depth] ":"
path = & | &string
mode = breadthfirst | depthfirst
depth = number
if_stmnt = "if" bool ":"
else_stmnt = "else:"
get_stmnt = "get" string [":"]
            [header_stmnt]
header_stmnt = string ":" string
skip_stmnt = "skip" path [direction]
direction = forward | reverse
return_stmnt = "return" [expression]
merge_stmnt = "merge" template_string "as" identity
log_stmnt = "log" string
expression = value
           | "(" ")" | "(" e ")" | "-" e | "+" e
           | e "+" e | e "-" e | e "*" e | e "/" e
           | e "<" e | e "<=" e | e "==" e | e "!=" e
           | e ">" e | e ">=" e
           | "not" e | e "or" e | e "and" e | e "in" e
           | path | "#" e | "$" e | e "%" e
           | indexer
           | (indexer | identity) = e
           | (indexer | list) << e
           | (indexer | list) >> (indexer | identity)
e = expression
value = none | number | string | bool | list
list = "[" [e ("," e)*] "]"
indexer = e "[" e "]"