Writing Scripts

It’s time to write our first serious script. A Spidy script is a file with the *.sp extension. Yes, Spidy claims the *.sp extension for its script files, and *.spt for the templates used by the scripts. First, let’s briefly go over the fundamentals.

Typical Script

A typical script consists of the following steps, sketched briefly below:

  • loading a document and checking the result
  • skipping to the branch where the data is located (optional)
  • applying selectors or traversing the document to collect data
  • preparing and returning results
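
To make this concrete, here is a skeletal script following those steps. This is a sketch only: the URL and XPaths are placeholders, the optional skip step is omitted, and the header block simply mirrors the one in the news example below.

result = ''

// load the document and check the result
get 'http://example.com/index.html' as html:
    User-Agent:'Mozilla/5.0'

if & != None:
    // traverse a container and collect data
    traverse item in &'html/body/div':
        result = result + &(item + '/a[1]') + '\n'

// prepare and return the results
return result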

Documents and Formats

Loading the document is an important first step; if it fails, the whole script may be aborted. For this, Spidy provides the get command: it sends a plain HTTP GET request and returns the response. If the command succeeds, the document becomes available for scraping.
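
In its bare form it looks like this (a sketch; the URL is a placeholder and the header syntax mirrors the news example further down):

get 'http://example.com/news.html' as html:
    User-Agent:'Mozilla/5.0'

On success, the loaded document becomes available through the & symbol, as the if & != None check in the news example below shows.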

By default, Spidy sniffs the document format from the URL extension. If the URL doesn’t have one, the format should be specified explicitly using the as operator (see the example after this list). At the moment the following formats are supported:

  • HTML
  • XML
  • TXT
  • JSON
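
For example, a URL without an extension needs the format named explicitly. The endpoint here is hypothetical, and the lowercase json keyword is my assumption, mirroring the as html form used in the news example:

// no extension in the URL, so declare the format explicitly
get 'http://example.com/api/stories' as json:
    User-Agent:'Mozilla/5.0'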

Note

When loading a document as text (TXT), the path, skip and traverse statements will not work, since the response is treated as an unstructured document.
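
A plain-text resource can still be fetched and handled as a raw string. A sketch, assuming the URL is a placeholder and that & can be returned directly as the response text when loading as txt:

get 'http://example.com/robots.txt' as txt:
    User-Agent:'Mozilla/5.0'

// no path, skip or traverse here - the response is just text
if & != None:
    return &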

News Example

Alright, the time has come to scrape a news Web site! There are many potential targets, but let’s settle on http://news.google.com/. The goal is to get all top story titles, each with the original link to the article.

This time we will use the run command, which accepts a script name and an output file name; whatever the script returns is written to the output file:

import spidy as ss
ss.run('examples/news.sp', 'examples/news.html')

First of all, we get the XPaths for our data using the procedure described in Gearing Up for Scraping:

  • news container : html/body/div[3]/div/div/div/div[3]/div/div/table/tbody/tr/td/div/div/div/div[2]
  • article element : news container + /div/div/div/div[2]/table/tbody/tr/td[2]/div/h2/a[1]
  • article title : article element + /span[1]
  • article link : article element + @href

We could certainly prepare shorter selectors, but let’s simply use what Firebug gives us for now. Next, the script itself:

1
2     // cookies - off
3     // javascript - off
4
5     markup = '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body><div>\n'
6
7     get 'http://news.google.com/news?pz=1&cf=all&ned=en_us&hl=en&q&js=0' as html:
8         User-Agent:'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:27.0) Gecko/20100101 Firefox/27.0'
9
10    if & != None:
11        traverse div in &'html/body/div[3]/div/div/div/div[3]/div/div/table/tbody/tr/td/div/div/div/div[2]':
12
13            if &(div + '@class') == 'esc-separator':
14                continue
15
16            anchor_path  = div + '/div/div/div/div[2]/table/tbody/tr/td[2]/div/h2/a[1]'
17            anchor_ref   = &(anchor_path+'@href')
18            anchor_title = &(anchor_path+'/span[1]')
19            anchor       = ('\t<a href="' + anchor_ref + '">' + anchor_title + '</a><br/>\n')
20
21            markup = markup + anchor
22    else:
23        markup = (markup + '\t<p>failed :(</p>\n')
24
25    markup = (markup + '</div></body></html>')
26    return markup

Line 5 - initialize the output HTML document in the string variable markup, so that the script returns a complete HTML page. Article links will be appended to this variable.

Lines 7, 8 - we load the GNews main page with the same User-Agent header my browser sends. Your browser, and therefore the header value, might differ.

Line 10 - the & symbol refers to the document loaded by get, so this check verifies that the page actually loaded before we start scraping.

Line 11 - traverse helps to walk structured documents. In its basic form it accepts a path to a container and enumerates the XPaths of its direct children through the div variable. See more about it in Syntax Reference; a stripped-down example follows these notes.

Lines 13, 14 - a workaround for non-article elements inside the main container.
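
As promised, here is a stripped-down traverse. The container path is a placeholder; every construct mirrors the news script above:

titles = ''
traverse row in &'html/body/table/tbody/tr':
    // row holds the XPath of one direct child of the container
    titles = titles + &(row + '/td[1]') + '\n'
return titles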

Alright, that’s all there is to it - hit Run and look for the examples/news.html file. Actually, we can make the script cleaner - see Cleaner Scripts with Templates to learn how.
