.. _scripting:

===============
Writing Scripts
===============

It's time to write the first serious script. A Spidy script is a file with the \*.sp extension.
Yes, Spidy claims the \*.sp extension for its script files and \*.spt for the templates used by the scripts.
First, let's briefly cover the fundamentals.

Typical Script
==============

A typical script consists of the following snippets:

- loading the document and checking the results
- skipping to the branch where the data is located (optional)
- applying selectors or traversing the document to collect data
- preparing and returning the results

Documents and Formats
=====================

Loading the document is an important first step; if it fails, the whole script may be aborted.
To load a document, Spidy has a very cool command: ``get``. It sends a simple Web GET request and returns the response.
If the command succeeds, the document becomes available for scraping.

By default, Spidy sniffs the document format from the URL's extension. If the URL doesn't have one, the format
should be specified explicitly using the ``as`` operator. At the moment the following formats are supported:

- HTML
- XML
- TXT
- JSON

.. note:: When loading a document as text (TXT), the ``path``, ``skip`` and ``traverse`` statements
    will not work, since the response is treated as an unstructured document.

News Example
============

Alright, the time has come to scrape a news Web site! There are many potential targets, but let's settle on
http://news.google.com/. The goal is to get the titles of all top stories, each with the original link to its article.

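
Before turning to the Spidy script, it may help to picture the goal in plain Python. The following is only a
rough sketch of the four typical steps (load, skip to branch, select, return results); it is not Spidy code and
not how Spidy works internally. It assumes a tiny, well-formed page held in a string, and the names
``SAMPLE_PAGE`` and ``collect_stories`` as well as the example.com links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a news page (not real Google News markup).
SAMPLE_PAGE = """
<html>
  <body>
    <div id="stories">
      <div><h2><a href="http://example.com/a"><span>First story</span></a></h2></div>
      <div><h2><a href="http://example.com/b"><span>Second story</span></a></h2></div>
    </div>
  </body>
</html>
"""


def collect_stories(page_source):
    """Return a list of (title, link) pairs found in page_source."""
    # 1. Load the document and check the result.
    try:
        root = ET.fromstring(page_source)
    except ET.ParseError:
        return []

    # 2. Skip to the branch where the data is located.
    container = root.find("./body/div")
    if container is None:
        return []

    # 3. Apply selectors to collect the data.
    stories = []
    for anchor in container.findall("./div/h2/a"):
        title = anchor.find("./span")
        if title is not None and title.text:
            stories.append((title.text.strip(), anchor.get("href")))

    # 4. Prepare and return the results.
    return stories


if __name__ == "__main__":
    for title, link in collect_stories(SAMPLE_PAGE):
        print(title, link)
```

Real pages are rarely well-formed XML, so this stdlib parser would reject most of them; the sketch is only meant
to show the shape of the task: one selector for the container, one for each article, and then the title and the
link taken relative to the article element.
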
This time we will use the ``run`` command, which accepts a script name and an output file name, like this::

    import spidy as ss
    ss.run('examples/news.sp', 'examples/news.html')

First of all, we get the XPaths for our data using the procedure described in :ref:`guidelines`:

- news container : ``html/body/div[3]/div/div/div/div[3]/div/div/table/tbody/tr/td/div/div/div/div[2]``
- article element : news container + ``/div/div/div/div[2]/table/tbody/tr/td[2]/div/h2/a[1]``
- article title : article element + ``/span[1]``
- article link : article element + ``@href``

We could definitely prepare shorter selectors, but let's simply use what FireBug gives us for now.
Next, the script itself::

     1
     2 // cookies - off
     3 // javascript - off
     4
     5 markup = '
    failed :(
    \n')
    24
    25 markup = (markup + '