Developers Guide

Spiders in HEPcrawl

Here is an introduction to spiders: http://doc.scrapy.org/en/latest/topics/spiders.html

See also the official spider tutorial for Scrapy.

Spiders are classes which inherit from the Scrapy Spider classes and contain the main logic for retrieving content from the source and extracting metadata from the source records. All spiders are located under the spiders/ folder and follow the naming standard mysource_spider.py.

Traditionally, we receive metadata in XML format, so our spiders usually inherit from a special XML-parsing spider from Scrapy called XMLFeedSpider.

from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):

    name = "myspider"
    itertag = 'article'  # XML tag to iterate over within each XML file

    def start_requests(self):
        # Retrieval of all data from the source
        pass

    def parse_node(self, response, node):
        # Extraction of metadata from the XML, one record per node
        pass

When you create a new spider, you need to implement at least two methods:

  • start_requests: get the content from the source
  • parse_node: extract the metadata from the downloaded content into a record

Getting data with start_requests

start_requests handles the retrieval of data from the source and yields each record file (in this case an XML file). You can even chain these requests to do things like following links.

For example, in the World Scientific use case, we need to:

  1. Connect to an FTP server and get ZIP files
  2. For each ZIP file, extract its contents and parse every XML file

So our start_requests function first checks a remote FTP server for newly added ZIP files and yields the full FTP path of each ZIP file to Scrapy. Scrapy knows how to download things from an FTP server and fetches each ZIP file. While yielding we also tell Scrapy to call another function (handle_package) for each ZIP file. This is called a callback function, and it is what takes care of step 2 above.
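
A minimal sketch of what such a start_requests could look like is shown below. The FTP host, the credentials and the list_new_packages helper are illustrative placeholders, not the actual World Scientific spider code:

import scrapy

def start_requests(self):
    """Check the FTP server for new ZIP packages and request each one."""
    # `list_new_packages` is a hypothetical helper returning the FTP paths
    # of ZIP files that have not been harvested yet.
    for zip_path in list_new_packages():
        yield scrapy.Request(
            url="ftp://example.com/{0}".format(zip_path),
            # Scrapy's FTP download handler reads credentials from these meta keys
            meta={"ftp_user": "anonymous", "ftp_password": ""},
            callback=self.handle_package,  # step 2: unzip and parse
        )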

This function then extracts all XML files from the ZIP files and finally yields each XML file (without any further callbacks) to Scrapy, which then calls parse_node, and the extraction of metadata from a single XML file can finally begin.
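
In the same spirit, and again with a hypothetical unzip_files helper rather than the real HEPcrawl code, handle_package could look roughly like this:

import scrapy

def handle_package(self, response):
    """Extract the XML files from a downloaded ZIP package."""
    # `unzip_files` is a hypothetical helper that writes the downloaded
    # archive to disk, unpacks it and returns the extracted XML file paths.
    for xml_file in unzip_files(response.body):
        # No callback is given, so the request falls back to the default
        # XMLFeedSpider parsing, which calls parse_node for every itertag.
        yield scrapy.Request("file://{0}".format(xml_file))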

Creating records in parse_node

parse_node handles the extraction of records for an XMLFeedSpider into so-called items. An item is basically an intermediate record object that the data from the XML is put into.

The spider iterates over the XML tag specified in itertag and calls parse_node once per occurrence, which means that multiple records inside a single XML file are supported.
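
For example, with itertag = 'article', a file shaped like the following (purely illustrative) would result in two calls to parse_node:

<articles>
  <article>...</article>
  <article>...</article>
</articles>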

The goal of the parse_node function is to generate a HEPRecord, which in the Scrapy world is called an item. This is defined in items.py and is an intermediate format for the record metadata extracted from every source. It tries to resemble the HEP JSON Schema as closely as feasible, with some exceptions.

import scrapy

class HEPRecord(scrapy.Item):

    title = scrapy.Field()
    abstract = scrapy.Field()
    page_nr = scrapy.Field()
    journal_artid = scrapy.Field()
    # etc.

To do the extraction, you are given a node object, which is a selector on the XML record. You can now use XPath (and even CSS) expressions to extract content directly into the item, via some helper functions:

def parse_node(self, response, node):
    """Parse an XML file into a HEP record."""

    # For simplicity, remove all namespaces (optional)
    node.remove_namespaces()

    # Create a HEPRecord object with a special loader (more on this later)
    record = HEPLoader(item=HEPRecord(), selector=node, response=response)

    record.add_xpath('page_nr', "//counts/page-count/@count")
    record.add_xpath('abstract', '//abstract[1]')
    record.add_xpath('title', '//article-title/text()')

    return record.load_item()

Here you see that you can assign a value to the HEPRecord directly via the add_xpath function, but you are not forced to do so:

fpage = node.xpath('.//fpage/text()').extract()
lpage = node.xpath('.//lpage/text()').extract()
if fpage:
    record.add_value('journal_fpage', fpage)
if lpage:
    record.add_value('journal_lpage', lpage)

NOTE: The value added when using add_xpath usually resolves to a Python list of values, so remember that you need to deal with lists.

Using add_value you can add any value you want to a field, which is useful when you need to apply some extra logic first.

Re-using common metadata handling with item loaders

Since INSPIRE has multiple sources of content, we need multiple spiders that retrieve and extract data differently. However, the intermediate HEPRecord is the common output of all sources.

This means that any additional metadata handling, such as converting journal titles or author names to the correct format, can be done in a single place. This is managed in the HEPLoader item loader located in loaders.py.

The loader defines the input and output processors for the HEPRecord item. The input processor processes the extracted data as soon as it is received (through the add_xpath(), add_css() or add_value() methods). The output processor takes the data processed by the input processors and assigns it to the corresponding field in the item.

For example, a HEPRecord has a field called abstract for the abstract. We want to take the incoming abstract string and convert some HTML tags to their LaTeX counterparts. First we define our input processor inside inputs.py:

import re

def convert_html_subscripts_to_latex(text):
    """Convert some HTML tags to their LaTeX equivalents."""
    text = re.sub("<sub>(.*?)</sub>", r"$_{\1}$", text)
    text = re.sub("<sup>(.*?)</sup>", r"$^{\1}$", text)
    return text
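
As an illustration (this example is not taken from the HEPcrawl sources), the function turns HTML sub- and superscripts into their LaTeX forms:

convert_html_subscripts_to_latex("H<sub>2</sub>O and E=mc<sup>2</sup>")
# returns 'H$_{2}$O and E=mc$^{2}$'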

Then we add our input processor to the HEPLoader:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose

from .inputs import convert_html_subscripts_to_latex

class HEPLoader(ItemLoader):

    abstract_in = MapCompose(
        convert_html_subscripts_to_latex,
        unicode.strip,
    )

To automatically link an input processor to the correct item field, we name it after the field with the suffix _in added. We then use a special processor called MapCompose, which takes functions as arguments and applies them, in order, to each value of the field.

record.add_xpath('abstract', '//abstract[1]')

will add a list with one item:

[".. some abstract from the XML .."]

Input processors like convert_html_subscripts_to_latex are then called with ".. some abstract from the XML .." (per value, not with the whole list).

We can also define output processors to control how the values are assigned to the fields of the item. For example, instead of assigning the whole list, we can assign only the first value in it:

from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

from .inputs import convert_html_subscripts_to_latex

class HEPLoader(ItemLoader):

    abstract_in = MapCompose(
        convert_html_subscripts_to_latex,
        unicode.strip,
    )
    abstract_out = TakeFirst()

Take a look at the Scrapy item loader documentation (http://doc.scrapy.org/en/latest/topics/loaders.html) for some useful concepts when dealing with item loaders.

Exporting the final record with item pipelines

Finally, the data in the items is exported to INSPIRE via special item pipelines.

These classes are located in pipelines.py; they export harvested records to JSON files and push them to INSPIRE-HEP.
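
The general shape follows Scrapy's standard item pipeline interface. The class below is an illustrative sketch of a JSON-writing pipeline, not the actual HEPcrawl implementation:

import json

class JsonWriterPipeline(object):
    """Illustrative pipeline that writes every harvested record to a JSON Lines file."""

    def open_spider(self, spider):
        self.file = open('{0}_records.jsonl'.format(spider.name), 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) turns the HEPRecord item into a plain dictionary.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item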

This documentation is still a work in progress.