..
    This file is part of hepcrawl.
    Copyright (C) 2015, 2016 CERN.

    hepcrawl is free software; you can redistribute it and/or modify it
    under the terms of the Revised BSD License; see LICENSE file for
    more details.

.. currentmodule:: hepcrawl

Developers Guide
================

Spiders in HEPcrawl
-------------------

Here is an introduction to spiders: http://doc.scrapy.org/en/latest/topics/spiders.html
See also the official Scrapy spider `tutorial`_.
Spiders are classes which inherit from Scrapy's spider classes and contain the
main logic for retrieving content from the source and extracting metadata from
the source records. All spiders are located under the ``spiders/`` folder and
follow the naming standard ``mysource_spider.py``.

Traditionally, we receive metadata in XML format, so our spiders usually
inherit from a special XML parsing spider from Scrapy called
``XMLFeedSpider``.

.. code-block:: python

    from scrapy.spiders import XMLFeedSpider


    class MySpider(XMLFeedSpider):
        name = "myspider"
        itertag = 'article'  # XML tag to iterate over within each XML file

        def start_requests(self):
            # Retrieval of all data from the source
            pass

        def parse_node(self, response, node):
            # Extraction from the XML, called once per record
            pass

When you create a new spider, you need to implement at least two methods:

* ``start_requests``: get the content from the source
* ``parse_node``: extract the metadata from the downloaded content into a record

Getting data with ``start_requests``
------------------------------------

``start_requests`` handles the retrieval of data from the source and should
`yield`_ each record file (in this case, an XML file). You can even chain
these requests to do things like `following links`_.

For example, in the World Scientific use case, we need to:

1. Connect to an FTP server and get ZIP files
2. For each ZIP file, extract its contents and parse every XML file

So our ``start_requests`` function first checks a remote FTP server for newly
added ZIP files and yields the full FTP path of each ZIP file to Scrapy.
Scrapy knows how to download files from an FTP server and fetches each ZIP
file. While yielding, we also tell Scrapy to call another function
(``handle_package``) for each ZIP file. This is called a `callback`_ function,
and it is necessary in order to perform step 2 from above.

This function then extracts all XML files from the ZIP files and finally
yields each XML file (without any further callbacks) to Scrapy, which now
calls ``parse_node``, and the extraction of metadata from a single XML file
can finally begin.
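The unpacking in step 2 can be sketched with the standard library alone. This is a simplified stand-in, not hepcrawl's actual implementation: the function name ``extract_xml_files`` is hypothetical, and the Scrapy request plumbing that the real ``handle_package`` callback performs is left out.

.. code-block:: python

    import zipfile


    def extract_xml_files(zip_path, target_dir):
        """Extract all XML files from a ZIP package and return their names.

        Simplified, hypothetical sketch of the unpacking step; the real
        callback also yields a new Scrapy request per extracted file so
        that ``parse_node`` gets called for each of them.
        """
        extracted = []
        with zipfile.ZipFile(zip_path) as package:
            for name in package.namelist():
                if name.endswith(".xml"):
                    package.extract(name, target_dir)
                    extracted.append(name)
        return extracted
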

.. _callback: http://doc.scrapy.org/en/latest/topics/request-response.html?highlight=callback
.. _yield: http://anandology.com/python-practice-book/iterators.html#generators
.. _following links: http://doc.scrapy.org/en/latest/intro/tutorial.html#following-links
.. _tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html#our-first-spider

Creating records in ``parse_node``
----------------------------------

``parse_node`` handles the extraction of records from an ``XMLFeedSpider``
into so-called `items`_. An item is basically an intermediate record object
into which the data from the XML is put.

The function iterates over the XML tag specified in ``itertag``, which means
that it supports multiple records inside a single XML file.

The goal of the ``parse_node`` function is to generate a ``HEPRecord``, which
in the Scrapy world is called an item. It is defined in ``items.py`` and is an
**intermediate** format for the record metadata extracted from every source.
It tries to resemble the HEP JSON Schema as closely as feasible, with some
exceptions.

.. code-block:: python

    class HEPRecord(scrapy.Item):
        title = scrapy.Field()
        abstract = scrapy.Field()
        page_nr = scrapy.Field()
        journal_artid = scrapy.Field()
        # etc.

To do the extraction, you are given a ``node`` object, which is a `selector`_
on the XML record. You can now use XPath (and even CSS) expressions to extract
content directly into the item via some helper functions:

.. code-block:: python

    def parse_node(self, response, node):
        """Parse an XML file into a HEP record."""
        # For simplicity, remove all namespaces (optional)
        node.remove_namespaces()

        # Create a HEPRecord object with a special loader (more on this later)
        record = HEPLoader(item=HEPRecord(), selector=node, response=response)
        record.add_xpath('page_nr', "//counts/page-count/@count")
        record.add_xpath('abstract', '//abstract[1]')
        record.add_xpath('title', '//article-title/text()')
        return record.load_item()

Here you see that you can directly assign a value to the ``HEPRecord`` via
the ``add_xpath`` function, but you are not forced to do so:

.. code-block:: python

    fpage = node.xpath('.//fpage/text()').extract()
    lpage = node.xpath('.//lpage/text()').extract()
    if fpage:
        record.add_value('journal_fpage', fpage)
    if lpage:
        record.add_value('journal_lpage', lpage)

NOTE: The value added when using ``add_xpath`` usually resolves into a Python
list of values, so remember that you need to deal with lists.

Using ``add_value``, you can add the value you want to a field when you need
to do some extra logic.
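Since ``extract()`` always returns a list, a tiny helper makes the take-the-first-match pattern explicit. ``get_first`` is a hypothetical name used for illustration, not a hepcrawl utility:

.. code-block:: python

    def get_first(values, default=None):
        """Return the first extracted value, or a default for an empty list.

        Hypothetical helper (not part of hepcrawl) illustrating how to
        deal with the lists that node.xpath(...).extract() returns.
        """
        return values[0] if values else default


    fpage = get_first(["205"])       # XPath matched something: "205"
    lpage = get_first([], "")        # XPath matched nothing: ""
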

.. _items: http://doc.scrapy.org/en/latest/topics/items.html
.. _selector: http://doc.scrapy.org/en/latest/topics/selectors.html

Re-using common metadata handling using item loaders
----------------------------------------------------

Since INSPIRE has multiple sources of content, we will need to have multiple
spiders that retrieve and extract data differently. However, the intermediate
``HEPRecord`` is the common output of all sources.

This means that any additional metadata handling, such as converting journal
titles or author names to the correct format, can be done in one place only.
This is managed in the ``HEPLoader`` `item loader`_ located in ``loaders.py``.

The loader defines the `input and output processors`_ for the ``HEPRecord``
item. The input processor processes the extracted data as soon as it is
received (through the ``add_xpath()``, ``add_css()`` or ``add_value()``
methods). The output processor takes the data processed by the input
processors and assigns it to the field in the item.

For example, a ``HEPRecord`` has a field called ``abstract`` for the abstract.
We want to take the incoming abstract string and convert some HTML tags to
their LaTeX counterparts. First we define our input processor inside
``inputs.py``:

.. code-block:: python

    import re


    def convert_html_subscripts_to_latex(text):
        """Convert some HTML tags to LaTeX equivalents."""
        text = re.sub("<sub>(.*?)</sub>", r"$_{\1}$", text)
        text = re.sub("<sup>(.*?)</sup>", r"$^{\1}$", text)
        return text
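Assuming the regular expressions target the HTML ``<sub>`` and ``<sup>`` tags (which the LaTeX replacements suggest), here is a self-contained demonstration of the conversion:

.. code-block:: python

    import re


    def convert_html_subscripts_to_latex(text):
        """Convert <sub>/<sup> HTML tags to LaTeX equivalents."""
        text = re.sub("<sub>(.*?)</sub>", r"$_{\1}$", text)
        text = re.sub("<sup>(.*?)</sup>", r"$^{\1}$", text)
        return text


    print(convert_html_subscripts_to_latex("H<sub>2</sub>O at 10<sup>3</sup> K"))
    # → H$_{2}$O at 10$^{3}$ K
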

Then we add our input processor to the ``HEPLoader``:

.. code-block:: python

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose

    from .inputs import convert_html_subscripts_to_latex


    class HEPLoader(ItemLoader):
        abstract_in = MapCompose(
            convert_html_subscripts_to_latex,
            unicode.strip,
        )

To automatically link the input processors to the correct item field, we add
the suffix ``_in`` to the field name. Then we use a special processor called
``MapCompose``, which takes functions as parameters; each of them will be
called with each value in the field.

.. code-block:: python

    record.add_xpath('abstract', '//abstract[1]')

will add a list with one item:

.. code-block:: python

    [".. some abstract from the XML .."]

Input processors like ``convert_html_subscripts_to_latex`` are then called
with ``".. some abstract from the XML .."`` (per value, not the whole list).
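To make the per-value behaviour concrete, here is a simplified pure-Python stand-in for ``MapCompose`` (the real Scrapy processor additionally flattens iterable results; this sketch only shows the apply-each-function-to-each-value and drop-``None`` behaviour):

.. code-block:: python

    def map_compose(*functions):
        """Simplified stand-in for scrapy.loader.processors.MapCompose."""
        def process(values):
            for func in functions:
                # Apply the function to every value, dropping None results
                results = [func(v) for v in values]
                values = [r for r in results if r is not None]
            return values
        return process


    abstract_in = map_compose(str.strip, str.lower)
    print(abstract_in(["  Some Abstract From The XML  "]))
    # → ['some abstract from the xml']
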

.. _item loader: http://doc.scrapy.org/en/latest/topics/loaders.html
.. _input and output processors: http://doc.scrapy.org/en/latest/topics/loaders.html#declaring-input-and-output-processors

We can also define output processors to control how the values are assigned
to the fields in the items. For example, to assign only the first value in
the list instead of the whole list:

.. code-block:: python

    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import MapCompose, TakeFirst

    from .inputs import convert_html_subscripts_to_latex


    class HEPLoader(ItemLoader):
        abstract_in = MapCompose(
            convert_html_subscripts_to_latex,
            unicode.strip,
        )
        abstract_out = TakeFirst()
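``TakeFirst`` itself is tiny; a simplified stand-in shows what the output processor does, namely return the first value that is neither ``None`` nor an empty string:

.. code-block:: python

    def take_first(values):
        """Simplified stand-in for scrapy.loader.processors.TakeFirst."""
        for value in values:
            if value is not None and value != "":
                return value


    print(take_first(["", "first abstract", "second abstract"]))
    # → first abstract
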

Take a look `here`_ for some useful concepts when dealing with item loaders.

.. _here: http://doc.scrapy.org/en/latest/topics/loaders.html#reusing-and-extending-item-loaders

Exporting the final record with item pipelines
----------------------------------------------

Finally, the data in the items is exported to INSPIRE via special item
pipelines. These classes are located under ``pipelines.py``; they export
harvested records to JSON files and push them to INSPIRE-HEP.
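A Scrapy item pipeline implements ``process_item(item, spider)``. The following is a hypothetical, stdlib-only sketch of the JSON-serialization half of that job (the class name is invented, and the real hepcrawl pipelines also handle pushing records to INSPIRE-HEP):

.. code-block:: python

    import json


    class JsonExportPipeline:
        """Hypothetical minimal pipeline serializing items to JSON lines."""

        def __init__(self):
            self.lines = []

        def process_item(self, item, spider):
            # Scrapy items behave like dicts, so dict(item) is serializable
            self.lines.append(json.dumps(dict(item), sort_keys=True))
            return item


    pipeline = JsonExportPipeline()
    pipeline.process_item({"title": "A title", "abstract": "An abstract"}, spider=None)
    print(pipeline.lines[0])
    # → {"abstract": "An abstract", "title": "A title"}
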

**This documentation is still a work in progress.**