API

Call a crawler from Python.

hepcrawl.run(spider_name)[source]

Run a specific spider.
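
As a minimal sketch, the helper can be called like this from Python (the spider name "arXiv" is taken from the spider examples below; output and logging are controlled by the project's Scrapy settings):

import hepcrawl

# Run the spider registered under the name "arXiv".
# Output location depends on the project settings (e.g. JSON_OUTPUT_DIR).
hepcrawl.run("arXiv")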

Items

Item models for scraped HEP records.

See the Scrapy documentation about items: http://doc.scrapy.org/en/latest/topics/items.html

class hepcrawl.items.HEPRecord(*args, **kwargs)[source]

HEPRecord represents a generic HEP record based on HEP JSON schema.

This is not a 1-to-1 mapping to the HEP JSON schema.

This is a somewhat flatter structure that will be transformed before export to INSPIRE. For complex fields, such as authors, please refer to the HEP JSON schema for details.

abstract = None

Abstract of the record, e.g. ‘We study the dynamics of quantum...’.

abstracts = None

Final structure of abstract information. DO NOT ADD DATA TO THIS FIELD.

acquisition_source = None

Source of the record in the acquisition_source format.

additional_files = None

Files (fulltexts, package) belonging to this item.

[{
    "type": "Fulltext",  # Fulltext, Supplemental, Data, Figure
    "uri": "file:///path/to/file",  # can also be HTTP
}]
arxiv_eprints = None

ArXiv E-print information

{
    "value": "1506.00647",
    "categories": ['hep-ph', 'hep-lat', 'nucl-th']
}
authors = None

Special author format that transforms the incoming raw data into the correct form, for example by handling initials and full names.

List of authors of this form:

[{
    "surname": "Ellis",
    "given_names": "Richard John",
    "full_name": "", # if no surname/given_names
    "affiliations": [{
        value: "raw string", ..
    }]
}, ..]
classification_numbers = None

Classification numbers like PACS numbers.

[
    {
        'classification_number': 'FOO',
        'standard': 'PACS'
    }, ...
]
collaborations = None

A list of the record collaborations, if any.

[
    'Planck Collaboration'
]
collections = None

List of collections the article belongs to, e.g. [‘CORE’, ‘THESIS’].

copyright = None

Final structure for copyright information.

date_published = None

Date of publication in string format, e.g. ‘2016-01-14’.

dois = None

DOIs

[{
    'value': '10.1103/PhysRevD.93.016005'
}]
external_system_numbers = None

External System Numbers

[
    {
        "institute": "SPIRESTeX",
        "value": "Mayrhofer:2012zy"
    },
    {
        "institute": "arXiv",
        "value": "oai:arXiv.org:1211.6742"
    }
]
extra_data = None

Extra data belonging to this item that will NOT be part of the final record.

{
    "foo": "bar"
}
file_urls = None

List of files to be downloaded with FilesPipeline and added to files.

files = None

List of files downloaded by FilesPipeline.

free_keywords = None

Free keywords

[
    {
        'value': 'Physics',
        'source': 'author'
    }, ...
]
imprints = None

Structure for imprint information.

journal_doctype = None

Special type of publication. E.g. “Erratum”, “Addendum”.

page_nr = None

Page number as string. E.g. ‘2’.

preprint_date = None

Date of preprint release.

pubinfo_freetext = None

Raw journal reference string.

public_notes = None

Notes

[
    {
        "source": "arXiv",
        "value": "46 pages, 3 figures; v2 typos corrected, citations added"
    }
]
publication_info = None

Structured publication information.

references = None

List of references in the following form:

[{
    'recid': '',
    'texkey': '',
    'doi': '',
    'collaboration': [],
    'editors': [],
    'authors': [],
    'misc': [],
    'number': 0,
    'isbn': '',
    'publisher': [],
    'maintitle': '',
    'report_number': [],
    'title': [],
    'url': [],
    'journal_pubnote': [".*,.*,.*(,.*)?"],
    'raw_reference': [],
    'year': 2016,
}, ..]
related_article_doi = None

DOI of Addendum/Erratum

[{
    'value': '10.1103/PhysRevD.93.016005'
}]
report_numbers = None

Structure for report_numbers, e.g. [‘CERN-001’, ‘DESY-002’].

source = None

Source of the record, e.g. ‘World Scientific’. Used across many fields.

subtitle = None

Sub-title of the record, e.g. ‘A treatise on the universe’.

thesis = None

Thesis information

[{
    'date': '',
    'defense_date': '',
    'institutions': [],
    'degree_type': '',
}]
title = None

Title of the record, e.g. ‘Perturbative Renormalization of Neutron-Antineutron Operators’.

titles = None

List of title structures.

urls = None

URLs to splash page.

['http://hdl.handle.net/1885/10005']
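
To illustrate how these fields fit together, a HEPRecord can be populated like any Scrapy item. Below is a minimal sketch using a handful of the fields documented above, with the example values given for each field:

from hepcrawl.items import HEPRecord

record = HEPRecord()
record["title"] = "Perturbative Renormalization of Neutron-Antineutron Operators"
record["abstract"] = "We study the dynamics of quantum..."
record["date_published"] = "2016-01-14"
record["dois"] = [{"value": "10.1103/PhysRevD.93.016005"}]
record["arxiv_eprints"] = {
    "value": "1506.00647",
    "categories": ["hep-ph", "hep-lat", "nucl-th"],
}
record["collections"] = ["CORE", "THESIS"]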

Spiders

ALPHA

Spider for ALPHA.

class hepcrawl.spiders.alpha_spider.AlphaSpider(source_file=None, *args, **kwargs)[source]

Alpha crawler. Scrapes theses metadata from the ALPHA experiment web page: http://alpha.web.cern.ch/publications#thesis

  1. parse() iterates through every record on the HTML page and yields a HEPRecord.

Example usage:

scrapy crawl alpha -s "JSON_OUTPUT_DIR=tmp/"
scrapy crawl alpha -a source_file=file://`pwd`/tests/responses/alpha/test_1.htm -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

get_abstract(thesis)[source]

Return a unified abstract, even if it is divided into multiple paragraphs.

parse(response)[source]

Parse Alpha web page into a HEP record.

parse_author_data(thesis)[source]

Parse the line that contains the author data.

start_requests()[source]

You can also run the spider on local test files.

arXiv.org

Spider for arXiv.

class hepcrawl.spiders.arxiv_spider.ArxivSpider(source_file=None, **kwargs)[source]

Spider for crawling arXiv.org OAI-PMH XML files.

scrapy crawl arXiv -a source_file=file://`pwd`/tests/responses/arxiv/sample_arxiv_record.xml

parse_node(response, node)[source]

Parse an arXiv XML exported file into a HEP record.

BASE

Spider for BASE.

class hepcrawl.spiders.base_spider.BaseSpider(source_file=None, *args, **kwargs)[source]

BASE crawler. Scrapes BASE metadata XML files one at a time. The actual files should be retrieved from BASE via its OAI interface. A file can contain multiple records, and this spider harvests only theses. It takes BASE metadata records stored in an XML file.

  1. First a request is sent to parse_node() to look through the XML file and determine whether it has direct link(s) to a fulltext PDF. (It does not actually recognize fulltexts; it is happy when it sees a PDF of some kind.) Calls: parse_node()
  2a. If a direct link exists, it calls build_item() to extract all desired data from the XML file. The data are put into a HEPRecord item and sent to a pipeline for processing. Calls: build_item()
  2b. If no direct link exists, it sends a request to scrape_for_pdf() to follow links and extract the PDF URL. It then calls build_item() to build the HEPRecord (see the sketch below). Calls: scrape_for_pdf(), then build_item()
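
A minimal sketch of this callback chain, written against Scrapy's generic XMLFeedSpider API. The class name, itertag, and XPaths below are illustrative assumptions, not the actual BaseSpider implementation:

import scrapy
from scrapy.spiders import XMLFeedSpider


class CallbackChainSketch(XMLFeedSpider):
    """Illustrative spider showing the parse_node -> scrape_for_pdf -> build_item flow."""

    name = "callback_chain_sketch"  # hypothetical spider name
    itertag = "record"              # hypothetical record tag
    # start_urls / start_requests omitted in this sketch.

    def parse_node(self, response, node):
        # Hypothetical XPath; the real BaseSpider selectors differ.
        direct_link = node.xpath("./link/text()").extract_first()
        if direct_link:
            # 2a. A direct link exists: build the HEPRecord right away.
            return self.build_item(response)
        # 2b. No direct link: follow the splash page and look for a PDF there.
        splash_url = node.xpath("./identifier/text()").extract_first()
        return scrapy.Request(splash_url, callback=self.scrape_for_pdf)

    def scrape_for_pdf(self, response):
        # Find a PDF link on the splash page, then hand over to build_item.
        return self.build_item(response)

    def build_item(self, response):
        # Populate and return a HEPRecord here (omitted in this sketch).
        pass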

Example usage:

scrapy crawl BASE -a source_file=file://`pwd`/tests/responses/base/test_1.xml -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

build_item(response)[source]

Build the final record.

Determine if the XML file has a direct link.

static get_authors(node)[source]

Get the authors.

There is probably only one author, but the name is not necessarily in the creator element. If it appears only in the contributor element, it is impossible to detect unless it is explicitly declared as an author name.

static get_title(node)[source]

Get the title and possible subtitle.

static get_urls_in_record(node)[source]

Return all the different URLs in the XML.

URLs might be stored in the identifier, relation, or link element. Beware the strange “filename.jpg.pdf” URLs.

parse_node(response, node)[source]

Iterate through all the record nodes in the XML.

For each node it checks whether a direct link exists, and either sends a request to scrape the direct link or calls build_item() to build the HEPRecord.

scrape_for_pdf(response)[source]

Scrape splash page for any links to PDFs.

If no direct link existed, parse_node() yields a request here to scrape the URLs. This finds a direct PDF link on a splash page, if one exists. It then asks build_item to build the HEPRecord.

start_requests()[source]

The default starting point for scraping is the local XML file.

Brown

Spider for Brown University Digital Repository

class hepcrawl.spiders.brown_spider.BrownSpider(source_file=None, *args, **kwargs)[source]

Brown crawler. Scrapes theses metadata from a Brown Digital Repository JSON file: https://repository.library.brown.edu/api/collections/355/

Browse the dissertations: https://repository.library.brown.edu/studio/collections/id_355/

  1. parse() iterates through every record in the JSON file and yields a HEPRecord (or a request to scrape for the PDF file if a link exists).

Example usage:

scrapy crawl brown -s "JSON_OUTPUT_DIR=tmp/"
scrapy crawl brown -a source_file=file://`pwd`/tests/responses/brown/test_1.json -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

build_item(response)[source]

Build the final record.

parse(response)[source]

Go through every record in the JSON. If a link to a splash page exists, go scrape it; if not, create a record with the available data.

scrape_splash(response)[source]

Scrape splash page for links to PDFs, author name, copyright date, thesis info and page numbers.

start_requests()[source]

You can also run the spider on local test files.

DNB

Spider for DNB Dissonline.

class hepcrawl.spiders.dnb_spider.DNBSpider(source_file=None, *args, **kwargs)[source]

DNB crawler. Scrapes Deutsche Nationalbibliothek metadata XML files one at a time. The actual files should be retrieved from DNB via its OAI interface. A file can contain multiple records; this spider harvests only theses.

This spider takes DNB metadata records which are stored in an XML file.

  1. The spider parses the local MARC21 XML file for record data.
  2. If a link to the original repository splash page exists, parse_node yields a request to scrape for the abstract. This is only done for a few selected repositories (at least for now).
  3. Finally, a HEPRecord is created in build_item.

Example usage:

scrapy crawl DNB -a source_file=file://`pwd`/tests/responses/dnb/test_1.xml -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

build_item(response)[source]

Build the final record.

Determine if the XML file has a direct link.

static get_affiliations(node)[source]

Cleans the affiliation element.

get_authors(node)[source]

Gets the authors.

static get_thesis_supervisors(node)[source]

Create a structured supervisor dictionary.

static get_urls_in_record(node)[source]

Return all the different URLs in the XML.

parse_node(response, node)[source]

Iterate through all the record nodes in the XML.

For each node it checks whether a splash page link exists, and either sends a request to scrape the abstract or calls build_item to build the HEPRecord.

scrape_for_abstract(response)[source]

Scrape splash page for abstracts.

If a splash page link exists, parse_node yields a request here to scrape the abstract (and page number). Note that all the splash pages are different. It then asks build_item to build the HEPRecord.

start_requests()[source]

The default starting point for scraping is the local XML file.

Elsevier

Spider for Elsevier.

INFN

Spider for INFN.

class hepcrawl.spiders.infn_spider.InfnSpider(source_file=None, year='2017', *args, **kwargs)[source]

INFN crawler. Scrapes theses metadata from the INFN web page: http://www.infn.it/thesis/index.php

  1. If no local HTML file is given, get_list_file fetches one using POST requests. The year is given as an argument; the default is the current year.
  2. parse_node() iterates through every record on the HTML page.
  3. If no PDF links are found, a request to scrape the splash page is returned.
  4. In the end, a HEPRecord is built.

Example usage:

scrapy crawl infn
scrapy crawl infn -a source_file=file://`pwd`/tests/responses/infn/test_1.html -s "JSON_OUTPUT_DIR=tmp/"
scrapy crawl infn -a year=1999 -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

add_fft_file(pdf_files, file_access, file_type)[source]

Create a structured dictionary to add to ‘files’ item.

build_item(response)[source]

Build the final HEPRecord item.

get_authors(node)[source]

Return the authors dictionary.

get_list_file(year)[source]

Get data out of the query web page and save it locally.

get_thesis_info(node)[source]

Create thesis info dictionary.

static get_thesis_supervisors(node)[source]

Create a structured supervisor dictionary.

parse_node(response, node)[source]

Parse INFN web page into a HEP record.

scrape_splash(response)[source]

Scrape INFN web page for more metadata.

start_requests()[source]

You can also run the spider on local test files.

IOP

Spider for IOP.

class hepcrawl.spiders.iop_spider.IOPSpider(zip_file=None, xml_file=None, pdf_files=None, *args, **kwargs)[source]

IOPSpider crawler.

This spider should first be able to harvest files from IOP STACKS (http://stacks.iop.org/Member/). Then it scrapes through the files and extracts the metadata we want.

XML files are in NLM PubMed format: http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.XML_Tag_Descriptions Examples: http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Example_of_a_Standard_XML

  1. Fetch gzipped data packages from STACKS.
  2. Scrape the XML files inside.
  3. Return valid JSON records.

You can also call this spider directly on a gzip package or an XML file. If called without arguments, it will attempt to fetch files from STACKS.

Example usage:

scrapy crawl iop -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml
scrapy crawl iop -a zip_file=file://`pwd`/tests/responses/iop/packages/test.tar.gz -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml
scrapy crawl iop -a pdf_files=`pwd`/tests/responses/iop/pdf/ -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml

For JSON output, add -s "JSON_OUTPUT_DIR=tmp/"; for logging, add -s "LOG_FILE=iop.log".

Happy crawling!

add_fft_file(file_path, file_access, file_type)[source]

Create a structured dictionary and add to ‘files’ item.

get_pdf_path(vol, issue, fpage)[source]

Get path for the correct pdf.

handle_package(zip_file)[source]

Extract all the pdf files in the gzip package.

parse_node(response, node)[source]

Parse the record XML and create a HEPRecord.

start_requests()[source]

The spider can be run on a record XML file. In addition, a gzipped package containing PDF files, or the path to the PDF files, can be given.

If no arguments are given, it should try to get the package from STACKS.

static untar_files(zip_filepath, target_folder)[source]

Unpack a tar.gz package while flattening the directory structure. Return a list of PDF paths.
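
For reference, here is a minimal sketch of such a flattening extraction with the standard tarfile module; it is illustrative only and not necessarily the spider's exact implementation:

import os
import tarfile


def untar_flat(zip_filepath, target_folder):
    """Extract a tar.gz into target_folder, dropping subdirectories.

    Return the paths of the extracted PDF files.
    """
    pdf_paths = []
    with tarfile.open(zip_filepath, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            source = archive.extractfile(member)
            if source is None:
                continue
            # Flatten the directory structure: keep only the base file name.
            out_path = os.path.join(target_folder, os.path.basename(member.name))
            with open(out_path, "wb") as out:
                out.write(source.read())
            if out_path.lower().endswith(".pdf"):
                pdf_paths.append(out_path)
    return pdf_paths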

MAGIC

Spider for MAGIC.

class hepcrawl.spiders.magic_spider.MagicSpider(source_file=None, *args, **kwargs)[source]

MAGIC crawler. Scrapes theses metadata from the MAGIC telescope web page: https://magic.mpp.mpg.de/backend/publications/thesis

  1. parse_node gets the thesis title, author, and date from the listing.
  2. If a link to the splash page exists, scrape_for_pdf tries to fetch the PDF link, abstract, and authors.
  3. build_item builds the HEPRecord.

Example usage:

scrapy crawl magic
scrapy crawl magic -a source_file=file://`pwd`/tests/responses/magic/test_list.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

add_fft_file(pdf_files, file_access, file_type)[source]

Create a structured dictionary and add to ‘files’ item.

build_item(response)[source]

Build the final HEPRecord.

static get_authors(node)[source]

Parse the line that contains the author data.

Note that author surnames and given names are not comma separated, so split_fullname might get a wrong surname.

Return the full HTTP path(s) to the splash page.

parse_node(response, node)[source]

Parse MAGIC web page into a HEP record.

scrape_for_pdf(response)[source]

Scrape for pdf link and abstract.

start_requests()[source]

You can also run the spider on local test files.

MIT

Spider for MIT.

class hepcrawl.spiders.mit_spider.MITSpider(source_file=None, year='2017', *args, **kwargs)[source]

MIT crawler. Scrapes theses metadata from DSpace@MIT (Dept. of Physics dissertations): http://dspace.mit.edu/handle/1721.1/7608/browse

  1. get_list_file makes POST requests to get a list of records as an HTML file. By default it takes the current year and 100 records per file.
  2. parse iterates through every record on the HTML page and yields a request to scrape the full metadata.
  3. build_item builds the final HEPRecord.

Example usage:

scrapy crawl MIT
scrapy crawl MIT -a year=1999 -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

add_fft_file(pdf_files, file_access, file_type)[source]

Create a structured dictionary to add to ‘files’ item.

build_item(response)[source]

Scrape MIT full metadata page and build the final HEPRecord item.

static get_authors(node)[source]

Return the authors dictionary.

get_list_file(year, n=100)[source]

Get data out of the query web page and save it locally.

static get_page_nr(node)[source]

Get and format the page numbers. Return only digits.

static get_thesis_info(node)[source]

Create thesis info dictionary.

static get_thesis_supervisors(node)[source]

Create a structured supervisor dictionary.

There might be multiple supervisors.

parse_node(response, node)[source]

Parse MIT thesis listing and find links to record splash pages.

start_requests()[source]

You can also run the spider on local test files.

PHENIX

Spider for PHENIX.

class hepcrawl.spiders.phenix_spider.PhenixSpider(source_file=None, *args, **kwargs)[source]

PHENIX crawler. Scrapes theses metadata from the PHENIX experiment web page: http://www.phenix.bnl.gov/WWW/talk/theses.php

  1. parse() iterates through every record on the HTML page and yields a HEPRecord.

Example usage:

scrapy crawl phenix
scrapy crawl phenix -a source_file=file://`pwd`/tests/responses/phenix/test_list.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

add_fft_file(pdf_files, file_access, file_type)[source]

Create a structured dictionary and add to ‘files’ item.

get_authors(node)[source]

Return the authors dictionary.

static parse_datablock(node)[source]

Get data out of the text block that contains the title, affiliation, and year.

parse_node(response, node)[source]

Parse PHENIX web page into a HEP record.

start_requests()[source]

You can also run the spider on local test files.

Philpapers.org

Spider for Philpapers.org

class hepcrawl.spiders.phil_spider.PhilSpider(source_file=None, *args, **kwargs)[source]

Phil crawler. Scrapes theses metadata from a Philpapers.org JSON file.

  1. parse() iterates through every record in the JSON file and yields a HEPRecord (or a request to scrape for the PDF file if a link exists).

Example usage:

scrapy crawl phil -s "JSON_OUTPUT_DIR=tmp/"
scrapy crawl phil -a source_file=file://`pwd`/tests/responses/phil/test_thesis.json -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

build_item(response)[source]

Build the final record.

get_authors(author_element)[source]

Parse the line that contains the author data.

get_date(record)[source]

Return a standard format date.

YYYY-MM-DD, YYYY-MM or YYYY.

parse(response)[source]

Parse Philpapers JSON file into a HEP record.

scrape_for_pdf(response)[source]

Scrape splash page for any links to PDFs.

If no direct link existed, parse_node() yields a request here to scrape the URLs. This finds a direct PDF link on a splash page, if one exists. It then asks build_item to build the HEPRecord.

start_requests()[source]

You can also run the spider on local test files.

Proceedings of Science

Spider for POS.

class hepcrawl.spiders.pos_spider.POSSpider(source_file=None, **kwargs)[source]

POS/Sissa crawler.

Extracts from metadata: title, article-id, conf-acronym, authors, affiliations, publication-date, publisher, license, language, link

scrapy crawl PoS -a source_file=file://`pwd`/tests/responses/pos/sample_pos_record.xml

build_item(response)[source]

Parse a PoS XML exported file into a HEP record.

parse(response)[source]

Get PDF information.

scrape_pos_page(response)[source]

Parse a page for PDF link.

T2K

Spider for T2K.

class hepcrawl.spiders.t2k_spider.T2kSpider(source_file=None, *args, **kwargs)[source]

T2K crawler. Scrapes theses metadata from the T2K experiment web page: http://www.t2k.org/docs/thesis

  1. parse_node gets the thesis title, author, and date from the listing.
  2. If a link to the splash page exists, scrape_for_pdf tries to fetch the PDF link and possibly also the abstract.
  3. build_item builds the HEPRecord.

Example usage:

scrapy crawl t2k
scrapy crawl t2k -a source_file=file://`pwd`/tests/responses/t2k/test_list.html -s "JSON_OUTPUT_DIR=tmp/"
scrapy crawl t2k -a source_file=file://`pwd`/tests/responses/t2k/test_1.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!

add_fft_file(pdf_files, file_access, file_type)[source]

Create a structured dictionary and add to ‘files’ item.

build_item(response)[source]

Build the final HEPRecord.

static get_authors(node)[source]

Parse the line that contains the author data.

Note that author surnames and given names are not comma separated, so split_fullname might get a wrong surname.

Return the full HTTP path(s) to the splash page.

parse_node(response, node)[source]

Parse the T2K web page into a HEP record.

scrape_for_pdf(response)[source]

Scrape for pdf link and abstract.

start_requests()[source]

You can also run the spider on local test files.

World Scientific

Spider for World Scientific.

class hepcrawl.spiders.wsp_spider.WorldScientificSpider(package_path=None, ftp_folder='WSP', ftp_host=None, ftp_netrc=None, *args, **kwargs)[source]

World Scientific Proceedings crawler.

This spider connects to a given FTP host and downloads ZIP files containing XML files for extraction into HEP records.

This means that it generates the URLs for Scrapy to crawl in a special way:

  1. First it connects to an FTP host, lists all the new ZIP files found on the remote server, and downloads them to a designated local folder, using start_requests().
  2. Then each ZIP file is unpacked and the XML files found inside are listed, via handle_package() (the callback from start_requests()).
  3. Finally, each XML file is parsed via parse_node().

To run a crawl, you need to pass FTP connection information via ftp_host and ftp_netrc:

scrapy crawl WSP -a 'ftp_host=ftp.example.com' -a 'ftp_netrc=/path/to/netrc'

Happy crawling!
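
A crawl can also be started from Python instead of the command line. Below is a minimal sketch using Scrapy's CrawlerProcess, run from within the hepcrawl project so that get_project_settings() picks up its settings; the host and netrc path are the placeholders from the command above:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from hepcrawl.spiders.wsp_spider import WorldScientificSpider

process = CrawlerProcess(get_project_settings())
process.crawl(
    WorldScientificSpider,
    ftp_host="ftp.example.com",   # placeholder, as in the CLI example
    ftp_netrc="/path/to/netrc",   # placeholder, as in the CLI example
)
process.start()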

handle_package_file(response)[source]

Handle a local zip package and yield every XML.

handle_package_ftp(response)[source]

Handle a zip package and yield every XML found.

parse_node(response, node)[source]

Parse a WSP XML file into a HEP record.

start_requests()[source]

List the selected folder on the remote FTP server and yield new ZIP files.