API
Call a crawler from Python.
Items
Item models for scraped HEP records.
See the Scrapy documentation about items: http://doc.scrapy.org/en/latest/topics/items.html
class hepcrawl.items.HEPRecord(*args, **kwargs)
HEPRecord represents a generic HEP record based on the HEP JSON schema.
This is not a 1-to-1 mapping to the HEP JSON schema; it is a somewhat flatter structure that will be transformed before export to INSPIRE. For complex fields, like authors, please refer to the HEP JSON Schema for details.
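Being a Scrapy Item, a HEPRecord is populated field by field with dict-style access. A minimal sketch of how a spider fills one in (the values are illustrative and the direct assignment is a simplification; real spiders typically go through an item loader):

    from hepcrawl.items import HEPRecord

    # Illustrative values only; see the field list below for the
    # expected shape of each field.
    record = HEPRecord()
    record['title'] = 'Perturbative Renormalization of Neutron-Antineutron Operators'
    record['abstract'] = 'We study the dynamics of quantum...'
    record['dois'] = [{'value': '10.1103/PhysRevD.93.016005'}]
    record['authors'] = [{
        'surname': 'Ellis',
        'given_names': 'Richard John',
        'affiliations': [{'value': 'Some University'}],  # raw string; made-up example
    }]
    record['source'] = 'World Scientific'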
- abstract = None
  Abstract of the record, e.g. ‘We study the dynamics of quantum...’.
- abstracts = None
  Final structure of abstract information. DO NOT ADD DATA TO THIS FIELD.
- acquisition_source = None
  Source of the record in the acquisition_source format.
- additional_files = None
  Files (fulltexts, package) belonging to this item:

      [{
          "type": "Fulltext",  # Fulltext, Supplemental, Data, Figure
          "uri": "file:///path/to/file",  # can also be HTTP
      }]

- arxiv_eprints = None
  ArXiv E-print information:

      {
          "value": "1506.00647",
          "categories": ['hep-ph', 'hep-lat', 'nucl-th']
      }
- authors = None
  Special author format which will transform the incoming raw data to correct formats, for example by handling initials and full name etc.
  List of authors of this form:

      [{
          "surname": "Ellis",
          "given_names": "Richard John",
          "full_name": "",  # if no surname/given_names
          "affiliations": [{
              value: "raw string", ..
          }]
      }, ..]
- classification_numbers = None
  Classification numbers like PACS numbers:

      [
          {
              'classification_number': 'FOO',
              'standard': 'PACS'
          }, ...
      ]

- collaborations = None
  A list of the record collaborations, if any:

      [ 'Planck Collaboration' ]
- collections = None
  List of collections the article belongs to, e.g. [‘CORE’, ‘THESIS’].
- copyright = None
  Final structure for copyright information.
- date_published = None
  Date of publication in string format, e.g. ‘2016-01-14’.
- dois = None
  DOIs:

      [{ 'value': '10.1103/PhysRevD.93.016005' }]

- external_system_numbers = None
  External System Numbers:

      [
          {
              "institute": "SPIRESTeX",
              "value": "Mayrhofer:2012zy"
          },
          {
              "institute": "arXiv",
              "value": "oai:arXiv.org:1211.6742"
          }
      ]

- extra_data = None
  Extra data belonging to this item that will NOT be part of the final record:

      { "foo": "bar" }
- file_urls = None
  List of files to be downloaded with FilesPipeline and added to files.
- files = None
  List of files downloaded by FilesPipeline.
- free_keywords = None
  Free keywords:

      [
          {
              'value': 'Physics',
              'source': 'author'
          }, ...
      ]

- imprints = None
  Structure for imprint information.
- journal_doctype = None
  Special type of publication, e.g. “Erratum”, “Addendum”.
- page_nr = None
  Page number as string, e.g. ‘2’.
- preprint_date = None
  Date of preprint release.
- pubinfo_freetext = None
  Raw journal reference string.
- public_notes = None
  Notes:

      [{
          "source": "arXiv",
          "value": "46 pages, 3 figures; v2 typos corrected, citations added"
      }]

- publication_info = None
  Structured publication information.
- references = None
  List of references in the following form:

      [{
          'recid': '',
          'texkey': '',
          'doi': '',
          'collaboration': [],
          'editors': [],
          'authors': [],
          'misc': [],
          'number': 0,
          'isbn': '',
          'publisher': [],
          'maintitle': '',
          'report_number': [],
          'title': [],
          'url': [],
          'journal_pubnote': [".*,.*,.*(,.*)?"],
          'raw_reference': [],
          'year': 2016,
      }, ..]
- related_article_doi = None
  DOI of Addendum/Erratum:

      [{ 'value': '10.1103/PhysRevD.93.016005' }]
- report_numbers = None
  Structure for report_numbers, e.g. [‘CERN-001’, ‘DESY-002’].
- source = None
  Source of the record, e.g. ‘World Scientific’. Used across many fields.
- subtitle = None
  Sub-title of the record, e.g. ‘A treatise on the universe’.
- thesis = None
  Thesis information:

      [{
          'date': '',
          'defense_date': '',
          'institutions': [],
          'degree_type': '',
      }]

- title = None
  Title of the record, e.g. ‘Perturbative Renormalization of Neutron-Antineutron Operators’.
- titles = None
  List of title structures.
- urls = None
  URLs to splash page:

      ['http://hdl.handle.net/1885/10005']
Spiders
ALPHA
Spider for ALPHA.
class hepcrawl.spiders.alpha_spider.AlphaSpider(source_file=None, *args, **kwargs)
Alpha crawler. Scrapes theses metadata from the Alpha experiment web page: http://alpha.web.cern.ch/publications#thesis
- parse() iterates through every record on the html page and yields a HEPRecord (sketched after the usage example below).
Example usage:

    scrapy crawl alpha -s "JSON_OUTPUT_DIR=tmp/"
    scrapy crawl alpha -a source_file=file://`pwd`/tests/responses/alpha/test_1.htm -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
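A rough sketch of that parse() loop, assuming a simple listing page (the XPath expressions are illustrative, not the spider's actual selectors):

    from hepcrawl.items import HEPRecord

    def parse(self, response):
        """Yield one HEPRecord per thesis entry on the listing page."""
        # Illustrative XPath; the real ALPHA page structure differs.
        for thesis in response.xpath("//div[contains(@class, 'thesis')]"):
            record = HEPRecord()
            record['title'] = thesis.xpath('.//a/text()').extract_first()
            record['date_published'] = thesis.xpath(
                ".//span[@class='date']/text()").extract_first()
            yield record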
Parses the line containing data about the author(s).
arXiv.org
Spider for arXiv.
BASE
Spider for BASE.
class hepcrawl.spiders.base_spider.BaseSpider(source_file=None, *args, **kwargs)
BASE crawler. Scrapes BASE metadata XML files one at a time. The actual files should be retrieved from BASE via its OAI interface. A file can contain multiple records, but this spider harvests only theses. It takes BASE metadata records stored in an XML file.
1. First a request is sent to parse_node() to look through the XML file and determine whether it has direct link(s) to a fulltext pdf. (Actually it doesn’t recognize fulltexts; it’s happy when it sees a pdf of some kind.) Calls: parse_node()
2a. If a direct link exists, build_item() is called to extract all the desired data from the XML file. The data is put into a HEPRecord item and sent to a pipeline for processing. Calls: build_item()
2b. If no direct link exists, a request is sent to scrape_for_pdf() to follow links and extract the pdf url. build_item() is then called to build the HEPRecord (see the sketch after the usage example). Calls: scrape_for_pdf(), then build_item()
Example usage:

    scrapy crawl BASE -a source_file=file://`pwd`/tests/responses/base/test_1.xml -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
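The 2a/2b branching can be pictured like this (a simplified sketch; the pdf test and the request plumbing are assumptions, not the spider's exact code):

    import scrapy

    def parse_node(self, response, node):
        """Follow a direct pdf link if one exists, otherwise scrape for it."""
        urls = self.get_urls_in_record(node)
        direct_links = [url for url in urls if url.endswith('.pdf')]
        if direct_links:
            # 2a. Direct link found: build the HEPRecord right away.
            return self.build_item(response)
        # 2b. No direct link: follow a splash page and look for a pdf there.
        request = scrapy.Request(urls[0], callback=self.scrape_for_pdf)
        request.meta['node'] = node
        return request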
Get the authors.
There is probably only one author, but it is not necessarily in the creator element. If it is only in the contributor element, it is impossible to detect unless it is explicitly declared as an author name.
static get_urls_in_record(node)
Return all the different urls in the XML.
Urls might be stored in identifier, relation, or link elements. Beware the strange “filename.jpg.pdf” urls.
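A plausible implementation sketch based on the element names mentioned above (the exact paths and namespace handling are assumptions):

    def get_urls_in_record(node):
        """Collect urls from the identifier, relation, and link elements.

        Sketch of a @staticmethod on the spider.
        """
        identifiers = [
            ident for ident in node.xpath('.//identifier/text()').extract()
            if 'http' in ident.lower()
        ]
        relations = [
            rel for rel in node.xpath('.//relation/text()').extract()
            if 'http' in rel.lower()
        ]
        links = node.xpath('.//link/text()').extract()
        # Deduplicate; beware of odd urls like "filename.jpg.pdf".
        return list(set(identifiers + relations + links))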
parse_node(response, node)
Iterate through all the record nodes in the XML.
For each node it checks whether a direct link exists, and either sends a request to scrape the direct link or calls build_item() to build the HEPRecord.
Brown
Spider for Brown University Digital Repository.
class hepcrawl.spiders.brown_spider.BrownSpider(source_file=None, *args, **kwargs)
Brown crawler. Scrapes theses metadata from the Brown Digital Repository JSON file: https://repository.library.brown.edu/api/collections/355/
Browse the dissertations: https://repository.library.brown.edu/studio/collections/id_355/
- parse() iterates through every record in the JSON file and yields a HEPRecord (or a request to scrape for the pdf file if a link exists).
Example usage:

    scrapy crawl brown -s "JSON_OUTPUT_DIR=tmp/"
    scrapy crawl brown -a source_file=file://`pwd`/tests/responses/brown/test_1.json -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
parse(response)
Go through every record in the JSON. If a link to the splash page exists, go scrape it; if not, create a record with the available data.
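A sketch of that flow, assuming the collection API lists records under an items/docs key (the key names and the scrape_splash callback name are assumptions):

    import json

    import scrapy

    def parse(self, response):
        """Yield a request per record with a splash link, else a HEPRecord."""
        jsonresponse = json.loads(response.body_as_unicode())
        for jsonrecord in jsonresponse['items']['docs']:  # assumed layout
            splash_link = jsonrecord.get('uri')
            if splash_link:
                request = scrapy.Request(splash_link, callback=self.scrape_splash)
                request.meta['jsonrecord'] = jsonrecord
                yield request
            else:
                yield self.build_item(jsonrecord)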
DNB
Spider for DNB Dissonline.
class hepcrawl.spiders.dnb_spider.DNBSpider(source_file=None, *args, **kwargs)
DNB crawler. Scrapes Deutsche Nationalbibliothek metadata XML files one at a time. The actual files should be retrieved from DNB via its OAI interface. A file can contain multiple records; this spider harvests only theses.
This spider takes DNB metadata records which are stored in an XML file.
- The spider will parse the local MARC21XML format file for record data.
- If a link to the original repository splash page exists, parse_node will yield a request to scrape for the abstract. This will only be done for a few selected repositories (at least for now).
- Finally a HEPRecord will be created in build_item.
Example usage:

    scrapy crawl DNB -a source_file=file://`pwd`/tests/responses/dnb/test_1.xml -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
Gets the authors.
parse_node(response, node)
Iterate through all the record nodes in the XML.
For each node it checks whether a splash page link exists, and either sends a request to scrape the abstract or calls build_item to build the HEPRecord.
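The same check-then-branch pattern as in the BASE spider, sketched for MARC21XML. Field 856 subfield u conventionally holds a url in MARC, but its use here and the callback name are assumptions:

    import scrapy

    def parse_node(self, response, node):
        """Scrape the splash page for an abstract when a link exists."""
        # MARC21XML: datafield 856, subfield 'u' is the conventional url slot.
        splash_links = node.xpath(
            "./datafield[@tag='856']/subfield[@code='u']/text()").extract()
        if splash_links:
            request = scrapy.Request(splash_links[0],
                                     callback=self.scrape_for_abstract)
            request.meta['node'] = node
            return request
        return self.build_item(response)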
Elsevier
Spider for Elsevier.
INFN
Spider for INFN.
class hepcrawl.spiders.infn_spider.InfnSpider(source_file=None, year='2017', *args, **kwargs)
INFN crawler. Scrapes theses metadata from the INFN web page: http://www.infn.it/thesis/index.php
- If no local html file is given, get_list_file fetches one using POST requests. The year is given as an argument; the default is the current year.
- parse_node() iterates through every record on the html page.
- If no pdf links are found, a request to scrape the splash page is returned.
- In the end, a HEPRecord is built.
Example usage:

    scrapy crawl infn
    scrapy crawl infn -a source_file=file://`pwd`/tests/responses/infn/test_1.html -s "JSON_OUTPUT_DIR=tmp/"
    scrapy crawl infn -a year=1999 -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
add_fft_file(pdf_files, file_access, file_type)
Create a structured dictionary to add to the ‘files’ item.
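A sketch of the dictionary such a helper could build; the keys echo the additional_files structure from the Items section and are assumptions, as is the 'INSPIRE-PUBLIC' access value:

    def add_fft_file(self, pdf_files, file_access, file_type):
        """Create a structured dictionary to add to the 'files' item."""
        return [
            {
                'access': file_access,  # e.g. 'INSPIRE-PUBLIC' (illustrative)
                'description': self.name.upper(),
                'url': pdf_file,
                'type': file_type,      # e.g. 'Fulltext'
            }
            for pdf_file in pdf_files
        ]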
Return authors dictionary
IOP
Spider for IOP.
class hepcrawl.spiders.iop_spider.IOPSpider(zip_file=None, xml_file=None, pdf_files=None, *args, **kwargs)
IOPSpider crawler.
This spider should first be able to harvest files from IOP STACKS (http://stacks.iop.org/Member/). Then it should scrape through the files and get the things we want.
XML files are in NLM PubMed format: http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.XML_Tag_Descriptions
Examples: http://www.ncbi.nlm.nih.gov/books/NBK3828/#publisherhelp.Example_of_a_Standard_XML
- Fetch gzipped data packages from STACKS.
- Scrape the XML files inside (the unpacking step is sketched after the usage example below).
- Return valid JSON records.
You can also call this spider directly on a gzip package or an XML file. If called without arguments, it will attempt to fetch files from STACKS.
Example usage:

    scrapy crawl iop -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml
    scrapy crawl iop -a zip_file=file://`pwd`/tests/responses/iop/packages/test.tar.gz -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml
    scrapy crawl iop -a pdf_files=`pwd`/tests/responses/iop/pdf/ -a xml_file=file://`pwd`/tests/responses/iop/xml/test_standard.xml

For JSON output, add -s "JSON_OUTPUT_DIR=tmp/". For logging, add -s "LOG_FILE=iop.log".

Happy crawling!
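A sketch of the unpacking step for the gzipped packages (the method name and folder handling are assumptions):

    import os
    import tarfile

    def untar_files(self, zip_filepath, target_folder):
        """Unpack a gzipped tarball and return the paths of the XML files."""
        with tarfile.open(zip_filepath, 'r:gz') as archive:
            archive.extractall(target_folder)
            return [
                os.path.join(target_folder, name)
                for name in archive.getnames()
                if name.endswith('.xml')
            ]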
add_fft_file(file_path, file_access, file_type)
Create a structured dictionary and add it to the ‘files’ item.
MAGIC
Spider for MAGIC.
class hepcrawl.spiders.magic_spider.MagicSpider(source_file=None, *args, **kwargs)
MAGIC crawler. Scrapes theses metadata from the MAGIC telescope web page: https://magic.mpp.mpg.de/backend/publications/thesis
- parse_node will get the thesis title, author and date from the listing.
- If a link to the splash page exists, scrape_for_pdf will try to fetch the pdf link, abstract, and authors.
- build_item will build the HEPRecord.
Example usage:

    scrapy crawl magic
    scrapy crawl magic -a source_file=file://`pwd`/tests/responses/magic/test_list.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
add_fft_file(pdf_files, file_access, file_type)
Create a structured dictionary and add it to the ‘files’ item.
Parses the line containing data about the author(s).
Note that author surnames and given names are not comma separated, so split_fullname might get a wrong surname (see the sketch below).
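The pitfall is easy to see with a toy splitter (a sketch, not hepcrawl's actual split_fullname):

    def split_fullname(author):
        """Split a full name into (surname, given_names)."""
        if ',' in author:
            # Comma-separated names split unambiguously.
            surname, given_names = author.split(',', 1)
            return surname.strip(), given_names.strip()
        # Without a comma we can only guess: take the last token as the
        # surname, which breaks on multi-word surnames.
        parts = author.split()
        return parts[-1], ' '.join(parts[:-1])

    print(split_fullname('van der Meer, Simon'))  # ('van der Meer', 'Simon')
    print(split_fullname('Simon van der Meer'))   # ('Meer', 'Simon van der') -- wrong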
MIT
Spider for MIT.
class hepcrawl.spiders.mit_spider.MITSpider(source_file=None, year='2017', *args, **kwargs)
MIT crawler. Scrapes theses metadata from DSpace@MIT (Dept. of Physics dissertations): http://dspace.mit.edu/handle/1721.1/7608/browse
- get_list_file makes POST requests to get the list of records as an html file (sketched after the usage example below). The defaults are the current year and 100 records per file.
- parse iterates through every record on the html page and yields a request to scrape the full metadata.
- build_item builds the final HEPRecord.
Example usage:

    scrapy crawl MIT
    scrapy crawl MIT -a year=1999 -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
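A sketch of the POST step using Scrapy's FormRequest; the form field names are assumptions, not DSpace's actual parameters:

    import scrapy

    def get_list_file(self, year, records_per_file=100):
        """Request the listing page for one year as an html file."""
        return scrapy.FormRequest(
            'http://dspace.mit.edu/handle/1721.1/7608/browse',
            # 'year' and 'rpp' (records per page) are assumed form fields.
            formdata={'year': str(year), 'rpp': str(records_per_file)},
            callback=self.parse,
        )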
add_fft_file(pdf_files, file_access, file_type)
Create a structured dictionary to add to the ‘files’ item.
Return authors dictionary
PHENIX
Spider for PHENIX.
class hepcrawl.spiders.phenix_spider.PhenixSpider(source_file=None, *args, **kwargs)
PHENIX crawler. Scrapes theses metadata from the PHENIX experiment web page: http://www.phenix.bnl.gov/WWW/talk/theses.php
- parse() iterates through every record on the html page and yields a HEPRecord.
Example usage:

    scrapy crawl phenix
    scrapy crawl phenix -a source_file=file://`pwd`/tests/responses/phenix/test_list.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
add_fft_file(pdf_files, file_access, file_type)
Create a structured dictionary and add it to the ‘files’ item.
Return authors dictionary
Philpapers.org
Spider for Philpapers.org.
class hepcrawl.spiders.phil_spider.PhilSpider(source_file=None, *args, **kwargs)
Phil crawler. Scrapes theses metadata from a Philpapers.org JSON file.
- parse() iterates through every record in the JSON file and yields a HEPRecord (or a request to scrape for the pdf file if a link exists).
Example usage:

    scrapy crawl phil -s "JSON_OUTPUT_DIR=tmp/"
    scrapy crawl phil -a source_file=file://`pwd`/tests/responses/phil/test_thesis.json -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
Parses the line containing data about the author(s).
Proceedings of Science
Spider for POS.
class hepcrawl.spiders.pos_spider.POSSpider(source_file=None, **kwargs)
POS/Sissa crawler.
Extracts from metadata: title, article-id, conf-acronym, authors, affiliations, publication-date, publisher, license, language, link.
Example usage:

    scrapy crawl PoS -a source_file=file://`pwd`/tests/responses/pos/sample_pos_record.xml
T2K
Spider for T2K.
class hepcrawl.spiders.t2k_spider.T2kSpider(source_file=None, *args, **kwargs)
T2K crawler. Scrapes theses metadata from the T2K experiment web page: http://www.t2k.org/docs/thesis
- parse_node will get the thesis title, author and date from the listing.
- If a link to the splash page exists, scrape_for_pdf will try to fetch the pdf link and possibly also the abstract.
- build_item will build the HEPRecord.
Example usage:

    scrapy crawl t2k
    scrapy crawl t2k -a source_file=file://`pwd`/tests/responses/t2k/test_list.html -s "JSON_OUTPUT_DIR=tmp/"
    scrapy crawl t2k -a source_file=file://`pwd`/tests/responses/t2k/test_1.html -s "JSON_OUTPUT_DIR=tmp/"

Happy crawling!
add_fft_file(pdf_files, file_access, file_type)
Create a structured dictionary and add it to the ‘files’ item.
Parses the line containing data about the author(s).
Note that author surnames and given names are not comma separated, so split_fullname might get a wrong surname (see the sketch in the MAGIC section above).
World Scientific
Spider for World Scientific.
class hepcrawl.spiders.wsp_spider.WorldScientificSpider(package_path=None, ftp_folder='WSP', ftp_host=None, ftp_netrc=None, *args, **kwargs)
World Scientific Proceedings crawler.
This spider connects to a given FTP host and downloads zip files with XML files for extraction into HEP records.
This means that it generates the URLs for Scrapy to crawl in a special way:
1. First it connects to the FTP host, lists all the new ZIP files found on the remote server, and downloads them to a designated local folder, using start_requests() (sketched at the end of this section).
2. Then each ZIP file is unpacked and the XML files found inside are listed, via handle_package(), the callback from start_requests().
3. Finally, each XML file is parsed via parse_node().
To run a crawl, you need to pass FTP connection information via ftp_host and ftp_netrc:

    scrapy crawl WSP -a 'ftp_host=ftp.example.com' -a 'ftp_netrc=/path/to/netrc'

Happy crawling!
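A sketch of the start_requests() step; ftp_list_files is a hypothetical helper that compares the remote listing with the local target folder and returns only the new files, and the meta keys follow Scrapy's FTP download handler:

    import scrapy

    def start_requests(self):
        """List new ZIP packages on the FTP server and download each one."""
        # ftp_list_files is hypothetical; self.target_folder stands in for
        # the designated local download folder.
        new_files = ftp_list_files(self.ftp_folder, self.target_folder,
                                   server=self.ftp_host,
                                   netrc_file=self.ftp_netrc)
        for remote_file in new_files:
            yield scrapy.Request(
                'ftp://{0}/{1}'.format(self.ftp_host, remote_file),
                # Credentials come from netrc in this sketch, so the
                # user/password meta entries are left empty.
                meta={'ftp_user': '', 'ftp_password': ''},
                callback=self.handle_package,
            )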