API Docs

Tasks API

Celery tasks for dealing with the crawler.

(task) inspire_crawler.tasks.submit_results(job_id, errors, log_file, results_uri, results_data=None)[source]

Receive the results submitted by a crawl job.

It then spawns the appropriate workflow, as specified by the crawl job.

Parameters:
  • job_id – Id of the crawler job.
  • errors – Errors that occurred during the crawl, if any.
  • log_file – Path to the log file of the crawler job.
  • results_uri – URI to the file containing the results of the crawl job, namely the records extracted.
  • results_data – Optional data payload with the results list, to skip retrieving them from results_uri; useful for slow or unreliable storage backends.
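
A minimal sketch of how results might be handed back through this task; in practice the crawler's pipeline triggers it asynchronously via the Flower endpoint configured in CRAWLER_SETTINGS. The job UUID, file paths and payload below are illustrative:

from inspire_crawler.tasks import submit_results

submit_results.delay(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b',  # hypothetical Scrapyd job UUID
    errors=None,                                     # no errors reported by the crawl
    log_file='/var/log/scrapyd/myspider.log',        # hypothetical log file path
    results_uri='file:///data/crawls/myspider.jl',   # hypothetical file with the extracted records
    results_data=None,                               # or pass the records inline to skip fetching results_uri
)
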
(task) inspire_crawler.tasks.schedule_crawl(spider, workflow, **kwargs)[source]

Schedule a crawl using configuration from the workflow objects.
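
A minimal sketch of scheduling a crawl through this task; the spider and workflow names are illustrative, and extra keyword arguments are assumed to be forwarded to the spider:

from inspire_crawler.tasks import schedule_crawl

schedule_crawl.delay(
    spider='myspider',       # hypothetical spider registered in the Scrapy project
    workflow='my_workflow',  # hypothetical workflow to spawn for each harvested record
    from_date='2024-01-01',  # extra kwargs, assumed to be passed on to the spider
)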

Signal receivers

Signal receivers for crawler integration.

inspire_crawler.receivers.receive_oaiharvest_job(request, records, name, **kwargs)[source]

Receive a list of harvested OAI-PMH records and schedule crawls.
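
A sketch of wiring this receiver to a harvesting signal; it assumes the oaiharvest_finished signal from invenio-oaiharvester is the trigger, and the package's extension may already register this connection for you:

from invenio_oaiharvester.signals import oaiharvest_finished
from inspire_crawler.receivers import receive_oaiharvest_job

# Schedule crawls whenever an OAI-PMH harvesting run finishes.
oaiharvest_finished.connect(receive_oaiharvest_job)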

Configuration

Configuration for crawler integration.

inspire_crawler.config.CRAWLER_DATA_TYPE = 'hep'

WorkflowObject data_type to set on all created workflow objects.

inspire_crawler.config.CRAWLER_HOST_URL = 'http://localhost:6800'

URL to Scrapyd HTTP server.

inspire_crawler.config.CRAWLER_PROJECT = 'hepcrawl'

Scrapy project name to schedule crawls for.

inspire_crawler.config.CRAWLER_SETTINGS = {'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results', 'API_PIPELINE_URL': 'http://localhost:5555/api/task/async-apply'}

Dictionary of settings to add to crawlers.

By default it points to the Flower tasks HTTP API and to the standard task to be called with the results of the harvesting.
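
A sketch of overriding these settings in the application's configuration; the host names are illustrative:

# In your application's config (values are illustrative).
CRAWLER_HOST_URL = 'http://scrapyd.example.org:6800'
CRAWLER_PROJECT = 'hepcrawl'
CRAWLER_SETTINGS = {
    'API_PIPELINE_URL': 'http://flower.example.org:5555/api/task/async-apply',
    'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results',
}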

inspire_crawler.config.CRAWLER_SPIDER_ARGUMENTS = {}

Add any spider arguments to be passed when scheduling tasks.

For example, for the spider myspider:

{
    'myspider': {'somearg': 'foo'}
}

You can also pass arguments directly to the scheduler with kwargs.
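
Both approaches, sketched with the illustrative names from the example above:

# Via configuration, applied to every scheduled crawl of that spider:
CRAWLER_SPIDER_ARGUMENTS = {'myspider': {'somearg': 'foo'}}

# Or per call, forwarded by the scheduler task:
from inspire_crawler.tasks import schedule_crawl
schedule_crawl.delay('myspider', 'my_workflow', somearg='foo')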

Models

Models for crawler integration.

class inspire_crawler.models.CrawlerJob(**kwargs)[source]

Keeps track of submitted crawler jobs.

classmethod create(job_id, spider, workflow, results=None, logs=None, status=<JobStatus.PENDING: 'pending'>)[source]

Create a new entry for a scheduled crawler job.

classmethod get_by_job(job_id)[source]

Get a row by Job UUID.
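
A sketch of recording a scheduled job and fetching it again later; the job UUID and names are illustrative:

from inspire_crawler.models import CrawlerJob

job = CrawlerJob.create(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b',  # hypothetical Scrapyd job UUID
    spider='myspider',
    workflow='my_workflow',
)
# Later, e.g. when the results are submitted, look up the same row by its job UUID.
same_job = CrawlerJob.get_by_job(job.job_id)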

id
job_id
logs
results
save()[source]

Save object to persistent storage.

scheduled
spider
status
workflow
class inspire_crawler.models.CrawlerWorkflowObject(**kwargs)[source]

Relation between a job and workflow objects.

job_id
object_id
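
A sketch of finding which workflow objects a crawl job produced, assuming the usual Flask-SQLAlchemy query interface; the job UUID is illustrative:

from inspire_crawler.models import CrawlerWorkflowObject

links = CrawlerWorkflowObject.query.filter_by(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b'  # hypothetical Scrapyd job UUID
).all()
object_ids = [link.object_id for link in links]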

Errors

Custom errors to watch out for.

exception inspire_crawler.errors.CrawlerError[source]

Something went wrong while crawling.

exception inspire_crawler.errors.CrawlerInvalidResultsPath[source]

Problem getting results from crawler.

exception inspire_crawler.errors.CrawlerJobError[source]

There was an error with a job.

exception inspire_crawler.errors.CrawlerJobNotExistError[source]

Problem getting crawler job.

exception inspire_crawler.errors.CrawlerScheduleError[source]

Problem scheduling crawler.
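
A sketch of guarding a scheduling call with these exceptions; the spider and workflow names are illustrative, the task is called synchronously only for brevity, and it is an assumption here that CrawlerScheduleError is what schedule_crawl raises when Scrapyd refuses the job:

import logging

from inspire_crawler.errors import CrawlerScheduleError
from inspire_crawler.tasks import schedule_crawl

try:
    schedule_crawl('myspider', 'my_workflow')
except CrawlerScheduleError:
    # e.g. Scrapyd rejected the request or the spider/project is unknown
    logging.exception('Could not schedule the crawl')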