API Docs

Tasks API

Celery tasks for dealing with the crawler.

(task) inspire_crawler.tasks.submit_results(job_id, errors, log_file, results_uri, results_data=None)[source]

Receive the results submitted by a crawl job.

It then spawns the appropriate workflow, as specified by the crawl job.

Parameters:
  • job_id – Id of the crawler job.
  • errors – Errors that occurred during the crawl, if any.
  • log_file – Path to the log file of the crawler job.
  • results_uri – URI to the file containing the results of the crawl job, namely the records extracted.
  • results_data – Optional data payload with the results list, to skip retrieving them from results_uri; useful for slow or unreliable storage backends.
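
A minimal sketch of how results might be handed back through this task; in practice the crawler's pipeline triggers it asynchronously via the Flower endpoint configured in CRAWLER_SETTINGS. The job UUID, file paths and payload below are illustrative:

from inspire_crawler.tasks import submit_results

submit_results.delay(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b',  # hypothetical Scrapyd job UUID
    errors=None,                                     # no errors reported by the crawl
    log_file='/var/log/scrapyd/myspider.log',        # hypothetical log file path
    results_uri='file:///data/crawls/myspider.jl',   # hypothetical file with the extracted records
    results_data=None,                               # or pass the records inline to skip fetching results_uri
)
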
(task) inspire_crawler.tasks.schedule_crawl(spider, workflow, **kwargs)[source]

Schedule a crawl using configuration from the workflow objects.
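
A minimal sketch of scheduling a crawl through this task; the spider and workflow names are illustrative, and extra keyword arguments are assumed to be forwarded to the spider:

from inspire_crawler.tasks import schedule_crawl

schedule_crawl.delay(
    spider='myspider',       # hypothetical spider registered in the Scrapy project
    workflow='my_workflow',  # hypothetical workflow to spawn for each harvested record
    from_date='2024-01-01',  # extra kwargs, assumed to be passed on to the spider
)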

Signal receivers

Signal receivers for crawler integration.

inspire_crawler.receivers.receive_oaiharvest_job(request, records, name, **kwargs)[source]

Receive a list of harvested OAI-PMH records and schedule crawls.
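
A sketch of wiring this receiver to a harvesting signal; it assumes the oaiharvest_finished signal from invenio-oaiharvester is the trigger, and the package's extension may already register this connection for you:

from invenio_oaiharvester.signals import oaiharvest_finished
from inspire_crawler.receivers import receive_oaiharvest_job

# Schedule crawls whenever an OAI-PMH harvesting run finishes.
oaiharvest_finished.connect(receive_oaiharvest_job)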

Configuration

Configuration for crawler integration.

inspire_crawler.config.CRAWLER_DATA_TYPE = 'hep'

WorkflowObject data_type to set on all created workflow objects.

inspire_crawler.config.CRAWLER_HOST_URL = 'http://localhost:6800'

URL to Scrapyd HTTP server.

inspire_crawler.config.CRAWLER_PROJECT = 'hepcrawl'

Scrapy project name to schedule crawls for.

inspire_crawler.config.CRAWLER_SETTINGS = {'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results', 'API_PIPELINE_URL': 'http://localhost:5555/api/task/async-apply'}

Dictionary of settings to add to crawlers.

By default it points to the Flower tasks HTTP API and to the standard task to be called with the results of the harvesting.
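
A sketch of overriding these settings in the application's configuration; the host names are illustrative:

# In your application's config (values are illustrative).
CRAWLER_HOST_URL = 'http://scrapyd.example.org:6800'
CRAWLER_PROJECT = 'hepcrawl'
CRAWLER_SETTINGS = {
    'API_PIPELINE_URL': 'http://flower.example.org:5555/api/task/async-apply',
    'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results',
}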

inspire_crawler.config.CRAWLER_SPIDER_ARGUMENTS = {}

Add any spider arguments to be passed when scheduling tasks.

For example, for the spider myspider:

{
    'myspider': {'somearg': 'foo'}
}

You can also pass arguments directly to the scheduler with kwargs.
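
Both approaches, sketched with the illustrative names from the example above:

# Via configuration, applied to every scheduled crawl of that spider:
CRAWLER_SPIDER_ARGUMENTS = {'myspider': {'somearg': 'foo'}}

# Or per call, forwarded by the scheduler task:
from inspire_crawler.tasks import schedule_crawl
schedule_crawl.delay('myspider', 'my_workflow', somearg='foo')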

Models

Models for crawler integration.

class inspire_crawler.models.CrawlerJob(**kwargs)[source]

Keeps track of submitted crawler jobs.

classmethod create(job_id, spider, workflow, results=None, logs=None, status=<JobStatus.PENDING: 'pending'>)[source]

Create a new entry for a scheduled crawler job.

classmethod get_by_job(job_id)[source]

Get a row by Job UUID.
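
A sketch of recording a scheduled job and fetching it again later; the job UUID and names are illustrative:

from inspire_crawler.models import CrawlerJob

job = CrawlerJob.create(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b',  # hypothetical Scrapyd job UUID
    spider='myspider',
    workflow='my_workflow',
)
# Later, e.g. when the results are submitted, look up the same row by its job UUID.
same_job = CrawlerJob.get_by_job(job.job_id)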

id
job_id
logs
results
save()[source]

Save object to persistent storage.

scheduled
spider
status
workflow
class inspire_crawler.models.CrawlerWorkflowObject(**kwargs)[source]

Relation between a job and workflow objects.

job_id
object_id
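
A sketch of finding which workflow objects a crawl job produced, assuming the usual Flask-SQLAlchemy query interface; the job UUID is illustrative:

from inspire_crawler.models import CrawlerWorkflowObject

links = CrawlerWorkflowObject.query.filter_by(
    job_id='0f1e2d3c-4b5a-4678-8a9b-0c1d2e3f4a5b'  # hypothetical Scrapyd job UUID
).all()
object_ids = [link.object_id for link in links]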

Errors

Custom errors to watch out for.

exception inspire_crawler.errors.CrawlerError[source]

Something went wrong while crawling.

exception inspire_crawler.errors.CrawlerInvalidResultsPath[source]

Problem getting results from crawler.

exception inspire_crawler.errors.CrawlerJobError[source]

There was an error with a job.

exception inspire_crawler.errors.CrawlerJobNotExistError[source]

Problem getting crawler job.

exception inspire_crawler.errors.CrawlerScheduleError[source]

Problem scheduling crawler.
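
A sketch of guarding a scheduling call with these exceptions; the spider and workflow names are illustrative, the task is called synchronously only for brevity, and it is an assumption here that CrawlerScheduleError is what schedule_crawl raises when Scrapyd refuses the job:

import logging

from inspire_crawler.errors import CrawlerScheduleError
from inspire_crawler.tasks import schedule_crawl

try:
    schedule_crawl('myspider', 'my_workflow')
except CrawlerScheduleError:
    # e.g. Scrapyd rejected the request or the spider/project is unknown
    logging.exception('Could not schedule the crawl')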