API Docs¶
Tasks API¶
Celery tasks for dealing with the crawler.
- (task) inspire_crawler.tasks.submit_results(job_id, errors, log_file, results_uri, results_data=None)[source]¶ Receive the submission of the results of a crawl job.
It then spawns the appropriate workflow, as specified by the crawl job.
Parameters: - job_id – ID of the crawler job.
- errors – Errors that occurred during the crawl, if any.
- log_file – Path to the log file of the crawler job.
- results_uri – URI of the file containing the results of the crawl job, namely the records extracted.
- results_data – Optional data payload with the results list; if given, the results are not retrieved from results_uri, which is useful for slow or unreliable storage backends.
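In normal operation the crawler reports back through the HTTP endpoint configured in CRAWLER_SETTINGS (see the Configuration section below), but since submit_results is a plain Celery task it can also be dispatched directly. A minimal sketch, where the job id, log path and results URI are hypothetical values:

    from inspire_crawler.tasks import submit_results

    # Hypothetical values: in production these come from the finished
    # Scrapyd job and the crawler's own result export.
    submit_results.delay(
        job_id='5b6fe4dc4d0711e7a6e0f4e9d4f2cd2a',
        errors=[],  # no crawl errors reported
        log_file='/var/log/scrapyd/hepcrawl/myspider.log',
        results_uri='file:///tmp/harvest/myspider/results.jl',
    )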
- (task) inspire_crawler.tasks.schedule_crawl(spider, workflow, **kwargs)[source]¶ Schedule a crawl using configuration from the workflow objects.
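Because schedule_crawl is itself a Celery task, it can be dispatched asynchronously from application code. A minimal sketch, assuming a hypothetical hepcrawl spider name and workflow; extra keyword arguments are forwarded to the scheduler:

    from inspire_crawler.tasks import schedule_crawl

    # 'myspider', 'article' and 'from_date' are placeholder names for a
    # spider, a workflow and a spider argument.
    schedule_crawl.delay(
        spider='myspider',
        workflow='article',
        from_date='2017-01-01',
    )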
Signal receivers¶
Configuration¶
Configuration for crawler integration.
- inspire_crawler.config.CRAWLER_DATA_TYPE = 'hep'¶ WorkflowObject data_type to set on all workflow objects.
- inspire_crawler.config.CRAWLER_HOST_URL = 'http://localhost:6800'¶ URL to the Scrapyd HTTP server.
- inspire_crawler.config.CRAWLER_PROJECT = 'hepcrawl'¶ Scrapy project name to schedule crawls for.
- inspire_crawler.config.CRAWLER_SETTINGS = {'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results', 'API_PIPELINE_URL': 'http://localhost:5555/api/task/async-apply'}¶ Dictionary of settings to add to the crawlers.
By default it points to the flower task HTTP API and to the standard task to be called with the results of the harvesting.
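These defaults are typically overridden in the Invenio instance configuration. A sketch for a deployment where Scrapyd and flower run on dedicated hosts (the hostnames are assumptions):

    # invenio.cfg (sketch): point inspire-crawler at the deployment's
    # Scrapyd and flower services instead of localhost.
    CRAWLER_HOST_URL = 'http://scrapyd.example.org:6800'
    CRAWLER_PROJECT = 'hepcrawl'
    CRAWLER_SETTINGS = {
        'API_PIPELINE_URL': 'http://flower.example.org:5555/api/task/async-apply',
        'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results',
    }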
- inspire_crawler.config.CRAWLER_SPIDER_ARGUMENTS = {}¶ Spider arguments to pass when scheduling crawls.
For example, for the spider myspider:
{'myspider': {'somearg': 'foo'}}
You can also pass arguments directly to the scheduler as keyword arguments to schedule_crawl, as shown in the sketch below.
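A sketch of the per-call alternative, reusing the placeholder names from the example above:

    from inspire_crawler.tasks import schedule_crawl

    # Equivalent to the CRAWLER_SPIDER_ARGUMENTS entry above, but only
    # for this particular crawl.
    schedule_crawl.delay('myspider', 'article', somearg='foo')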
Models¶
Models for crawler integration.
- class inspire_crawler.models.CrawlerJob(**kwargs)[source]¶ Keeps track of submitted crawler jobs.
- classmethod create(job_id, spider, workflow, results=None, logs=None, status=<JobStatus.PENDING: 'pending'>)[source]¶ Create a new entry for a scheduled crawler job.
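A minimal sketch of recording a job manually. It assumes a running application context with the standard invenio_db session, and that create() only adds the row to the session so an explicit commit persists it; the job id and names are hypothetical:

    from invenio_db import db
    from inspire_crawler.models import CrawlerJob

    crawler_job = CrawlerJob.create(
        job_id='5b6fe4dc4d0711e7a6e0f4e9d4f2cd2a',  # hypothetical Scrapyd job id
        spider='myspider',
        workflow='article',
    )
    db.session.commit()  # assumption: create() does not commit by itself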
- id¶
- job_id¶
- logs¶
- results¶
- scheduled¶
- spider¶
- status¶
- workflow¶
- classmethod