API Docs
Tasks API
Celery tasks for dealing with the crawler.
inspire_crawler.tasks.submit_results(job_id, errors, log_file, results_uri, results_data=None)
    (task) Receive the submission of the results of a crawl job, then spawn the appropriate workflow according to whichever workflow the crawl job specifies.

    Parameters:
    - job_id – Id of the crawler job.
    - errors – Errors that occurred during the crawl, if any.
    - log_file – Path to the log file of the crawler job.
    - results_uri – URI of the file containing the results of the crawl job, namely the records extracted.
    - results_data – Optional data payload with the results list, to skip retrieving them from results_uri; useful for slow or unreliable storages.
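The results_data short-circuit described above can be sketched in plain Python. This is an illustration only, not the actual inspire_crawler implementation; the helper name collect_records is hypothetical.

```python
import json
import tempfile


def collect_records(results_uri, results_data=None):
    """Return the crawled records, preferring the inline payload.

    Hypothetical sketch: if results_data is given, the (possibly slow or
    unreliable) storage behind results_uri is never touched.
    """
    if results_data is not None:
        return results_data
    # Fall back to reading the results file referenced by the URI.
    with open(results_uri) as results_file:
        return json.load(results_file)
```

Passing results_data alongside results_uri keeps the URI available for auditing while avoiding the fetch on the hot path.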
inspire_crawler.tasks.schedule_crawl(spider, workflow, **kwargs)
    (task) Schedule a crawl using configuration from the workflow objects.
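Scheduling ultimately targets Scrapyd's schedule.json endpoint using the CRAWLER_HOST_URL and CRAWLER_PROJECT settings described below. The following is a hedged sketch of how such a request could be composed; build_schedule_request is a hypothetical name, not part of inspire_crawler.

```python
def build_schedule_request(host_url, project, spider, **spider_args):
    """Build the POST target and form data for Scrapyd's schedule.json.

    Illustrative only: extra keyword arguments become spider arguments,
    mirroring how **kwargs flows through schedule_crawl.
    """
    data = {'project': project, 'spider': spider}
    data.update(spider_args)
    return '{}/schedule.json'.format(host_url.rstrip('/')), data
```

Posting the returned form data to the returned URL is what asks Scrapyd to run the spider.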
Signal receivers
Configuration
Configuration for crawler integration.
inspire_crawler.config.CRAWLER_DATA_TYPE = 'hep'
    WorkflowObject data_type to set on all workflow objects.
inspire_crawler.config.CRAWLER_HOST_URL = 'http://localhost:6800'
    URL of the Scrapyd HTTP server.
inspire_crawler.config.CRAWLER_PROJECT = 'hepcrawl'
    Scrapy project name to schedule crawls for.
inspire_crawler.config.CRAWLER_SETTINGS = {'API_PIPELINE_TASK_ENDPOINT_DEFAULT': 'inspire_crawler.tasks.submit_results', 'API_PIPELINE_URL': 'http://localhost:5555/api/task/async-apply'}
    Dictionary of settings to add to the crawlers. By default it points to the Flower tasks HTTP API and to the standard task to be called with the results of the harvesting.
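The two defaults above combine into the callback the crawler pipeline makes when a job finishes: a POST to Flower's task-apply API naming submit_results as the task. A minimal sketch, assuming that request shape (build_callback_request is a hypothetical helper, not part of inspire_crawler):

```python
import json


def build_callback_request(api_url, task_endpoint, **task_kwargs):
    """Compose the URL and JSON body for the results callback.

    Assumption: the API at api_url accepts the task name as a path
    segment and the task's keyword arguments in a JSON body.
    """
    url = '{}/{}'.format(api_url.rstrip('/'), task_endpoint)
    body = json.dumps({'kwargs': task_kwargs})
    return url, body
```

With the default settings, the callback targets submit_results with the job's id, errors, log file, and results URI as keyword arguments.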
inspire_crawler.config.CRAWLER_SPIDER_ARGUMENTS = {}
    Spider arguments to be passed when scheduling tasks. For example, for spider myspider:

    {'myspider': {'somearg': 'foo'}}

    You can also pass arguments directly to the scheduler with kwargs.
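Putting the configuration variables together, an instance's config override could look like the following. All values here are illustrative; the arxiv spider name and its sets argument are assumptions, not defaults of the package.

```python
# Example instance configuration overrides (illustrative values only).
CRAWLER_HOST_URL = 'http://scrapyd.example.org:6800'  # your Scrapyd server
CRAWLER_PROJECT = 'hepcrawl'
CRAWLER_SPIDER_ARGUMENTS = {
    # Hypothetical spider name and argument:
    'arxiv': {'sets': 'physics:hep-th'},
}
```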
Models
Models for crawler integration.
class inspire_crawler.models.CrawlerJob(**kwargs)
    Keeps track of submitted crawler jobs.

    classmethod create(job_id, spider, workflow, results=None, logs=None, status=<JobStatus.PENDING: 'pending'>)
        Create a new entry for a scheduled crawler job.
    id
    job_id
    logs
    results
    scheduled
    spider
    status
    workflow
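The bookkeeping CrawlerJob performs can be mirrored in plain Python. The real model is a database-backed class; this dataclass sketch only reproduces its columns (the database-assigned id is omitted) and the create signature shown above, with only the PENDING status documented here.

```python
import enum
from dataclasses import dataclass, field
from datetime import datetime


class JobStatus(enum.Enum):
    # Only PENDING appears in these docs; the real enum may define more.
    PENDING = 'pending'


@dataclass
class CrawlerJobSketch:
    """Illustrative stand-in for the database-backed CrawlerJob model."""
    job_id: str
    spider: str
    workflow: str
    results: str = None
    logs: str = None
    status: JobStatus = JobStatus.PENDING
    scheduled: datetime = field(default_factory=datetime.now)

    @classmethod
    def create(cls, job_id, spider, workflow, results=None, logs=None,
               status=JobStatus.PENDING):
        # Mirrors the signature of CrawlerJob.create shown above.
        return cls(job_id=job_id, spider=spider, workflow=workflow,
                   results=results, logs=logs, status=status)
```

A newly created job starts out pending, with results and logs filled in later by submit_results once the crawl finishes.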
-
classmethod