Usage
Installing inspire-crawler attaches a signal receiver to the oaiharvest_finished signal of the invenio-oaiharvester module. When that signal fires, the function inspire_crawler.receivers.receive_oaiharvest_job() runs, checks for certain arguments (spider name and workflow name), and schedules a crawl job.
To schedule a crawl yourself, call inspire_crawler.tasks.schedule_crawl() with the required arguments:
- spider: name of spider to execute
- workflow: name of workflow to execute when receiving crawled items
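A direct call could be wrapped as in the sketch below. It assumes schedule_crawl is a Celery task and therefore exposes .delay(). The helper name queue_crawl is hypothetical and not part of inspire-crawler. The task is passed in as a parameter so that the snippet runs without the package installed.

```python
def queue_crawl(task, spider, workflow, **spider_args):
    """Queue one crawl job (hypothetical helper, not part of inspire-crawler).

    `task` is assumed to be a Celery task such as
    inspire_crawler.tasks.schedule_crawl.
    """
    # Both names are required; fail early instead of queuing a broken job.
    if not spider or not workflow:
        raise ValueError("both 'spider' and 'workflow' are required")
    # Any extra keyword arguments are passed along as spider arguments.
    return task.delay(spider=spider, workflow=workflow, **spider_args)


# Possible usage inside an application where inspire-crawler is installed:
#   from inspire_crawler.tasks import schedule_crawl
#   queue_crawl(schedule_crawl, "WSP", "my_ingestion_workflow",
#               ftp_host="ftp.example.com")
```

Checking the two required names before dispatch surfaces a missing argument at the call site rather than later in a worker log.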
To schedule crawler jobs automatically in other ways, you can, for example, use CELERYBEAT_SCHEDULE in your Flask configuration:
from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    # Crawl World Scientific every Sunday at 2 AM
    'world-scientific-sunday': {
        'task': 'inspire_crawler.tasks.schedule_crawl',
        # day_of_week=0 is Sunday in Celery's crontab
        'schedule': crontab(minute=0, hour=2, day_of_week=0),
        # Passed to schedule_crawl as keyword arguments
        'kwargs': {
            "spider": "WSP",
            "workflow": "my_ingestion_workflow",
            "ftp_host": "ftp.example.com",
            "ftp_netrc": "/some/folder/netrc"
        }
    }
}
Note
You need to provide the arguments spider and workflow alongside any other spider arguments.
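One way to keep several schedule entries consistent with this requirement is a small builder function. The sketch below is only an illustration: crawl_entry and SCHEDULE_CRAWL_TASK are hypothetical names, and it uses a datetime.timedelta interval (which Celery beat also accepts as a schedule) so that it runs without Celery installed.

```python
from datetime import timedelta

# Dotted path of the task, as used in CELERYBEAT_SCHEDULE entries.
SCHEDULE_CRAWL_TASK = "inspire_crawler.tasks.schedule_crawl"


def crawl_entry(schedule, spider, workflow, **spider_args):
    """Build one beat schedule entry (hypothetical helper)."""
    # spider and workflow must always be present in the task kwargs.
    if not spider or not workflow:
        raise ValueError("both 'spider' and 'workflow' are required")
    kwargs = {"spider": spider, "workflow": workflow}
    # Remaining keyword arguments are forwarded as spider arguments.
    kwargs.update(spider_args)
    return {"task": SCHEDULE_CRAWL_TASK, "schedule": schedule, "kwargs": kwargs}


CELERYBEAT_SCHEDULE = {
    # Crawl World Scientific once every seven days
    "world-scientific-weekly": crawl_entry(
        timedelta(days=7),
        "WSP",
        "my_ingestion_workflow",
        ftp_host="ftp.example.com",
    ),
}
```

A crontab object can be passed as the schedule argument in the same way when Celery is available.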