Usage

Installing inspire-crawler automatically attaches a signal receiver to the oaiharvest_finished signal of the invenio-oaiharvester module.

When that signal fires, the function inspire_crawler.receivers.receive_oaiharvest_job() runs. It checks for the required arguments (spider name and workflow name) and schedules a crawl job.

To schedule a crawl yourself, call inspire_crawler.tasks.schedule_crawl() with the required arguments:

  • spider: name of spider to execute
  • workflow: name of workflow to execute when receiving crawled items
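As a minimal sketch of how these arguments fit together, the snippet below collects them into one dictionary and rejects a missing spider or workflow early. The helper build_crawl_kwargs is hypothetical and not part of inspire-crawler; the spider and workflow names are only example values.

```python
def build_crawl_kwargs(spider, workflow, **spider_args):
    """Collect keyword arguments for a crawl.

    Hypothetical helper for illustration; not part of inspire-crawler.
    """
    if not spider:
        raise ValueError("spider is required")
    if not workflow:
        raise ValueError("workflow is required")
    # Required arguments first, then any extra spider-specific arguments.
    kwargs = {"spider": spider, "workflow": workflow}
    kwargs.update(spider_args)
    return kwargs


crawl_kwargs = build_crawl_kwargs(
    "WSP",
    "my_ingestion_workflow",
    ftp_host="ftp.example.com",
    ftp_netrc="/some/folder/netrc",
)
```

Assuming schedule_crawl is a Celery task, the resulting dictionary could then be passed on as schedule_crawl.delay(**crawl_kwargs).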

To schedule crawler jobs automatically by other means, you can, for example, set CELERYBEAT_SCHEDULE in your Flask configuration:

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    # Crawl World Scientific every Sunday at 2 AM
    'world-scientific-sunday': {
        'task': 'inspire_crawler.tasks.schedule_crawl',
        # In a Celery crontab, day_of_week=0 means Sunday
        'schedule': crontab(minute=0, hour=2, day_of_week=0),
        'kwargs': {
            # Required arguments
            'spider': 'WSP',
            'workflow': 'my_ingestion_workflow',
            # Additional spider arguments
            'ftp_host': 'ftp.example.com',
            'ftp_netrc': '/some/folder/netrc',
        },
    },
}

Note

You need to provide the arguments spider and workflow alongside any other spider arguments.
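One way to catch a forgotten argument before the scheduler starts is to scan the schedule dictionary for crawl entries that lack spider or workflow. The sketch below assumes the task path shown on this page; the helper find_incomplete_crawl_entries and the entry names are hypothetical, and the example entries omit the 'schedule' key for brevity.

```python
CRAWL_TASK = "inspire_crawler.tasks.schedule_crawl"
REQUIRED_KWARGS = ("spider", "workflow")


def find_incomplete_crawl_entries(schedule):
    """Return {entry name: [missing required kwargs]} for crawl entries.

    Hypothetical helper for illustration; not part of inspire-crawler.
    """
    problems = {}
    for name, entry in schedule.items():
        # Only inspect entries that schedule a crawl.
        if entry.get("task") != CRAWL_TASK:
            continue
        kwargs = entry.get("kwargs") or {}
        missing = [key for key in REQUIRED_KWARGS if not kwargs.get(key)]
        if missing:
            problems[name] = missing
    return problems


# Example entries ('schedule' key left out for brevity).
example_schedule = {
    "complete": {
        "task": CRAWL_TASK,
        "kwargs": {"spider": "WSP", "workflow": "my_ingestion_workflow"},
    },
    "no-workflow": {
        "task": CRAWL_TASK,
        "kwargs": {"spider": "WSP"},
    },
}
problems = find_incomplete_crawl_entries(example_schedule)
```

Here problems names only the "no-workflow" entry, since it is the one missing a required argument.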