Deployment¶
Traditionally, deployment of Scrapy projects is done using the scrapyd package. This adds an HTTP API on top of Scrapy that allows adding and removing Scrapy projects and, most importantly, scheduling crawls.
The easiest way to set this up is to log in on a machine, clone the hepcrawl repository, install it, and use the scrapyd-deploy command from the scrapyd-client package to push the project to Scrapyd.
Install HEPCrawl¶
We will start by creating a Python virtual environment to install our packages into:
$ mkvirtualenv hepcrawl
(hepcrawl)$ cdvirtualenv
(hepcrawl)$ mkdir src && cd src
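The commands above come from the virtualenvwrapper package. If you do not have it, a plain virtual environment works just as well; a minimal sketch, where the environment path is only an example:
$ python -m venv ~/.virtualenvs/hepcrawl
$ source ~/.virtualenvs/hepcrawl/bin/activate
(hepcrawl)$ mkdir -p ~/.virtualenvs/hepcrawl/src && cd ~/.virtualenvs/hepcrawl/src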
Then proceed to install HEPCrawl on the server by cloning the sources:
$ git clone https://github.com/inspirehep/hepcrawl.git
$ cd hepcrawl
$ pip install .
This should install all dependencies, including Scrapyd.
Set up Scrapyd¶
Next, it is important to set up your /etc/scrapyd/scrapyd.conf file with the correct paths so that Scrapyd knows where to store its internal databases, items, logs, etc.
For example:
[scrapyd]
eggs_dir = /opt/hepcrawl/var/eggs
logs_dir = /opt/hepcrawl/var/logs
items_dir = /opt/hepcrawl/var/items
dbs_dir = /opt/hepcrawl/var/dbs
See the Scrapyd documentation for more configuration options.
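Depending on your system, you may need to create these directories yourself and make them writable by the user running Scrapyd. For the example configuration above:
$ mkdir -p /opt/hepcrawl/var/{eggs,logs,items,dbs}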
Run Scrapyd¶
Now you can run the Scrapyd server in a separate terminal:
(hepcrawl)$ scrapyd
Scrapyd runs by default on port 6800. You can, for example, set up a web server proxy in front of it (e.g. with nginx or Apache). In addition, we recommend using a process manager like supervisord to run the command as a daemon.
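As a sketch of what that could look like, here is a minimal supervisord program entry; the command path and log locations are assumptions, so adjust them to your installation:
[program:scrapyd]
; run scrapyd from the virtual environment created above
command=/path/to/virtualenvs/hepcrawl/bin/scrapyd
directory=/opt/hepcrawl
autostart=true
autorestart=true
stdout_logfile=/opt/hepcrawl/var/logs/scrapyd-stdout.log
stderr_logfile=/opt/hepcrawl/var/logs/scrapyd-stderr.log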
Deploy Scrapy project¶
To deploy the HEPCrawl project, simply enter the source folder and run scrapyd-deploy:
(hepcrawl)$ cdvirtualenv src/hepcrawl
(hepcrawl)$ scrapyd-deploy # assumes a Scrapyd server running on port 6800
This reads the deployment configuration from scrapy.cfg; by default the configuration should be correct for a local Scrapyd server.
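For reference, the default deploy target in scrapy.cfg typically looks something like the following; the actual file shipped with the sources may differ:
[deploy]
url = http://localhost:6800/
project = hepcrawl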
Schedule crawls¶
Schedule crawls using e.g. curl (or via the inspire-crawler package):
$ curl http://crawler.example.org:6800/schedule.json -d project=hepcrawl -d spider=WSP
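On success, Scrapyd responds with the id of the scheduled job, similar to:
{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}
You can then follow the job through the listjobs.json endpoint or the web interface on port 6800.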
Push to remote servers¶
You can also choose to push the project to a remote Scrapyd server. For this to work, you need to edit scrapy.cfg in your Scrapy project sources and add your remote server information:
[deploy:myserver]
url = http://crawler.example.org
project = hepcrawl
#username = scrapy
#password = secret
Finally, deploy the hepcrawl egg to the remote server:
$ scrapyd-deploy myserver
Install via PyPI¶
You can also install hepcrawl from PyPI using pip (preferably in a virtual environment):
(hepcrawl)$ pip install hepcrawl
This will install all dependencies, including scrapyd.
Enable Sentry¶
To enable Sentry you need to install some extra packages:
(hepcrawl)$ pip install -e .[sentry]
Then add the APP_SENTRY_DSN variable to your environment with the connection information:
APP_SENTRY_DSN="https://foo:bar@sentry.example.com/1" scrapyd
Note
If you have set up supervisord, you can use the environment config option to add variables.
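For example, extending the supervisord program entry sketched above, using the placeholder DSN from this section:
[program:scrapyd]
command=/path/to/virtualenvs/hepcrawl/bin/scrapyd
; pass the Sentry DSN to the scrapyd process
environment=APP_SENTRY_DSN="https://foo:bar@sentry.example.com/1"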