..
    This file is part of hepcrawl.
    Copyright (C) 2016 CERN.

    hepcrawl is a free software; you can redistribute it and/or modify it
    under the terms of the Revised BSD License; see LICENSE file for more
    details.

.. currentmodule:: hepcrawl


Deployment
==========

Traditionally, deployment of `scrapy` projects is done using the `scrapyd`_
package. It adds an HTTP API on top of `scrapy` that lets you add and remove
Scrapy projects and, most importantly, schedule crawls.

The easiest way to set this up is to log in to a machine, fork the
`hepcrawl`_ repository, install it, and use the ``scrapyd-deploy`` command
from the ``scrapyd-client`` package to push the project to Scrapyd.


Install HEPCrawl
----------------

We start by creating a Python virtual environment in which to install our
packages:

.. code-block:: console

    mkvirtualenv hepcrawl
    cdvirtualenv
    mkdir src && cd src

Then proceed to install HEPCrawl on the remote server by cloning the
sources:

.. code-block:: console

    $ git clone https://github.com/inspirehep/hepcrawl.git
    $ cd hepcrawl
    $ pip install .

This should install all dependencies, including Scrapyd.


Set up Scrapyd
--------------

Next, it is important to set up your ``/etc/scrapyd/scrapyd.conf`` file with
the correct paths so that Scrapyd can store its internal databases, items,
logs, etc. For example:

.. code-block:: text

    [scrapyd]
    eggs_dir = /opt/hepcrawl/var/eggs
    logs_dir = /opt/hepcrawl/var/logs
    items_dir = /opt/hepcrawl/var/items
    dbs_dir = /opt/hepcrawl/var/dbs

See the `Scrapyd documentation`_ for more config options.


Run Scrapyd
-----------

Now you can run the Scrapyd server in a separate terminal:

.. code-block:: console

    (hepcrawl)$ scrapyd

Scrapyd listens on port 6800 by default. You can, for example, put a web
server proxy in front of it (e.g. nginx or Apache), as sketched below. In
addition, we recommend using a process manager such as ``supervisord`` to
run the command as a daemon.
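For illustration, a minimal nginx reverse-proxy configuration might look
like the sketch below. The hostname and config file location are
assumptions; adapt them to your setup:

.. code-block:: text

    # /etc/nginx/conf.d/scrapyd.conf (hypothetical location)
    server {
        listen 80;
        server_name crawler.example.org;  # assumed hostname

        location / {
            # Forward all requests to Scrapyd's default port
            proxy_pass http://127.0.0.1:6800;
            proxy_set_header Host $host;
        }
    }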
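Likewise, a minimal ``supervisord`` program definition could look like the
following sketch; the virtualenv and log paths are assumptions, not
something hepcrawl ships:

.. code-block:: text

    ; /etc/supervisord.d/scrapyd.ini (hypothetical location)
    [program:scrapyd]
    ; Assumed virtualenv path; point this at your own installation
    command = /opt/hepcrawl/venv/bin/scrapyd
    autostart = true
    autorestart = true
    stderr_logfile = /opt/hepcrawl/var/logs/scrapyd.err.log
    ; Environment variables (e.g. APP_SENTRY_DSN, see "Enable Sentry"
    ; below) can be passed via the ``environment`` option:
    ; environment = APP_SENTRY_DSN="https://foo:bar@sentry.example.com/1"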
Deploy Scrapy project
---------------------

To deploy the HEPcrawl project, simply enter the source folder and run
``scrapyd-deploy``:

.. code-block:: console

    (hepcrawl)$ cdvirtualenv src/hepcrawl
    (hepcrawl)$ scrapyd-deploy  # assumes a Scrapyd server running on port 6800

This reads the deployment configuration from ``scrapy.cfg``; the defaults
should be correct.


Schedule crawls
---------------

Schedule crawls using e.g. ``curl`` (or via the `inspire-crawler`_ package):

.. code-block:: console

    $ curl http://crawler.example.org:6800/schedule.json -d project=hepcrawl -d spider=WSP


Pushing to remote servers
-------------------------

You can also choose to push the Scrapy project to a remote server. For this
to work you need to edit the ``scrapy.cfg`` in your Scrapy project sources
and add your remote server information:

.. code-block:: text

    [deploy:myserver]
    url = http://crawler.example.org
    project = hepcrawl
    #username = scrapy
    #password = secret

Finally, deploy the egg of hepcrawl to the remote server:

.. code-block:: console

    $ scrapyd-deploy myserver


Install via PyPI
----------------

You can also install ``hepcrawl`` from PyPI with ``pip`` (preferably in a
virtual environment):

.. code-block:: console

    (hepcrawl)$ pip install hepcrawl

This will install all dependencies, including ``scrapyd``.


Enable Sentry
-------------

To enable Sentry you need to install some extra packages:

.. code-block:: console

    pip install -e .[sentry]

Then add the variable ``APP_SENTRY_DSN`` with the connection information to
your environment:

.. code-block:: console

    APP_SENTRY_DSN="https://foo:bar@sentry.example.com/1" scrapyd

.. note:: If you have set up ``supervisord``, you can use the
    ``environment`` config option to add such variables, as shown in the
    supervisord sketch above.


Known issues
============

Sentry integration with Python 2.7.9
------------------------------------

You need to install our fork of Raven:

.. code-block:: console

    pip install git+https://github.com/inspirehep/raven-python@master#egg=raven-python==5.1.1.dev20160118


.. _hepcrawl: https://github.com/inspirehep/hepcrawl
.. _scrapyd: http://scrapyd.readthedocs.io/
.. _Scrapyd documentation: http://scrapyd.readthedocs.io/en/latest/config.html?highlight=database#configuration-file
.. _inspire-crawler: http://pythonhosted.org/inspire-crawler