..
    This file is part of hepcrawl.
    Copyright (C) 2016 CERN.

    hepcrawl is a free software; you can redistribute it and/or modify it
    under the terms of the Revised BSD License; see LICENSE file for
    more details.

.. currentmodule:: hepcrawl


Deployment
==========

Traditionally, deployment of `scrapy` projects are done using `scrapyd`_ package. This adds a HTTP API on top
of `scrapy` to allow for adding/removing Scrapy projects and most importantly; scheduling crawls.

The easiest way to setup scrapy is to login on a machine, fork the `hepcrawl`_ repository, install it,
and use ``scrapyd-deploy`` command from ``scrapyd-client`` package to push the project to Scrapyd.


Install HEPCrawl
----------------

We will start with creating a Python virtual environment to install our packages:


.. code-block:: console

    mkvirtualenv hepcrawl
    cdvirtualenv
    mkdir src && cd src


Then proceed to install HEPCrawl on a remote server by cloning the sources.

.. code-block:: console

    $ git clone https://github.com/inspirehep/hepcrawl.git
    $ cd hepcrawl
    $ pip install .


This should install all dependencies, including Scrapyd.

Setup Scrapyd
-------------

Next, it is important that you setup your ``/etc/scrapyd/scrapyd.conf`` file with correct paths
in order for Scrapyd to store internal dbs, items, logs etc.

For example:

.. code-block:: text

    [scrapyd]
    eggs_dir    = /opt/hepcrawl/var/eggs
    logs_dir    = /opt/hepcrawl/var/logs
    items_dir   = /opt/hepcrawl/var/items
    dbs_dir     = /opt/hepcrawl/var/dbs


See `Scrapyd-documentation`_ for more config options.


Run Scrapyd
-----------

Now you can run the Scrapyd server in a separate terminal:

.. code-block:: console

    (hepcrawl)$ scrapyd


Scrapyd runs by default on port 6800. You can for example setup an webserver proxy (e.g. with nginx or apache). In addition, we recommend using a task-runner like ``supervisord`` to run the command as a daemon.


Deploy Scrapy project
---------------------

To deploy the HEPcrawl project, simply enter the source folder and run `scrapyd-deploy`.

.. code-block:: console

    (hepcrawl)$ cdvirtualenv src/hepcrawl
    (hepcrawl)$ scrapyd-deploy    # assumes a Scrapy server running on port 6800


This reads the configuration `scrapy.cfg` to deploy and by default the configuration should be correct.


Schedule crawls
---------------

Schedule crawls using e.g. curl (or via the `inspire-crawler`_ package):

.. code-block:: console

    $ curl http://crawler.example.org:6800/schedule.json -d project=hepcrawl -d spider=WSP


Pushing to remote servers
-------------------------

You can also choose to push the scrapyd project to a remote server. For this to work
you need to edit the ``scrapy.cfg`` in your Scrapy project sources and add your
remote server information:

.. code-block:: text

    [deploy:myserver]
    url = http://crawler.example.org
    project = hepcrawl
    #username = scrapy
    #password = secret


Finally deploy the egg of hepcrawl to the remote server:

.. code-block:: console

    $ scrapyd-deploy myserver



Install via PyPi
----------------

You can also install ``hepcrawl`` from PyPi and use ``pip`` to install the package (preferably in a virtual environment):

.. code-block:: console

    (hepcrawl)$ pip install hepcrawl

This will install all dependencies, including ``scrapyd``.


Enable Sentry
-------------

To enable sentry you need to install some packages:

.. code-block:: console

    pip install -e .[sentry]


And then add to your environment the variable ``APP_SENTRY_DSN`` with the connection information.

.. code-block:: console

    APP_SENTRY_DSN="https://foo:bar@sentry.example.com/1" scrapyd


.. note::

    If you have setup `supervisord` you can use the ``environment`` config option to add variables.



Known issues
============

Sentry integration with Python 2.7.9
------------------------------------

You need to install our fork of Raven:

.. code-block:: console

    pip install git+https://github.com/inspirehep/raven-python@master#egg=raven-python==5.1.1.dev20160118


.. _hepcrawl: https://github.com/inspirehep/hepcrawl
.. _scrapyd: http://scrapyd.readthedocs.io/
.. _Scrapyd-documentation: http://scrapyd.readthedocs.io/en/latest/config.html?highlight=database#configuration-file
.. _inspire-crawler: http://pythonhosted.org/inspire-crawler