simple-requests

Asynchronous requests in Python without thinking about it.

Goal

The goal of this library is to allow you to get the performance benefit of asynchronous requests, without needing to use any asynchronous coding paradigms. It is built on gevent and requests.

If you like getting your hands dirty, the gevent.pool.Pool and requests.Session that drive the main object are readily available for you to tinker with as much as you’d like.

Features

There is also some added functionality not available out-of-the-box from the base libraries:

  • Bounded concurrency and per-instance rate limiting (concurrent, minSecondsBetweenRequests)
  • Pluggable retry strategies (Strict, Lenient, Backoff)
  • Response preprocessing hooks (ResponsePreprocessor)
  • Ordered or unordered asynchronous iteration, with nested calls prioritized to minimize overall waiting
  • Optional patches for dealing with misbehaving servers (patch())

See the API section below for details.

Usage

from simple_requests import Requests

# Creates a session and thread pool
requests = Requests()

# Sends one simple request; the response is returned synchronously.
login_response = requests.one('http://cat-videos.net/login?user=fanatic&password=c4tl0v3r')

# Cookies are maintained in this instance of Requests, so subsequent requests
# will still be logged-in.
profile_urls = [
    'http://cat-videos.net/profile/mookie',
    'http://cat-videos.net/profile/kenneth',
    'http://cat-videos.net/profile/itchy' ]

# Asynchronously send all the requests for profile pages
for profile_response in requests.swarm(profile_urls):

    # Asynchronously send requests for each link found on the profile pages
    # These requests take precedence over those in the outer loop to minimize overall waiting
    # Order doesn't matter this time, so turn it off for a slight performance gain
    for friends_response in requests.swarm(profile_response.links, maintainOrder = False):

        # Do something intelligent with the responses, like using
        # regex to parse the HTML (see http://stackoverflow.com/a/1732454)
        friends_response.html

API

class simple_requests.Requests(concurrent=2, minSecondsBetweenRequests=0.15, defaultTimeout=None, retryStrategy=Strict(), responsePreprocessor=ResponsePreprocessor())

Creates a session and a thread pool. The intention is to have one instance per server that you're hitting at the same time.

Parameters:
  • concurrent – (optional) The maximum number of concurrent requests allowed for this instance.
  • minSecondsBetweenRequests – (optional) Every request is guaranteed to be separated by at least this many seconds.
  • defaultTimeout – (optional) Stop waiting after the server is unresponsive for this many seconds. See the requests docs for an exact definition of what constitutes a timeout. Should a timeout occur, the request will either be retried or an exception will be raised (depending on retryStrategy and responsePreprocessor).
  • retryStrategy – (optional) An instance of RetryStrategy (or subclass). Allows you to define if and how a request should be retried on failure. The default implementation (Strict) retries failed requests twice, for HTTP errors only, with at least 2 seconds between each subsequent request. Two other implementations are included: Lenient (good for really small servers, perhaps hosted out of somebody’s home), and Backoff (good for APIs).
  • responsePreprocessor – (optional) An instance of ResponsePreprocessor (or subclass). Useful if you need to override the default handling of successful responses and/or failed responses.
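
For example, an instance tuned for an API might look something like this; a minimal sketch, with parameter values that are purely illustrative:

from simple_requests import Requests, Backoff

# At most 4 concurrent requests, at least half a second between them,
# a 30-second timeout, and the exponential Backoff retry strategy.
# (Values are illustrative, not recommendations.)
api = Requests(
    concurrent = 4,
    minSecondsBetweenRequests = 0.5,
    defaultTimeout = 30,
    retryStrategy = Backoff())
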
session

An instance of requests.Session that manages things like maintaining cookies between requests.

pool

An instance of gevent.pool.Pool that manages things like maintaining the number of concurrent requests. Changes to this object should be done before any requests are sent.
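
For instance, because session is a plain requests.Session, you can set defaults on it before sending anything; a minimal sketch (the header value is made up):

from simple_requests import Requests

requests = Requests()

# session is an ordinary requests.Session, so the usual knobs apply;
# for example, a default header sent with every request from this instance.
requests.session.headers['User-Agent'] = 'cat-videos-crawler/1.0'
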

one(request, responsePreprocessor=None, bundleParam=None)

Execute one request synchronously.

Since this request is synchronous, it takes precedence over any other each() or swarm() calls which may still be processing.

Parameters:
  • request – A str, requests.Request, or requests.PreparedRequest. str (or any other basestring) will be executed as an HTTP GET.
  • responsePreprocessor – (optional) Override the default preprocessor for this request only.
  • bundleParam – (optional) Attach arbitrary data to the Bundle generated by this method.
Returns:

A requests.Response.
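Since a requests.Request is also accepted, one() isn't limited to GETs; a minimal sketch (the URL and form fields are made up, reusing the site from the Usage example):

from requests import Request
from simple_requests import Requests

requests = Requests()

# A plain string is executed as an HTTP GET...
home = requests.one('http://cat-videos.net/')

# ...while a requests.Request allows any method, headers, or body.
login = requests.one(Request('POST', 'http://cat-videos.net/login',
        data = { 'user': 'fanatic', 'password': 'c4tl0v3r' }))
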

swarm(iterable, maintainOrder=True, responsePreprocessor=None, bundleParam=None)

Execute each request asynchronously.

Subsequent calls to each(), swarm() or one() on the same Requests instance will be prioritized over earlier calls. This is generally aligned with how responses are processed (one response is inspected, which leads to more requests whose responses are inspected… etc.)

This method will try hard to finish executing all requests, even if the iterator has fallen out of scope, or an exception was raised, or even if the execution of the main module is finished. Use the stop() method to cancel any pending requests and/or kill executing requests.

Parameters:
  • iterable – A generator, list, tuple, dictionary, or any other iterable object containing any combination of str, requests.Request, or requests.PreparedRequest. str (or any other basestring) will be executed as an HTTP GET.
  • maintainOrder – (optional) By default, the returned responses are guaranteed to be in the same order as the requests. If this is not important to you, set this to False for a performance gain.
  • responsePreprocessor – (optional) Override the default preprocessor for these requests only.
  • bundleParam – (optional) Attach arbitrary data to all the Bundles generated by this method.
Returns:

A ResponseIterator that may be iterated over to get a requests.Response.
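As a minimal sketch (the URLs are made up), collecting every response body from an unordered swarm looks like this:

from simple_requests import Requests

requests = Requests()
urls = [ 'http://cat-videos.net/', 'http://cat-videos.net/top10' ]

# With maintainOrder = False the responses arrive as soon as they're ready;
# each item yielded by swarm() is an ordinary requests.Response.
bodies = [ response.text for response in requests.swarm(urls, maintainOrder = False) ]
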

each(iterable, mapToRequest=(lambda i: i.request), maintainOrder=False, responsePreprocessor=None, bundleParam=None)

Execute a request for each object.

Subsequent calls to each(), swarm() or one() on the same Requests instance will be prioritized over earlier calls. This is generally aligned with how responses are processed (one response is inspected, which leads to more requests whose responses are inspected... etc.)

This method will try hard to finish executing all requests, even if the iterator has fallen out of scope, or an exception was raised, or even if the execution of the main module is finished. Use the stop() method to cancel any pending requests and/or kill executing requests.

Parameters:
  • iterable – A generator, list, tuple, dictionary, or any other iterable.
  • mapToRequest – (optional) By default, the request attribute (or property) is used. If such an attribute does not exist (or some other behaviour is desired), this function will be used to get a request for each object in the iterable. The returned value must be a str, requests.Request, or requests.PreparedRequest.
  • maintainOrder – (optional) By default, the order is not maintained between the iterable and the responses. Set this to True to guarantee that the order is maintained.
  • responsePreprocessor – (optional) Override the default preprocessor for these requests only.
  • bundleParam – (optional) Attach arbitrary data to all the Bundles generated by this method.
Returns:

A ResponseIterator that may be iterated over to get a (requests.Response, object) tuple for each object in the given iterable.
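For example, each() keeps every response paired with the object that produced it; a sketch assuming a hypothetical Profile class (names and URLs reused from the Usage example):

from simple_requests import Requests

requests = Requests()

class Profile(object):
    # Hypothetical application object; each() looks for a `request`
    # attribute by default (see mapToRequest above).
    def __init__(self, name):
        self.name = name
        self.request = 'http://cat-videos.net/profile/%s' % name

profiles = [ Profile('mookie'), Profile('kenneth'), Profile('itchy') ]

# each() yields (response, obj) tuples, so every response stays paired
# with the Profile that generated it.
for response, profile in requests.each(profiles):
    print(profile.name, response.status_code)
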

stop(killExecuting=True)

Stop the execution of requests early.

The swarm() and each() methods will try hard to finish executing all requests, even if the iterator has fallen out of scope, or an exception was raised, or even if the execution of the main module is finished.

Use this method to cancel all pending requests.

Parameters:
  • killExecuting – (optional) In addition to canceling pending requests, kill any currently-executing requests so that their responses will not be returned. While this has the benefit of guaranteeing that there will be no more activity once the method returns, it means that it cannot be determined whether those requests succeeded, failed, or had any side-effects on the server.
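
A minimal sketch of cutting a swarm short (URLs reused from the Usage example):

from simple_requests import Requests

requests = Requests()
profile_urls = [ 'http://cat-videos.net/profile/mookie',
                 'http://cat-videos.net/profile/kenneth',
                 'http://cat-videos.net/profile/itchy' ]

# Bail out of the swarm as soon as we've found what we're looking for.
# killExecuting = False lets in-flight requests finish quietly; nothing
# new will be started.
for response in requests.swarm(profile_urls):
    if 'mookie' in response.url:
        requests.stop(killExecuting = False)
        break
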
class simple_requests.Bundle

This class is used internally to "bundle up" a request, response, and associated object (if it exists).

request

An instance of requests.Request.

response

An instance of requests.Response. May be None if the request has not yet been sent, or if an exception was raised while the request was being prepared.

exception

An instance of Exception. Will be None if the request has not yet been sent or was executed successfully.

traceback

A traceback object that specifies where the exception was raised. Will be None if the request has not yet been sent or was executed successfully.

param

Arbitrary data that may be passed-in as an optional parameter into one(), swarm() or each(). Useful when overriding something core and you need to pass in data that may be different per method call.

obj

The object associated with the request, if each() was used. None otherwise.

hasobj

True if obj is set to an associated object, False otherwise. Necessary in case the object itself is None.

class simple_requests.ResponsePreprocessor

Default implementation of how responses are preprocessed.

By default, successful responses are returned and errors are raised. Whatever is returned by success() and error() is what gets returned by one() and the iterator of swarm().

There are several reasons you may want to override the default implementation:

  • Don’t raise an exception on HTTP errors
  • Add a side-effect, such as writing all responses to an archive file
  • Responses in your application must always be pre-processed in a specific way
  • More...
success(bundle)

Handle a successful request.

Parameters:
  • bundle – A Bundle. The response attribute will be populated, and exception will be None.
error(bundle)

Handle a failed request.

Parameters:
  • bundle – A Bundle. The exception attribute will be populated, but response may be None (depending on the kind of error).
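
A sketch of such an override (LoggingPreprocessor is a made-up name; whatever success() and error() return is what the caller receives):

from simple_requests import Requests, ResponsePreprocessor

class LoggingPreprocessor(ResponsePreprocessor):
    # Hypothetical override: never raise; failed requests yield None instead.
    def success(self, bundle):
        return bundle.response

    def error(self, bundle):
        print('Request failed: %s' % bundle.exception)
        return None

requests = Requests(responsePreprocessor = LoggingPreprocessor())
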
class simple_requests.Strict

A good default RetryStrategy.

Retries up to two times, with 2 seconds between each attempt.

Only HTTP errors are retried.

class simple_requests.Lenient

A RetryStrategy designed for very small servers.

Small servers are expected to go down every now and then, so this strategy retries requests that returned an HTTP error up to 4 times, with a full minute between each attempt.

Other (non-HTTP) errors are retried only once after 60 seconds.

class simple_requests.Backoff

A RetryStrategy designed for APIs.

Since APIs are expected to work, this implementation retries requests that return HTTP errors many times, with an exponentially-increasing time in between each request (capped at 60 seconds).

Other (non-HTTP) errors are retried only once after 10 seconds.

class simple_requests.HTTPError(response)

Encapsulates HTTP errors (status codes in the 400s and 500s).

code

Status code for this error.

msg

The reason (associated with the status code).

response

The instance of requests.Response which triggered the error.
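Since the default preprocessor raises these, handling one is an ordinary try/except; a minimal sketch (the URL is made up):

from simple_requests import HTTPError, Requests

requests = Requests()

try:
    response = requests.one('http://cat-videos.net/profile/no-such-user')
except HTTPError as e:
    # code, msg and response are all available on the raised exception.
    print('Got %d (%s) for %s' % (e.code, e.msg, e.response.url))
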

function simple_requests.patch(allowIncompleteResponses=False, avoidTooManyOpenFiles=False)

Each argument can be True to perform the patch, or False to "unpatch".

Because the problems these patches are intended to fix stem from misbehaving servers, they are very difficult to test. Please take this into consideration before using them in a production environment.
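
Applying and reverting a patch is a single call either way; a minimal sketch:

from simple_requests import patch

# Tolerate servers that report an incorrect content-length...
patch(allowIncompleteResponses = True)

# ...and revert to the default behaviour when no longer needed.
patch(allowIncompleteResponses = False)
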

allowIncompleteResponses

Affects httplib. Allows continued processing when the actual amount of data is different than the amount specified ahead of time by the server (using the content-length header). What are the possible scenarios?

  • The server calculated the content-length incorrectly, and you actually got all the data. This patch fixes that scenario, which happens surprisingly often.
  • You didn't get the full payload. In this case, you'll likely get a ContentDecodingError if the response has been compressed, or truncated content (which may be chopped in the middle of a multi-byte character, resulting in a UnicodeError). If you're parsing structured data, like XML or JSON, it will almost certainly be invalid. So, in many cases, an error will be raised anyway.

Note that this patch affects all python http connections, even those outside of simple-requests, requests, and urllib3.

avoidTooManyOpenFiles

Affects urllib3. simple-requests ultimately uses the extremely clever urllib3 library to manage connection pooling. This library implements a leaky bucket paradigm for handling pools. What does this mean? Some number of connections are kept open with each server to avoid the costly operation of opening a new one. Should the maximum number of connections be exceeded, for whatever reason, a new connection will be created. This bonus connection will be discarded once its job is done; it will not go back into the pool. This is considered to be a good compromise between performance and the number of open connections (which count as open files).

There are some scenarios whereby the number of open connections keeps increasing faster than they can be closed, and eventually, you get socket.error: [Errno 24] Too many open files. After this point, it's probably unrecoverable.

How many open connections can you have before this is a problem? On many systems, around 900.

This patch will add a speed-limit to the creation of new urllib3 connections. As long as there are fewer than 200 open connections, new ones will be created immediately. After that point, new connections are opened at a rate of 1 every 10 seconds. Once the number of open connections drops to below 200, they are created immediately again.

In addition to the speed-limit, for every 200 connections opened after 200 are already open, the garbage collector is forcefully run. Testing has shown that this can help close sockets lingering in a CLOSE_WAIT state (which counts as an open file).

Why is a speed-limit used instead of just blocking new connections from being opened? Because there are scenarios where this would cause a deadlock:

for r1 in requests.swarm(urls1):
    for r2 in requests.swarm(urls2):
        ...
        for r200 in requests.swarm(urls200):
            requests.one(url201)

If the problem is that a server is responding incredibly slowly to a swarm of requests, and even the speed limit isn't helping, your best options are:

  • Restructure your program to avoid nesting swarms/each.
  • Drastically decrease concurrent, increase minSecondsBetweenRequests, or add a defaultTimeout (all parameters to Requests).