Packages and Modules

Core Objects

Framework-agnostic objects and interfaces to SnapSearch backend service.

SnapSearch

Pythonic HTTP Client and Middleware Library for SnapSearch.

copyright: 2014 by SnapSearch
license: MIT, see LICENSE for more details.
author: LIU Yu
date: 2014/03/08
class SnapSearch.Client(api_email, api_key, request_parameters={}, api_url=None, ca_path=None)

Dispatches a URL to SnapSearch backend service, and receives a response containing a search-engine-optimized representation of the URL’s target.

__call__(current_url)
Parameters:

current_url – URL that the search engine robot is currently trying to access.

Returns:

the response from SnapSearch backend service, or None if the code field of the response body is neither "success" nor "validation_error".

Raises:
  • error.SnapSearchError – if either the status or headers property of the response is empty.
  • error.SnapSearchError – if the response body is malformed (i.e. not containing fields "code" and "content").
  • error.SnapSearchError – if the code field of the response body equals "validation_error".
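The return and raise rules above can be summarized in a short, self-contained sketch. It is not the library's implementation: `interpret_response` is a hypothetical helper, the `SnapSearchError` class is a local stand-in, and returning the `content` field on success is an assumption about which part of the response is handed back.

```python
class SnapSearchError(Exception):
    """Local stand-in for error.SnapSearchError so the sketch runs alone."""


def interpret_response(status, headers, body):
    """Hypothetical helper mirroring the documented return/raise rules."""
    if not status or not headers:
        raise SnapSearchError("empty status or headers")
    if not isinstance(body, dict) or "code" not in body or "content" not in body:
        raise SnapSearchError("malformed response body")
    if body["code"] == "validation_error":
        raise SnapSearchError("validation error: %r" % (body["content"],))
    if body["code"] == "success":
        # Assumption: the useful payload lives in the "content" field.
        return body["content"]
    # Any other code (neither "success" nor "validation_error") yields None.
    return None
```

A caller therefore has to handle three outcomes: a payload, `None`, or an exception.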
__init__(api_email, api_key, request_parameters={}, api_url=None, ca_path=None)
Parameters:
  • api_email – registered email as username for authentication against the SnapSearch backend service.
  • api_key – api key as password for authentication against the SnapSearch backend service.

Optional arguments:

Parameters:
  • request_parameters – dict of parameters to be JSON-serialized and sent to SnapSearch backend service.
  • api_url – URL to SnapSearch backend service.
  • ca_path – absolute path to an external CA bundle file.
Raises:
  • error.SnapSearchError – if api_url uses a non-https scheme (i.e. not starting with "https://").
  • error.SnapSearchError – if ca_path is either invalid or inaccessible.
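As an illustration of the two constructor errors, the following sketch performs equivalent checks before a Client would be created. `check_client_options` and the local `SnapSearchError` class are hypothetical names introduced here; how the library itself validates these arguments is not shown in this reference.

```python
import os
from urllib.parse import urlsplit


class SnapSearchError(Exception):
    """Local stand-in for error.SnapSearchError so the sketch runs alone."""


def check_client_options(api_url=None, ca_path=None):
    """Hypothetical pre-flight check mirroring the documented constructor errors."""
    # The backend URL must use the https scheme.
    if api_url is not None and urlsplit(api_url).scheme.lower() != "https":
        raise SnapSearchError("api_url must start with https://")
    # The CA bundle must be an existing, readable file.
    if ca_path is not None and not (
        os.path.isfile(ca_path) and os.access(ca_path, os.R_OK)
    ):
        raise SnapSearchError("ca_path is invalid or inaccessible")
    return True
```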
class SnapSearch.Detector(ignored_routes=[], matched_routes=[], check_file_extensions=False, robots_json=None, extensions_json=None)

Detects if the incoming HTTP request a) came from a search engine robot and b) is eligible for interception. The Detector inspects the following aspects of the incoming HTTP request:

  1. if the request uses HTTP or HTTPS protocol
  2. if the request uses HTTP GET method
  3. if the request is not from any ignored user agents (ignored robots take precedence over matched robots)
  4. if the request is accessing any route not matching the whitelist
  5. if the request is not accessing any route matching the blacklist
  6. if the request is not accessing any resource with an invalid file extension
  7. if the request has _escaped_fragment_ query parameter
  8. if the request is from any matched user agents
__call__(request)
Parameters: request (dict) – incoming HTTP request.
Returns: RFC 3986 percent-encoded full URL if the incoming HTTP request is eligible for interception, or None otherwise.
Raises: error.SnapSearchError – if the structure of either robots.json or extensions.json is invalid.
__init__(ignored_routes=[], matched_routes=[], check_file_extensions=False, robots_json=None, extensions_json=None)

Optional arguments:

Parameters:
  • ignored_routes (list or tuple) – blacklisted route regular expressions.
  • matched_routes (list or tuple) – whitelisted route regular expressions.
  • check_file_extensions (bool) – to check if the URL is going to a static file resource that should not be intercepted.
  • robots_json – absolute path to an external robots.json file.
  • extensions_json – absolute path to an external extensions.json file.
Raises: AssertionError – if extensions.json is specified, yet check_file_extensions is False.
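To make the role of `ignored_routes` and `matched_routes` concrete, here is a minimal sketch of blacklist and whitelist filtering with regular expressions. The function name `route_is_eligible`, the use of `re.search`, and the rule that an empty whitelist allows every route are assumptions made for illustration, not behavior confirmed by this reference.

```python
import re


def route_is_eligible(path, ignored_routes=(), matched_routes=()):
    """Hypothetical route filter; the Detector's real matching rules may differ."""
    # Blacklist: a single hit rules the route out.
    if any(re.search(pattern, path) for pattern in ignored_routes):
        return False
    # Whitelist: assumed to apply only when at least one pattern is given.
    if matched_routes:
        return any(re.search(pattern, path) for pattern in matched_routes)
    return True
```

With these assumptions, `route_is_eligible("/admin/users", ignored_routes=[r"^/admin"])` is false, while a path that matches no blacklist entry and an empty whitelist is eligible.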

extensions

dict of lists of valid file extensions:

{
    "generic": [
        # valid generic extensions
    ],
    "python": [
        # valid python extensions
    ]
}

Can be changed to customize valid file extensions.
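A sketch of how such a dict could be consulted when `check_file_extensions` is enabled is given below. The helper `has_valid_extension`, the sample extension values, and the rule that a path without any extension counts as eligible are all assumptions for illustration.

```python
import posixpath


def has_valid_extension(path, extensions):
    """Hypothetical check against a dict shaped like the one above."""
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if not ext:
        # Assumption: no extension means the path is not a static file.
        return True
    valid = {item.lower() for group in extensions.values() for item in group}
    return ext in valid


# Sample values only; the shipped lists are not reproduced here.
sample_extensions = {"generic": ["html", "htm"], "python": ["py"]}
```

The `path` argument is expected to be the URL path without its query string.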

robots

dict of lists of user agents from search engine robots:

{
    "ignore": [
        # user agents to be ignored
    ],
    "match": [
        # user agents to be matched
    ]
}

Can be changed to customize ignored and matched search engine robots. The ignore list takes precedence over the match list.
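The precedence rule can be sketched as follows. The helper `is_matched_robot`, the case-insensitive substring comparison, and the sample user agent tokens are assumptions; only the ordering (ignore first, then match) comes from the text above.

```python
def is_matched_robot(user_agent, robots):
    """Hypothetical lookup in a dict shaped like the one above."""
    agent = user_agent.lower()
    # The ignore list is consulted first, so it wins over the match list.
    if any(token.lower() in agent for token in robots.get("ignore", [])):
        return False
    return any(token.lower() in agent for token in robots.get("match", []))


# Sample values only; the shipped lists are not reproduced here.
sample_robots = {"ignore": ["SnapSearch"], "match": ["Googlebot", "bingbot"]}
```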

class SnapSearch.Interceptor(client, detector, before_intercept=None, after_intercept=None)

Intercepts the incoming HTTP request, using an associated Detector object to detect search engine robots. If the request is eligible for interception, the associated Client object dispatches the requested URL to SnapSearch backend service for scraping.

__call__(request)
Parameters: request (dict) – incoming HTTP request.
Returns: the response from SnapSearch backend service (or the output of before_intercept(), if specified and if it returns a dict). On success, the response is a dict containing a search-engine-optimized representation of the URL’s target.
__init__(client, detector, before_intercept=None, after_intercept=None)
Parameters:
  • client – initialized Client object to associate
  • detector – initialized Detector object to associate

Optional arguments:

Parameters:
  • before_intercept (callable with signature "(url) -> result") – pre-interception callback object.
  • after_intercept (callable with signature "(url, response) -> None") – post-interception callback object.
Raises:
  • AssertionError – if client is not an instance of Client.
  • AssertionError – if detector is not an instance of Detector.
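The interplay of the two callbacks is easiest to see in a small sketch. The `intercept` function below is a hypothetical stand-in for the Interceptor call, and `fake_detector` and `fake_client` are stubs; returning None for ineligible requests and skipping `after_intercept` when `before_intercept` short-circuits are assumptions. The callbacks are wired to a plain dict to show one typical use, caching.

```python
def intercept(request, detector, client, before_intercept=None, after_intercept=None):
    """Hypothetical stand-in for the Interceptor call, not the library's code."""
    url = detector(request)
    if url is None:
        return None  # assumption: nothing is returned for ineligible requests
    if before_intercept is not None:
        result = before_intercept(url)
        if isinstance(result, dict):
            return result  # short-circuit, e.g. on a cache hit
    response = client(url)
    if after_intercept is not None:
        after_intercept(url, response)
    return response


# Stubs standing in for real Detector and Client objects.
backend_calls = []
cache = {}


def fake_detector(request):
    return request.get("url")


def fake_client(url):
    backend_calls.append(url)
    return {"html": "<html>snapshot of %s</html>" % url}


request = {"url": "http://example.com/"}
first = intercept(request, fake_detector, fake_client, cache.get, cache.__setitem__)
second = intercept(request, fake_detector, fake_client, cache.get, cache.__setitem__)
```

The second call is answered from `cache`, so the stub backend is contacted only once.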
after_intercept

post-interception callback object

before_intercept

pre-interception callback object

client

associated Client object

detector

associated Detector object

SnapSearch.api

Backend Service Abstraction Layer for SnapSearch.

copyright: 2014 by SnapSearch
license: MIT, see LICENSE for more details.
author: LIU Yu
date: 2014/03/08
class SnapSearch.api.AnyEnv(environ)

Wraps a CGI-style environ (builtin Python dict) with WSGI-defined variables and properties in SnapSearch-specified format and encoding.

GET

parsed QUERY_STRING as a multi-dict (i.e. each key associated with a list of values).
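For readers unfamiliar with the multi-dict shape, the standard library's `urllib.parse.parse_qs` produces the same kind of mapping. Whether AnyEnv uses this function internally is an assumption; the snippet only illustrates the data structure.

```python
from urllib.parse import parse_qs

# Each key maps to a list, so repeated parameters are preserved.
query = parse_qs("tag=python&tag=wsgi&page=2", keep_blank_values=True)
tags = query["tag"]    # all values of the repeated "tag" parameter
page = query["page"]   # a single value still comes back as a list
```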

__init__(environ)
Parameters: environ (builtin Python dict (see PEP 3333)) – CGI-style environment variables.
environ

underlying CGI-style environ extended with WSGI-defined variables.

method

getter of environ['REQUEST_METHOD'], or "N/A" if absent.

path_qs

relative request URL (without HTTP_HOST but with QUERY_STRING), decoded from RFC 3986 percent-encoding (e.g. '%20' -> ' ') and Google’s _escaped_fragment_ protocol.

scheme

getter of environ['wsgi.url_scheme'].

url

full request URL (including HTTP_HOST and QUERY_STRING), encoded to RFC 3986 percent-encoding (e.g. ' ' -> '%20'), but decoded from Google’s _escaped_fragment_ protocol.
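The two transformations can be sketched with the standard library. `rebuild_url` is a hypothetical helper, not the property's implementation: it percent-encodes the path and query, and turns an `_escaped_fragment_` query parameter back into a `#!` fragment. Edge cases of Google's scheme are left out.

```python
from urllib.parse import parse_qsl, quote, urlencode


def rebuild_url(scheme, host, path, query_string=""):
    """Hypothetical sketch of a percent-encoded URL with the hash-bang restored."""
    fragment = None
    kept = []
    for key, value in parse_qsl(query_string, keep_blank_values=True):
        if key == "_escaped_fragment_":
            fragment = value  # already percent-decoded by parse_qsl
        else:
            kept.append((key, value))
    url = "%s://%s%s" % (scheme, host, quote(path, safe="/"))
    if kept:
        # quote_via=quote encodes spaces as %20 rather than "+".
        url += "?" + urlencode(kept, quote_via=quote)
    if fragment is not None:
        url += "#!" + quote(fragment, safe="=&/")
    return url
```

For example, the path `/my app` with the query string `page=2&_escaped_fragment_=key%3Dvalue` becomes a URL ending in `/my%20app?page=2#!key=value`.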

user_agent

getter of environ['HTTP_USER_AGENT'], or "" if absent.

class SnapSearch.api.Response(**kwds)

Wraps an HTTP response from SnapSearch’s backend service.

body

response body (dict)

headers

HTTP headers (dict)

status

HTTP status code (int)

SnapSearch.error

copyright: 2014 by SnapSearch
license: MIT, see LICENSE for more details.
author: LIU Yu
date: 2014/03/08
exception SnapSearch.error.SnapSearchConnectionError(*args, **kwds)

Cannot communicate with SnapSearch backend service.

exception SnapSearch.error.SnapSearchDependencyError(*args, **kwds)

Cannot import package(s) required by SnapSearch.

exception SnapSearch.error.SnapSearchError(*args, **kwds)

Common base class for all SnapSearch errors.

Extensions

Bindings for various Python web technologies.

SnapSearch.cgi

copyright: 2014 by SnapSearch
license: MIT, see LICENSE for more details.
author: LIU Yu
date: 2014/03/09
class SnapSearch.cgi.InterceptorController(interceptor, response_callback=None)

Wraps a CGI script (see RFC 3875) by temporarily buffering its standard output, until the incoming HTTP request has been investigated by the associated Interceptor object. In case of interception, the buffered data will be replaced with the response from SnapSearch backend service.

__init__(interceptor, response_callback=None)
Parameters: interceptor – associated Interceptor object.

Optional argument(s):

Parameters: response_callback (callable with signature "(response) -> dict") – callback object for handling the response from SnapSearch backend service; the returned dict should contain at least the following three keys:

  • status: full HTTP status with code and string message
  • headers: list of 2-tuples for HTTP message headers
  • html: HTML content of the scraped URL
Raises: AssertionError – if interceptor is not an instance of Interceptor.
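A callback satisfying the three-key contract might look like the sketch below. The shape of the incoming `response` (an integer `status`, a `headers` list of `{"name": ..., "value": ...}` dicts, and an `html` string) is an assumption, as is the choice to forward only a `Location` header; adapt both to the actual backend response.

```python
from http.client import responses


def response_callback(response):
    """Example callback; the layout of `response` is assumed, not documented here."""
    code = int(response.get("status", 200))
    headers = [("Content-Type", "text/html; charset=utf-8")]
    for header in response.get("headers", []):
        # Forward redirects only; other backend headers are dropped in this sketch.
        if header["name"].lower() == "location":
            headers.append(("Location", header["value"]))
    return {
        "status": "%d %s" % (code, responses.get(code, "Unknown")),
        "headers": headers,
        "html": response.get("html", ""),
    }
```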
interceptor

associated Interceptor object.

response_callback

associated callback object.

start(environ=None)

Optional argument(s):

Parameters: environ – incoming HTTP request as a dict of variables, defaults to (a shallow copy of) os.environ.
Raises: AssertionError – if environ does not contain CGI-defined variables such as GATEWAY_INTERFACE (see RFC 3875).
stop(release=False)

Optional argument(s):

Parameters: release (bool) – release buffered data to standard output stream.
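The buffering idea behind start() and stop() can be illustrated with a few lines of standard-library code. `StdoutBuffer` is a hypothetical class written for this example; it does not reproduce the controller's interception logic, only the capture and optional release of standard output.

```python
import io
import sys


class StdoutBuffer:
    """Hypothetical sketch of temporary stdout buffering, not the library's class."""

    def __init__(self):
        self._original = None
        self._buffer = None

    def start(self):
        # Redirect everything the script prints into an in-memory buffer.
        self._original = sys.stdout
        self._buffer = io.StringIO()
        sys.stdout = self._buffer

    def stop(self, release=False):
        # Restore the real stream; optionally flush the captured text to it.
        data = self._buffer.getvalue()
        sys.stdout = self._original
        if release:
            sys.stdout.write(data)
        return data
```

Calling `stop()` without `release` discards the script's own output, which is what allows it to be replaced by other content.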

SnapSearch.wsgi

copyright: 2014 by SnapSearch
license: MIT, see LICENSE for more details.
author: LIU Yu
date: 2014/03/08
class SnapSearch.wsgi.InterceptorMiddleware(application, interceptor, response_callback=None)

Wraps a WSGI-defined web application (see PEP 3333) and intercepts incoming HTTP requests through the associated Interceptor object.

__call__(environ, start_response)

WSGI-defined web application interface (see PEP 3333).

__init__(application, interceptor, response_callback=None)
Parameters:
  • application – associated (wrapped) WSGI application object
  • interceptor – associated Interceptor object

Optional argument(s):

Parameters: response_callback (callable with signature "(response) -> dict") – callback object for handling the response from SnapSearch backend service; the returned dict should contain at least the following three keys:

  • status: full HTTP status with code and string message
  • headers: list of 2-tuples for HTTP message headers
  • html: HTML content of the scraped URL
Raises: AssertionError – if interceptor is not an instance of Interceptor.
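The wrapping pattern can be sketched as a plain WSGI middleware. `make_middleware` is a hypothetical function, and `demo_app`, `fake_interceptor` and `simple_callback` are stubs; how the real middleware prepares `environ` before handing it to the Interceptor is not covered here.

```python
def make_middleware(application, interceptor, response_callback):
    """Hypothetical sketch of the wrapping pattern, not the library's class."""

    def middleware(environ, start_response):
        response = interceptor(environ)
        if isinstance(response, dict):
            # Intercepted: answer with the snapshot instead of calling the app.
            message = response_callback(response)
            body = message["html"].encode("utf-8")
            headers = list(message["headers"]) + [("Content-Length", str(len(body)))]
            start_response(message["status"], headers)
            return [body]
        # Not intercepted: delegate to the wrapped application unchanged.
        return application(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"regular page"]


def fake_interceptor(environ):
    # Stub: treat any user agent containing "bot" as a search engine robot.
    if "bot" in environ.get("HTTP_USER_AGENT", "").lower():
        return {"html": "<p>snapshot</p>"}
    return None


def simple_callback(response):
    return {
        "status": "200 OK",
        "headers": [("Content-Type", "text/html; charset=utf-8")],
        "html": response["html"],
    }


wrapped = make_middleware(demo_app, fake_interceptor, simple_callback)
```

Because `wrapped` is itself a WSGI callable, it can be served by any WSGI server in place of the original application.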
application

associated WSGI application.

interceptor

associated Interceptor object.

response_callback

associated callback object.
