Framework-agnostic objects and interfaces to the SnapSearch backend service.
Pythonic HTTP Client and Middleware Library for SnapSearch.
copyright: | 2014 by SnapSearch |
---|---|
license: | MIT, see LICENSE for more details. |
author: | LIU Yu |
date: | 2014/03/08 |
Dispatches a URL to the SnapSearch backend service and receives a response containing a search-engine-optimized representation of the URL’s target.
Parameters: | current_url – URL that the search engine robot is currently trying to access. |
---|---|
Returns: | the response from SnapSearch backend service, or None if the code field of the response body is neither "success" nor "validation_error". |
Raises: |
|
Parameters: |
|
---|
Optional arguments:
Parameters: |
|
---|---|
Raises: |
|
Detects if the incoming HTTP request a) came from a search engine robot and b) is eligible for interception. The Detector inspects the following aspects of the incoming HTTP request:
- if the request uses HTTP or HTTPS protocol
- if the request uses HTTP GET method
- if the request is not from any ignored user agents (ignored robots take precedence over matched robots)
- if the request is accessing a route matching the whitelist (a route not matching the whitelist is not intercepted)
- if the request is not accessing any route matching the blacklist
- if the request is not accessing any resource with an invalid file extension
- if the request has _escaped_fragment_ query parameter
- if the request is from any matched user agents
Parameters: | request (dict) – incoming HTTP request. |
---|---|
Returns: | RFC 3986 percent-encoded full URL if the incoming HTTP request is eligible for interception, or None otherwise. |
Raises error.SnapSearchError: | |
if the structure of either robots.json or extensions.json is invalid. |
Optional arguments:
Parameters: |
|
---|---|
Raises AssertionError: | |
if extensions.json is specified, yet check_file_extensions is False. |
dict of lists of valid file extensions:

    {
        "generic": [
            # valid generic extensions
        ],
        "python": [
            # valid python extensions
        ]
    }
Can be changed to customize valid file extensions.
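As a rough illustration of how a dict in this shape could drive the file extension check, here is a minimal sketch. The sample extensions, the function name `has_valid_extension`, and the rule that extensionless routes count as valid are all assumptions made for the example, not the library's actual data or behaviour.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical sample data in the documented shape (not the library's real lists).
EXTENSIONS = {
    "generic": ["html", "htm"],
    "python": ["py"],
}


def has_valid_extension(url, extensions=EXTENSIONS):
    """Return True if the URL path has no extension or a listed one."""
    path = urlsplit(url).path
    # splitext("/logo.png") -> ("/logo", ".png"); strip the dot and normalise case.
    ext = posixpath.splitext(path)[1].lstrip(".").lower()
    if not ext:
        # Assumption: a route without any extension is treated as valid.
        return True
    valid = {e.lower() for group in extensions.values() for e in group}
    return ext in valid
```

Flattening every list into one set keeps the lookup independent of which group an extension was declared in.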
dict of lists of user agents from search engine robots:

    {
        "ignore": [
            # user agents to be ignored
        ],
        "match": [
            # user agents to be matched
        ]
    }
Can be changed to customize ignored and matched search engine robots. The ignore list takes precedence over the match list.
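The checks listed for the Detector, including the precedence of the ignore list, can be sketched as a single function. This is only an illustration under stated assumptions: the request keys (`scheme`, `method`, `user_agent`, `path`, `query`), the robot names, the use of regular expressions for routes, and case-insensitive substring matching for user agents are all hypothetical. The sketch also assumes that either an `_escaped_fragment_` parameter or a matched user agent is enough to make a request eligible, and it omits the file extension check.

```python
import re
from urllib.parse import parse_qs

# Hypothetical robots in the documented {"ignore": [...], "match": [...]} shape.
ROBOTS = {
    "ignore": ["ExampleBot-Preview"],
    "match": ["ExampleBot"],
}


def _ua_in(user_agent, names):
    """Case-insensitive substring match (an assumption of this sketch)."""
    ua = user_agent.lower()
    return any(name.lower() in ua for name in names)


def is_eligible(request, robots=ROBOTS, whitelist=None, blacklist=None):
    """Return True if the request passes the documented checks, in order."""
    if request.get("scheme") not in ("http", "https"):
        return False
    if request.get("method") != "GET":
        return False
    user_agent = request.get("user_agent", "")
    # The ignore list is consulted first, so it wins over the match list.
    if _ua_in(user_agent, robots["ignore"]):
        return False
    path = request.get("path", "/")
    if whitelist and not any(re.search(p, path) for p in whitelist):
        return False
    if blacklist and any(re.search(p, path) for p in blacklist):
        return False
    query = parse_qs(request.get("query", ""), keep_blank_values=True)
    if "_escaped_fragment_" in query:
        return True
    return _ua_in(user_agent, robots["match"])
```

A user agent such as `ExampleBot-Preview/1.0` contains both a matched and an ignored name; because the ignore test runs first, the request is not intercepted.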
Intercepts the incoming HTTP request using an associated Detector object to detect search engine robots. If the request is eligible for interception, the associated Client object dispatches the requested URL to the SnapSearch backend service for scraping.
Parameters: | request (dict) – incoming HTTP request |
---|---|
Returns: | the response from the SnapSearch backend service (or the output of before_intercept(), if specified and if it returns a dict). On success, the response is a dict containing a search-engine-optimized representation of the URL’s target. |
Parameters: |
|
---|
Optional arguments:
Parameters: |
|
---|---|
Raises: |
|
post-interception callback object
pre-interception callback object
associated Client object
associated Detector object
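The interception flow described above can be sketched with plain callables standing in for the Detector and Client. The class name `SketchInterceptor`, the name `after_intercept`, both callback signatures, and returning None for requests that are not eligible are assumptions of this sketch; only `before_intercept()` and its dict short-circuit come from the description.

```python
class SketchInterceptor:
    """Illustrative stand-in, not the library's Interceptor class."""

    def __init__(self, client, detector,
                 before_intercept=None, after_intercept=None):
        self.client = client            # callable: url -> response
        self.detector = detector        # callable: request -> url or None
        self.before_intercept = before_intercept
        self.after_intercept = after_intercept

    def __call__(self, request):
        url = self.detector(request)
        if url is None:
            # Assumption: nothing is returned when the request is not eligible.
            return None
        if self.before_intercept is not None:
            early = self.before_intercept(url)
            if isinstance(early, dict):
                # A dict from the pre-interception callback replaces the
                # backend response, so the client is never called.
                return early
        response = self.client(url)
        if self.after_intercept is not None:
            self.after_intercept(url, response)
        return response
```

Short-circuiting on a dict is what makes the pre-interception callback usable as a cache lookup, with the post-interception callback as the natural place to store a fresh response.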
Backend Service Abstraction Layer for SnapSearch.
Wraps a CGI-style environ (builtin Python dict) with WSGI-defined variables and properties in SnapSearch-specified format and encoding.
parsed QUERY_STRING as a multi-dict (i.e. each key associated with a list of values).
Parameters: | environ (builtin Python dict (see PEP 3333)) – CGI-style environment variables |
---|
underlying CGI-style environ extended with WSGI-defined variables.
getter of environ['REQUEST_METHOD'], or "N/A" if absent.
relative request URL (without HTTP_HOST but with QUERY_STRING), decoded from RFC 3986 percent-encoding (e.g. '%20' -> ' ') and from Google’s _escaped_fragment_ protocol.
getter of environ['wsgi.url_scheme'].
full request URL (including HTTP_HOST and QUERY_STRING), encoded with RFC 3986 percent-encoding (e.g. ' ' -> '%20'), but decoded from Google’s _escaped_fragment_ protocol.
getter of environ['HTTP_USER_AGENT'], or "" if absent.
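To make the URL handling above concrete, the following sketch uses only `urllib.parse` from the standard library. The function names, the choice of characters left unescaped in `full_url`, and the way the remaining query parameters are re-joined are assumptions for illustration; the library's own encoding rules may differ in detail.

```python
from urllib.parse import parse_qs, parse_qsl, quote, unquote


def multi_dict(query_string):
    """Parse QUERY_STRING so that each key maps to a list of values."""
    return parse_qs(query_string, keep_blank_values=True)


def relative_url(path, query_string):
    """Percent-decode the URL and turn _escaped_fragment_ back into '#!'."""
    fragment = None
    kept = []
    for key, value in parse_qsl(query_string, keep_blank_values=True):
        if key == "_escaped_fragment_":
            fragment = value            # already percent-decoded by parse_qsl
        else:
            kept.append((key, value))
    url = unquote(path)
    if kept:
        url += "?" + "&".join(f"{k}={v}" for k, v in kept)
    if fragment is not None:
        url += "#!" + fragment
    return url


def full_url(scheme, host, path, query_string):
    """Rebuild the full URL and percent-encode it (safe set is an assumption)."""
    relative = relative_url(path, query_string)
    return f"{scheme}://{host}" + quote(relative, safe="/?&=#!")
```

For example, a request for `/my%20page` with the query string `a=1&_escaped_fragment_=%2Fabout` yields the relative URL `/my page?a=1#!/about`, and the full URL encodes the space again as `%20`.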
Cannot communicate with SnapSearch backend service.
Cannot import package(s) required by SnapSearch.
Common base class for all SnapSearch errors.
Bindings for various Python web technologies.
copyright: | 2014 by SnapSearch |
---|---|
license: | MIT, see LICENSE for more details. |
author: | LIU Yu |
date: | 2014/03/09 |
Wraps a CGI script (see RFC 3875) by temporarily buffering its standard output, until the incoming HTTP request has been investigated by the associated Interceptor object. In case of interception, the buffered data will be replaced with the response from SnapSearch backend service.
Parameters: | interceptor – associated Interceptor object |
---|
Optional argument(s):
Parameters: | response_callback (callable with signature "(response) -> dict") – callback object for handling the response from SnapSearch backend service; the returned dict should at least contain the following three keys:
|
---|---|
Raises AssertionError: | |
if interceptor is not an instance of Interceptor. |
associated Interceptor object.
associated callback object.
Optional argument(s):
Parameters: | environ – incoming HTTP request as a dict of variables, defaults to (a shallow copy of) os.environ. |
---|---|
Raises AssertionError: | |
if environ does not contain CGI-defined variables such as GATEWAY_INTERFACE (see RFC 3875). |
Optional argument(s):
Parameters: | release (bool) – release buffered data to standard output stream. |
---|
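The buffering idea behind the CGI wrapper can be sketched with `contextlib.redirect_stdout`: the script's output is captured and then either released unchanged or replaced. The function name `run_buffered` and its parameters are hypothetical, and the sketch leaves out request inspection and header handling.

```python
import contextlib
import io
import sys


def run_buffered(script, replacement=None, out=None):
    """Run a CGI-style callable with its standard output buffered.

    If `replacement` is given (for example, a response obtained after
    interception), it is written instead of the buffered output.
    """
    out = out if out is not None else sys.stdout
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        script()                        # everything print()ed lands in buffer
    if replacement is not None:
        out.write(replacement)          # discard the buffered data
    else:
        out.write(buffer.getvalue())    # release the buffered data unchanged
```

Deferring the write until the script has finished is what allows the original response to be dropped entirely when the request turns out to be eligible for interception.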
Wraps a WSGI-defined web application (see PEP 3333) and intercepts incoming HTTP requests through the associated Interceptor object.
Parameters: |
|
---|
Optional argument(s):
Parameters: | response_callback (callable with signature "(response) -> dict") – callback object for handling the response from SnapSearch backend service; the returned dict should at least contain the following three keys:
|
---|---|
Raises AssertionError: | |
if interceptor is not an instance of Interceptor. |
associated WSGI application.
associated Interceptor object.
associated callback object.
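The wrapping pattern described for the WSGI binding follows the usual PEP 3333 middleware shape, sketched below. The names `make_middleware` and `intercept`, and the convention that `intercept(environ)` returns either None or a `(status, headers, body)` tuple, are assumptions of this sketch; it does not reproduce the callback dict contract mentioned above.

```python
def make_middleware(application, intercept):
    """Wrap a WSGI application with a hypothetical interception hook.

    `intercept(environ)` returns None to pass the request through, or a
    (status, headers, body_bytes) tuple to answer it directly.
    """
    def middleware(environ, start_response):
        result = intercept(environ)
        if result is None:
            # Not intercepted: delegate to the wrapped application untouched.
            return application(environ, start_response)
        status, headers, body = result
        start_response(status, headers)
        return [body]
    return middleware
```

Because the middleware is itself a WSGI callable, it can be handed to any WSGI server in place of the original application.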