Reference
A complete reference to crawley’s source code
Crawlers
-
class crawley.crawlers.base.BaseCrawler(storage=None, sessions=None, debug=False)[source]
User's crawlers must inherit from this class. They may
override some methods and should define the start_urls list,
the scrapers and the maximum crawling depth. A usage sketch
follows the attribute list below.
-
allowed_urls
A list of urls allowed to be crawled
-
extractor
The extractor class. Default is XPathExtractor
-
get_urls(html)[source]
Returns a list of urls found in the current html page
-
login
The login data. A tuple of (url, login_dict).
Example: ("http://www.mypage.com/login", {'user': 'myuser', 'pass': 'mypassword'})
-
max_depth
The maximum crawling recursion level
-
post_urls
The POST data for the urls. A list of tuples containing (url, data_dict).
Example: ("http://www.mypage.com/post_url", {'page': '1', 'color': 'blue'})
-
scrapers
A list of scraper classes
-
start()[source]
Crawler’s run method
-
start_urls
A list containing the start urls for the crawler
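A minimal sketch of a user crawler wired from these attributes. The scraper class and urls are hypothetical, and the matching_urls attribute and its %-wildcard pattern syntax are assumed from typical crawley projects:

    from crawley.crawlers.base import BaseCrawler
    from crawley.scrapers import BaseScraper

    class MyScraper(BaseScraper):
        # urls this scraper may process; attribute name and
        # %-wildcard syntax are assumed, not confirmed here
        matching_urls = ["%"]

        def scrape(self, response):
            pass  # extract data here; see the Scrapers section below

    class MyCrawler(BaseCrawler):
        start_urls = ["http://www.example.com"]
        allowed_urls = ["%example.com%"]   # pattern syntax assumed
        scrapers = [MyScraper]
        max_depth = 1                      # recurse one level past the start urls
        login = ("http://www.example.com/login",
                 {'user': 'myuser', 'pass': 'mypassword'})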
-
class crawley.crawlers.base.CrawlerMeta(name, bases, dct)[source]
This metaclass adds the user’s crawlers to a list
used by the CLI commands.
Abstract base crawlers won’t be added.
Scrapers
Base classes for user's scrapers
-
class crawley.scrapers.BaseScraper[source]
User's scrapers must inherit from this class,
implement the scrape method and define
the urls that may be processed by it (see the sketch after this class).
-
get_urls(response)
Returns a list of urls found in the current html page
-
scrape(response)[source]
Define the data you want to extract here
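A sketch of a concrete scraper. It assumes the response object exposes the parsed lxml document as response.html when the default XPathExtractor is used, as in common crawley examples; the xpath and pattern are illustrative:

    from crawley.scrapers import BaseScraper

    class TitleScraper(BaseScraper):
        # urls this scraper may process; attribute name assumed
        matching_urls = ["%.html"]

        def scrape(self, response):
            # response.html is assumed to be the parsed lxml document
            # produced by the default XPathExtractor
            for title in response.html.xpath("//h1/text()"):
                print(title)  # or populate a persistance Entity instead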
Persistance
-
class crawley.persistance.databases.Entity(**kwargs)[source]
Base Entity.
Every crawley entity must inherit from this class
-
class crawley.persistance.databases.UrlEntity(**kwargs)[source]
Entity intended to save urls
-
crawley.persistance.databases.setup(entities)[source]
Sets up the database from a list of user's entities
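A sketch of a typical models.py consumed by syncdb. It assumes crawley.persistance re-exports elixir's Field and Unicode alongside Entity, as in the standard project template; the entity and its fields are illustrative:

    from crawley.persistance import Entity, Field, Unicode

    class PackageData(Entity):
        # each Field becomes a column; syncdb builds the table from it
        name = Field(Unicode(255))
        description = Field(Unicode(255))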
Connector
Database connectors for elixir
-
class crawley.persistance.connectors.Connector(settings)[source]
A Connector represents an object that can provide the
database connection to the elixir framework.
-
get_connection_string()[source]
Returns the connection string for the corresponding database
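A sketch of a custom connector. It assumes the constructor stores the settings module on self.settings and that the settings define Django-style DATABASE_* names; both assumptions should be checked against the source:

    from crawley.persistance.connectors import Connector

    class MariaDbConnector(Connector):
        """Hypothetical connector emitting an SQLAlchemy-style URL for elixir."""

        def get_connection_string(self):
            return "mysql://%s:%s@%s/%s" % (
                self.settings.DATABASE_USER,      # assumed settings attribute
                self.settings.DATABASE_PASSWORD,  # assumed settings attribute
                self.settings.DATABASE_HOST,      # assumed settings attribute
                self.settings.DATABASE_NAME,      # assumed settings attribute
            )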
-
class crawley.persistance.connectors.HostConnector(settings)[source]
A connector for a database that requires host, user and password, e.g. PostgreSQL
-
class crawley.persistance.connectors.MySqlConnector(settings)[source]
MySQL engine connector
-
class crawley.persistance.connectors.OracleConnector(settings)[source]
Oracle engine connector
-
class crawley.persistance.connectors.PostgreConnector(settings)[source]
PostgreSQL engine connector
-
class crawley.persistance.connectors.SimpleConnector(settings)[source]
A simple connector for a database without host and user, e.g. SQLite
-
class crawley.persistance.connectors.SqliteConnector(settings)[source]
SQLite3 engine connector
Utils
Utilities module
-
crawley.utils.url_matcher(url, pattern)[source]
Returns True if the url matches the given pattern
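A usage sketch; the %-wildcard semantics shown here mirror the patterns used in scrapers' matching_urls lists and are an assumption, not a guarantee:

    from crawley.utils import url_matcher

    url_matcher("http://www.example.com/products/1", "%products%")  # True (assumed)
    url_matcher("http://www.example.com/about.html", "%products%")  # False (assumed)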
Manager
-
crawley.manager.manage()[source]
Called when using the crawley command from the command line
-
crawley.manager.run_cmd(args)[source]
Runs a crawley command
Commands
Command Line Tools
-
class crawley.manager.commands.startproject.StartProjectCommand(args=None, project_type=None, project_name=None)[source]
Starts a new crawley project.
Copies the files inside conf/project_template to generate a new project.
-
class crawley.manager.commands.run.RunCommand(args=None, settings=None)[source]
Runs the user's crawlers.
Reads the crawlers.py file to obtain the user's crawler classes
and then runs them.
-
class crawley.manager.commands.syncdb.SyncDbCommand(args=None, settings=None)[source]
Builds up the database.
Reads the user's models.py file and generates a database from it.
-
class crawley.manager.commands.shell.ShellCommand(args)[source]
Shows a url's data in a console, just as the XPathExtractor sees it,
so users can scrape the data interactively.
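A typical command-line session exercising these command classes. This is a sketch; the exact argument forms may differ, so check crawley -h:

    ~$ crawley startproject myproject
    ~$ cd myproject
    ~$ crawley syncdb
    ~$ crawley run
    ~$ crawley shell http://www.example.com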