Reference
A complete reference to crawley’s source code
Crawlers
- 
class crawley.crawlers.base.BaseCrawler(storage=None, sessions=None, debug=False)[source]
 
User’s crawlers must inherit from this class. They may
override some methods and define the start_urls list,
the scrapers and the maximum crawling depth. A usage
sketch follows the attribute list below.
- 
allowed_urls
 
A list of URLs allowed to be crawled
- 
extractor
 
The extractor class. Default is XPathExtractor
- 
get_urls(html)[source]
 
Returns a list of URLs found in the current HTML page
- 
login
 
The login data. A tuple of (url, login_dict).
Example: ("http://www.mypage.com/login", {'user': 'myuser', 'pass': 'mypassword'})
- 
max_depth
 
The maximum crawling recursion level
- 
post_urls
 
The POST data for the URLs. A list of tuples containing (url, data_dict).
Example: ("http://www.mypage.com/post_url", {'page': '1', 'color': 'blue'})
- 
scrapers
 
A list of scraper classes
- 
start()[source]
 
Crawler’s run method
- 
start_urls
 
A list containing the start URLs for the crawler
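A minimal crawler sketch built from the attributes documented above. The import path follows crawley’s tutorials; the site, patterns and depth are illustrative:
    from crawley.crawlers import BaseCrawler

    class MyCrawler(BaseCrawler):
        # Where the crawl begins
        start_urls = ["http://www.mypage.com/"]
        # Restrict which URLs may be crawled
        allowed_urls = ["http://www.mypage.com/"]
        # Scraper classes applied to fetched pages (see the Scrapers section)
        scrapers = []
        # Maximum crawling recursion level
        max_depth = 2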
- 
class crawley.crawlers.base.CrawlerMeta(name, bases, dct)[source]
 
This metaclass adds the user’s crawlers to a list
used by the CLI commands.
Abstract base crawlers won’t be added.
 
Scrapers
User’s Scrapers Base
- 
class crawley.scrapers.BaseScraper[source]
 
User’s scrapers must inherit from this class,
implement the scrape method and define
the URLs that may be processed by it. A usage
sketch follows the scrape method below.
- 
get_urls(response)
 
Returns a list of URLs found in the current HTML page
- 
scrape(response)[source]
 
Define the data you want to extract here
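A minimal scraper sketch. The matching_urls attribute and the response.html attribute follow crawley’s tutorials and are assumptions here, since this reference does not list them:
    from crawley.scrapers import BaseScraper

    class ArticleScraper(BaseScraper):
        # URL patterns this scraper may process (attribute name taken
        # from crawley's tutorials, not from this reference)
        matching_urls = ["%article%"]

        def scrape(self, response):
            # response.html is the parsed document in crawley's tutorials;
            # the XPath expression is illustrative
            titles = response.html.xpath("//h1/text()")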
 
Persistence
- 
class crawley.persistance.databases.Entity(**kwargs)[source]
 
Base Entity.
Every Crawley entity must inherit from this class.
An example entity is sketched after the setup function below.
- 
class crawley.persistance.databases.UrlEntity(**kwargs)[source]
 
Entity intended to store URLs
- 
crawley.persistance.databases.setup(entities)[source]
 
Sets up the database based on a list of the user’s entities
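A minimal entity sketch. In crawley’s tutorials, Field and Unicode are elixir constructs re-exported by crawley.persistance; treat the exact imports as assumptions, since this reference does not list them:
    from crawley.persistance import Entity, Field, Unicode

    class ArticleEntity(Entity):
        # One elixir field per scraped value
        title = Field(Unicode(255))
        body = Field(Unicode(255))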
 
Connector
Database connectors for elixir
- 
class crawley.persistance.connectors.Connector(settings)[source]
 
A Connector represents an object that can provide the
database connection to the elixir framework.
A usage sketch follows the connector list below.
- 
get_connection_string()[source]
 
Returns the connection string for the corresponding database
- 
class crawley.persistance.connectors.HostConnector(settings)[source]
 
A connector for a database that requires host, user and password, e.g. PostgreSQL
- 
class crawley.persistance.connectors.MySqlConnector(settings)[source]
 
MySQL engine connector
- 
class crawley.persistance.connectors.OracleConnector(settings)[source]
 
Oracle engine connector
- 
class crawley.persistance.connectors.PostgreConnector(settings)[source]
 
PostgreSQL engine connector
- 
class crawley.persistance.connectors.SimpleConnector(settings)[source]
 
A simple connector for a database without host and user, e.g. SQLite
- 
class crawley.persistance.connectors.SqliteConnector(settings)[source]
 
SQLite3 engine connector
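A usage sketch for the connectors. The DATABASE_* setting names mirror the settings.py generated by startproject and are an assumption here, as is the exact connection string format:
    from crawley.persistance.connectors import SqliteConnector

    class settings:
        # Illustrative stand-in for a project's settings module
        DATABASE_ENGINE = 'sqlite'
        DATABASE_NAME = 'mydatabase'

    connector = SqliteConnector(settings)
    # Expected to yield an elixir/SQLAlchemy-style string such as
    # "sqlite:///mydatabase.db" (format assumed)
    print(connector.get_connection_string())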
 
Utils
Utilities module
- 
crawley.utils.url_matcher(url, pattern)[source]
 
Returns True if the url matches the given pattern
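A usage sketch; the pattern syntax is not specified in this reference and is assumed here to mirror the wildcard patterns used by scrapers:
    from crawley.utils import url_matcher

    # True when the url fits the wildcard pattern (semantics assumed)
    if url_matcher("http://www.mypage.com/article/1", "%article%"):
        print("matched")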
 
Manager
- 
crawley.manager.manage()[source]
 
Called when using the crawley command from the command line
- 
crawley.manager.run_cmd(args)[source]
 
Runs a crawley command
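A sketch of invoking a command programmatically, assuming args takes the same tokens as the command line:
    from crawley.manager import run_cmd

    # Programmatic equivalent of running "crawley run" inside a project
    # directory (the working-directory requirement is assumed)
    run_cmd(["run"])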
 
Commands
Command Line Tools
- 
class crawley.manager.commands.startproject.StartProjectCommand(args=None, project_type=None, project_name=None)[source]
 
Starts a new crawley project.
Copies the files inside conf/project_template in order 
to generate a new project
- 
class crawley.manager.commands.run.RunCommand(args=None, settings=None)[source]
 
Runs the user’s crawlers.
Reads the crawlers.py file to obtain the user’s crawler classes
and then runs them.
- 
class crawley.manager.commands.syncdb.SyncDbCommand(args=None, settings=None)[source]
 
Builds up the database.
Reads the user’s models.py file and generates a database from it.
- 
class crawley.manager.commands.shell.ShellCommand(args)[source]
 
Shows a URL’s data in a console as the XPathExtractor sees it,
so users can scrape the data interactively.
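A typical workflow with these commands from a shell (the project name is illustrative):
    crawley startproject myproject
    cd myproject
    # edit crawlers.py and models.py
    crawley syncdb
    crawley run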