Reference

A complete reference to Crawley's source code.

Crawlers

class crawley.crawlers.base.BaseCrawler(storage=None, sessions=None, debug=False)[source]

User crawlers must inherit from this class. They may override some methods, and should define the start_urls list, the scrapers, and the maximum crawling depth.

allowed_urls

A list of the URLs allowed to be crawled.

extractor

The extractor class. Defaults to XPathExtractor.

get_urls(html)[source]

Returns a list of the URLs found in the current HTML page.

login

The login data. A tuple of (url, login_dict). Example: ("http://www.mypage.com/login", {'user': 'myuser', 'pass': 'mypassword'})

max_depth

The maximum crawling recursion depth.

post_urls

The POST data for the URLs. A list of (url, data_dict) tuples. Example: ("http://www.mypage.com/post_url", {'page': '1', 'color': 'blue'})

scrapers

A list of scraper classes.

start()[source]

The crawler's run method: starts the crawling process.

start_urls

A list containing the start URLs for the crawler.
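
A minimal sketch of a user crawler tying these attributes together. The URLs, the wildcard pattern syntax, and the MyScraper import are illustrative assumptions, not part of the library:

    from crawley.crawlers.base import BaseCrawler
    from myproject.scrapers import MyScraper  # hypothetical user scraper


    class MyCrawler(BaseCrawler):

        # Entry points for the crawl
        start_urls = ["http://www.mypage.com/"]

        # Urls allowed to be crawled (wildcard syntax is an assumption)
        allowed_urls = ["http://www.mypage.com/%"]

        # Maximum crawling recursion depth
        max_depth = 2

        # Scraper classes applied to fetched pages
        scrapers = [MyScraper]

        # Optional login data: a (url, login_dict) tuple
        login = ("http://www.mypage.com/login",
                 {"user": "myuser", "pass": "mypassword"})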

class crawley.crawlers.base.CrawlerMeta(name, bases, dct)[source]

This metaclass adds the user’s crawlers to a list used by the CLI commands. Abstract base crawlers won’t be added.

Scrapers

Base classes for user scrapers.

class crawley.scrapers.BaseScraper[source]

User scrapers must inherit from this class, implement the scrape method, and define the URLs it may process.

get_urls(response)

Returns a list of the URLs found in the current HTML.

scrape(response)[source]

Override this method to define the data you want to extract.
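
A minimal sketch of a user scraper. The matching_urls attribute, the lxml-style response.html tree (as produced by the default XPathExtractor), and the MyEntity import are assumptions for illustration:

    from crawley.scrapers import BaseScraper
    from myproject.models import MyEntity  # hypothetical entity


    class MyScraper(BaseScraper):

        # Urls this scraper may process (attribute name is an assumption)
        matching_urls = ["%"]

        def scrape(self, response):
            # With the default XPathExtractor, response.html is assumed
            # to be an lxml tree supporting xpath queries
            titles = response.html.xpath("//title/text()")
            if titles:
                # Persist the scraped value through a user entity
                MyEntity(title=titles[0])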

Persistence

class crawley.persistance.databases.Entity(**kwargs)[source]

Base entity.

Every Crawley entity must inherit from this class.

class crawley.persistance.databases.UrlEntity(**kwargs)[source]

Entity intended to store URLs.

crawley.persistance.databases.setup(entities)[source]

Sets up the database from a list of user entities.
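
A sketch of a project's models.py defining such entities. It assumes that crawley.persistance re-exports Elixir-style Field and Unicode declarations, as in the project template; the entity and field names are illustrative:

    from crawley.persistance import Entity, Field, Unicode


    class PageData(Entity):
        # One column per scraped value
        title = Field(Unicode(255))
        description = Field(Unicode(1024))

The syncdb command reads these entities and passes them to setup() to build the database.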

Connector

Database connectors for the Elixir ORM.

class crawley.persistance.connectors.Connector(settings)[source]

A Connector represents an object that can provide the database connection to the Elixir framework.

get_connection_string()[source]

Returns the connection string for the corresponding database.

class crawley.persistance.connectors.HostConnector(settings)[source]

A connector for databases that require host, user, and password, e.g. PostgreSQL.

class crawley.persistance.connectors.MySqlConnector(settings)[source]

MySQL engine connector.

class crawley.persistance.connectors.OracleConnector(settings)[source]

Oracle engine connector.

class crawley.persistance.connectors.PostgreConnector(settings)[source]

PostgreSQL engine connector.

class crawley.persistance.connectors.SimpleConnector(settings)[source]

A simple connector for databases without host and user, e.g. SQLite.

class crawley.persistance.connectors.SqliteConnector(settings)[source]

SQLite3 engine connector.
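
The connectors above build their connection strings from the project's database settings. A sketch of that block of settings.py, assuming the setting names generated by the startproject template:

    # Database settings (names assumed from the project template)
    DATABASE_ENGINE = 'sqlite'      # sqlite, mysql, postgres or oracle
    DATABASE_NAME = 'mydatabase'
    DATABASE_USER = ''              # used by HostConnector engines
    DATABASE_PASSWORD = ''
    DATABASE_HOST = ''
    DATABASE_PORT = ''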

Extractors

Data Extractors classes

class crawley.extractors.PyQueryExtractor[source]

Extractor using PyQuery (a jQuery-like library for Python).

class crawley.extractors.RawExtractor[source]

Returns the raw HTML data so you can scrape it with your favourite Python tool.

class crawley.extractors.XPathExtractor[source]

Extractor using XPath.
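
A sketch showing how a crawler selects its extractor class. XPathExtractor is the default, so overriding the attribute is only needed for PyQuery or raw HTML; the crawler name is illustrative:

    from crawley.crawlers.base import BaseCrawler
    from crawley.extractors import PyQueryExtractor


    class MyPyQueryCrawler(BaseCrawler):
        # Parse pages with PyQuery instead of the default XPathExtractor
        extractor = PyQueryExtractor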

Utils

Utilities module

crawley.utils.url_matcher(url, pattern)[source]

Returns True if the URL matches the given pattern.
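
An illustrative call, assuming the same wildcard-style patterns used in a crawler's allowed_urls:

    from crawley.utils import url_matcher

    # True if the url matches the pattern (wildcard syntax assumed)
    url_matcher("http://www.mypage.com/products/1",
                "http://www.mypage.com/%")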

Manager

crawley.manager.manage()[source]

Entry point invoked when the crawley command is used from the command line.

crawley.manager.run_cmd(args)[source]

Runs a Crawley command.

Commands

Command-line tools.

class crawley.manager.commands.startproject.StartProjectCommand(args=None, project_type=None, project_name=None)[source]

Starts a new crawley project.

Copies the files inside conf/project_template to generate a new project.

class crawley.manager.commands.run.RunCommand(args=None, settings=None)[source]

Runs the user's crawlers.

Reads the crawlers.py file to obtain the user's crawler classes and then runs them.

class crawley.manager.commands.syncdb.SyncDbCommand(args=None, settings=None)[source]

Builds up the database.

Reads the user's models.py file and generates a database from it.

class crawley.manager.commands.shell.ShellCommand(args)[source]

Shows a URL's data in a console, as the XPathExtractor sees it, so users can scrape the data interactively.
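
A typical command-line session, assuming each command class above maps to a crawley subcommand of the same name:

    $ crawley startproject myproject
    $ cd myproject
    $ crawley syncdb                # build the database from models.py
    $ crawley run                   # run the crawlers in crawlers.py
    $ crawley shell http://www.mypage.com/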
