Reference

A complete reference to Crawley's source code.

Crawlers

class crawley.crawlers.base.BaseCrawler(storage=None, sessions=None, debug=False)[source]

User crawlers must inherit from this class. They may override some methods, and should define the start_urls list, the scrapers, and the maximum crawling depth.

allowed_urls

A list of the URLs allowed to be crawled.

extractor

The extractor class. Defaults to XPathExtractor.

get_urls(html)[source]

Returns a list of the URLs found in the current HTML page.

login

The login data. A tuple of (url, login_dict). Example: ("http://www.mypage.com/login", {'user': 'myuser', 'pass': 'mypassword'})

max_depth

The maximum crawling recursion depth.

post_urls

The POST data for the URLs. A list of (url, data_dict) tuples. Example: ("http://www.mypage.com/post_url", {'page': '1', 'color': 'blue'})

scrapers

A list of scraper classes.

start()[source]

The crawler's run method: starts the crawling process.

start_urls

A list containing the start URLs for the crawler.
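
A minimal sketch of a user crawler tying these attributes together. The URLs, the wildcard pattern syntax, and the MyScraper import are illustrative assumptions, not part of the library:

    from crawley.crawlers.base import BaseCrawler
    from myproject.scrapers import MyScraper  # hypothetical user scraper


    class MyCrawler(BaseCrawler):

        # Entry points for the crawl
        start_urls = ["http://www.mypage.com/"]

        # Urls allowed to be crawled (wildcard syntax is an assumption)
        allowed_urls = ["http://www.mypage.com/%"]

        # Maximum crawling recursion depth
        max_depth = 2

        # Scraper classes applied to fetched pages
        scrapers = [MyScraper]

        # Optional login data: a (url, login_dict) tuple
        login = ("http://www.mypage.com/login",
                 {"user": "myuser", "pass": "mypassword"})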

class crawley.crawlers.base.CrawlerMeta(name, bases, dct)[source]

This metaclass adds the user’s crawlers to a list used by the CLI commands. Abstract base crawlers won’t be added.

Scrapers

Base classes for user scrapers.

class crawley.scrapers.BaseScraper[source]

User scrapers must inherit from this class, implement the scrape method, and define the URLs it may process.

get_urls(response)

Returns a list of the URLs found in the current HTML.

scrape(response)[source]

Override this method to define the data you want to extract.
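
A minimal sketch of a user scraper. The matching_urls attribute, the lxml-style response.html tree (as produced by the default XPathExtractor), and the MyEntity import are assumptions for illustration:

    from crawley.scrapers import BaseScraper
    from myproject.models import MyEntity  # hypothetical entity


    class MyScraper(BaseScraper):

        # Urls this scraper may process (attribute name is an assumption)
        matching_urls = ["%"]

        def scrape(self, response):
            # With the default XPathExtractor, response.html is assumed
            # to be an lxml tree supporting xpath queries
            titles = response.html.xpath("//title/text()")
            if titles:
                # Persist the scraped value through a user entity
                MyEntity(title=titles[0])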

Persistence

class crawley.persistance.databases.Entity(**kwargs)[source]

Base entity.

Every Crawley entity must inherit from this class.

class crawley.persistance.databases.UrlEntity(**kwargs)[source]

Entity intended to store URLs.

crawley.persistance.databases.setup(entities)[source]

Sets up the database from a list of user entities.
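
A sketch of a project's models.py defining such entities. It assumes that crawley.persistance re-exports Elixir-style Field and Unicode declarations, as in the project template; the entity and field names are illustrative:

    from crawley.persistance import Entity, Field, Unicode


    class PageData(Entity):
        # One column per scraped value
        title = Field(Unicode(255))
        description = Field(Unicode(1024))

The syncdb command reads these entities and passes them to setup() to build the database.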

Connector

Database connectors for the Elixir ORM.

class crawley.persistance.connectors.Connector(settings)[source]

A Connector represents an object that can provide the database connection to the Elixir framework.

get_connection_string()[source]

Returns the connection string for the corresponding database.

class crawley.persistance.connectors.HostConnector(settings)[source]

A connector for databases that require host, user, and password, e.g. PostgreSQL.

class crawley.persistance.connectors.MySqlConnector(settings)[source]

MySQL engine connector.

class crawley.persistance.connectors.OracleConnector(settings)[source]

Oracle engine connector.

class crawley.persistance.connectors.PostgreConnector(settings)[source]

PostgreSQL engine connector.

class crawley.persistance.connectors.SimpleConnector(settings)[source]

A simple connector for databases without host and user, e.g. SQLite.

class crawley.persistance.connectors.SqliteConnector(settings)[source]

SQLite3 engine connector.
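
The connectors above build their connection strings from the project's database settings. A sketch of that block of settings.py, assuming the setting names generated by the startproject template:

    # Database settings (names assumed from the project template)
    DATABASE_ENGINE = 'sqlite'      # sqlite, mysql, postgres or oracle
    DATABASE_NAME = 'mydatabase'
    DATABASE_USER = ''              # used by HostConnector engines
    DATABASE_PASSWORD = ''
    DATABASE_HOST = ''
    DATABASE_PORT = ''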

Extractors

Data Extractors classes

class crawley.extractors.PyQueryExtractor[source]

Extractor using PyQuery (a jQuery-like library for Python).

class crawley.extractors.RawExtractor[source]

Returns the raw HTML data so you can scrape it with your favourite Python tool.

class crawley.extractors.XPathExtractor[source]

Extractor using XPath.
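
A sketch showing how a crawler selects its extractor class. XPathExtractor is the default, so overriding the attribute is only needed for PyQuery or raw HTML; the crawler name is illustrative:

    from crawley.crawlers.base import BaseCrawler
    from crawley.extractors import PyQueryExtractor


    class MyPyQueryCrawler(BaseCrawler):
        # Parse pages with PyQuery instead of the default XPathExtractor
        extractor = PyQueryExtractor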

Utils

Utilities module

crawley.utils.url_matcher(url, pattern)[source]

Returns True if the URL matches the given pattern.
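
An illustrative call, assuming the same wildcard-style patterns used in a crawler's allowed_urls:

    from crawley.utils import url_matcher

    # True if the url matches the pattern (wildcard syntax assumed)
    url_matcher("http://www.mypage.com/products/1",
                "http://www.mypage.com/%")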

Manager

crawley.manager.manage()[source]

Entry point invoked when the crawley command is used from the command line.

crawley.manager.run_cmd(args)[source]

Runs a Crawley command.

Commands

Command-line tools.

class crawley.manager.commands.startproject.StartProjectCommand(args=None, project_type=None, project_name=None)[source]

Starts a new crawley project.

Copies the files inside conf/project_template to generate a new project.

class crawley.manager.commands.run.RunCommand(args=None, settings=None)[source]

Runs the user's crawlers.

Reads the crawlers.py file to obtain the user's crawler classes and then runs them.

class crawley.manager.commands.syncdb.SyncDbCommand(args=None, settings=None)[source]

Builds up the database.

Reads the user's models.py file and generates a database from it.

class crawley.manager.commands.shell.ShellCommand(args)[source]

Shows a URL's data in a console, as the XPathExtractor sees it, so users can scrape the data interactively.
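
A typical command-line session, assuming each command class above maps to a crawley subcommand of the same name:

    $ crawley startproject myproject
    $ cd myproject
    $ crawley syncdb                # build the database from models.py
    $ crawley run                   # run the crawlers in crawlers.py
    $ crawley shell http://www.mypage.com/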
