wiki Package

EarwigBot: Wiki Toolset

This is a collection of classes and functions for reading from and writing to Wikipedia and other wiki sites. It has no connection to python-wikitools by Mr.Z-man beyond a similar purpose; the two share no code.

Import the toolset directly with from earwigbot import wiki. If using the built-in integration with the rest of the bot, Bot objects contain a wiki attribute, which is a SitesDB object tied to the sites.db file located in the same directory as config.yml. That object has the principal methods get_site(), add_site(), and remove_site() that should handle all of your Site (and thus, Page, Category, and User) needs.

category Module

class earwigbot.wiki.category.Category(site, title, follow_redirects=False, pageid=None, logger=None)[source]

EarwigBot: Wiki Toolset: Category

Represents a category on a given Site. Category is a subclass of Page that provides additional methods, but Page’s own methods should work fine on Category objects. site.get_page() will return a Category instead of a Page if the given title is in the category namespace; get_category() is shorthand, accepting category names without the namespace prefix.

Attributes:

  • size: the total number of members in the category
  • pages: the number of pages in the category
  • files: the number of files in the category
  • subcats: the number of subcategories in the category

Public methods:

files[source]

The number of files in the category.

This will use either the API or SQL depending on which are enabled and the amount of lag on each. This is handled by site.delegate().

get_members(limit=None, follow_redirects=None)[source]

Iterate over Pages in the category.

If limit is given, we will provide this many pages, or less if the category is smaller. By default, limit is None, meaning we will keep iterating over members until the category is exhausted. follow_redirects is passed directly to site.get_page(); it defaults to None, which will use the value passed to our __init__().

This will use either the API or SQL depending on which are enabled and the amount of lag on each. This is handled by site.delegate().

Note

Be careful when iterating over very large categories with no limit. If using the API, at best, you will make one query per 5000 pages, which can add up significantly for categories with hundreds of thousands of members. As for SQL, note that all page titles are stored internally as soon as the query is made, so the site-wide SQL lock can be freed and unrelated queries can be made without requiring a separate connection to be opened. This is generally not an issue unless your category’s size approaches several hundred thousand, in which case the sheer number of titles in memory becomes problematic.
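As a standalone sketch of the limit semantics (not EarwigBot code; the member list here is a plain Python list standing in for category members), iteration stops after limit items or when the category runs out, whichever comes first:

```python
from itertools import islice

def get_members(members, limit=None):
    """Yield members, stopping after `limit` items if one is given."""
    iterator = iter(members)
    if limit is None:
        yield from iterator  # exhaust the whole category
    else:
        yield from islice(iterator, limit)  # at most `limit` members

small_cat = ["Page:A", "Page:B"]
print(list(get_members(small_cat, limit=5)))  # ['Page:A', 'Page:B'] -- fewer than the limit
print(list(get_members(["P%d" % i for i in range(10)], limit=3)))  # ['P0', 'P1', 'P2']
```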

pages[source]

The number of pages in the category.

This will use either the API or SQL depending on which are enabled and the amount of lag on each. This is handled by site.delegate().

size[source]

The total number of members in the category.

Includes pages, files, and subcats. Equal to pages + files + subcats. This will use either the API or SQL depending on which are enabled and the amount of lag on each. This is handled by site.delegate().

subcats[source]

The number of subcategories in the category.

This will use either the API or SQL depending on which are enabled and the amount of lag on each. This is handled by site.delegate().

constants Module

EarwigBot: Wiki Toolset: Constants

This module defines some useful constants:

  • USER_AGENT: our default User Agent when making API queries
  • NS_*: default namespace IDs for easy lookup

Import directly with from earwigbot.wiki import constants or from earwigbot.wiki.constants import *. These are also available from earwigbot.wiki directly (e.g. earwigbot.wiki.USER_AGENT).

page Module

class earwigbot.wiki.page.Page(site, title, follow_redirects=False, pageid=None, logger=None)[source]

Bases: earwigbot.wiki.copyvios.CopyvioMixIn

EarwigBot: Wiki Toolset: Page

Represents a page on a given Site. Has methods for getting information about the page, getting page content, and so on. Category is a subclass of Page with additional methods.

Attributes:

  • site: the page’s corresponding Site object
  • title: the page’s title, or pagename
  • exists: whether or not the page exists
  • pageid: an integer ID representing the page
  • url: the page’s URL
  • namespace: the page’s namespace as an integer
  • lastrevid: the ID of the page’s most recent revision
  • protection: the page’s current protection status
  • is_talkpage: True if this is a talkpage, else False
  • is_redirect: True if this is a redirect, else False

Public methods:

  • reload(): forcibly reloads the page’s attributes
  • toggle_talk(): returns a content page’s talk page, or vice versa
  • get(): returns the page’s content
  • get_redirect_target(): returns the page’s destination if it is a redirect
  • get_creator(): returns a User object representing the first person to edit the page
  • parse(): parses the page content for templates, links, etc
  • edit(): replaces the page’s content or creates a new page
  • add_section(): adds a new section at the bottom of the page
  • check_exclusion(): checks whether or not we are allowed to edit the page, per {{bots}}/{{nobots}}
  • copyvio_check(): checks the page for copyright violations
  • copyvio_compare(): checks the page like copyvio_check(), but against a specific URL
PAGE_EXISTS = 3
PAGE_INVALID = 1
PAGE_MISSING = 2
PAGE_UNKNOWN = 0
add_section(text, title, minor=False, bot=True, force=False)[source]

Add a new section to the bottom of the page.

The arguments are the same as those for edit(), except that you provide a section title instead of an edit summary. Raised exceptions are likewise the same as edit()’s.

This should create the page if it does not already exist, with just the new section as content.

check_exclusion(username=None, optouts=None)[source]

Check whether or not we are allowed to edit the page.

Return True if we are allowed to edit this page, and False if we aren’t.

username is used to determine whether we are part of a specific list of allowed or disallowed bots (e.g. {{bots|allow=EarwigBot}} or {{bots|deny=FooBot,EarwigBot}}). It’s None by default, which will swipe our username from site.get_user().name.

optouts is a list of messages to consider this check as part of for the purpose of opt-out; it defaults to None, which ignores the parameter completely. For example, if optouts is ["nolicense"], we’ll return False on {{bots|optout=nolicense}} or {{bots|optout=all}}, but True on {{bots|optout=orfud,norationale,replaceable}}.
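The opt-out semantics can be illustrated with a small standalone function (a sketch of the behavior described above, not EarwigBot’s implementation; check_optout and its arguments are hypothetical names):

```python
def check_optout(optout_value, optouts):
    """Return False (we may not edit) if any of our `optouts` messages,
    or "all", appears in the template's optout= value."""
    if not optouts:
        return True  # optouts ignored entirely, as the docs describe
    opted_out = {item.strip().lower() for item in optout_value.split(",")}
    if "all" in opted_out:
        return False
    return not any(msg.lower() in opted_out for msg in optouts)

print(check_optout("nolicense", ["nolicense"]))                     # False
print(check_optout("all", ["nolicense"]))                           # False
print(check_optout("orfud,norationale,replaceable", ["nolicense"])) # True
```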

edit(text, summary, minor=False, bot=True, force=False)[source]

Replace the page’s content or create a new page.

text is the new page content, with summary as the edit summary. If minor is True, the edit will be marked as minor. If bot is True, the edit will be marked as a bot edit, but only if we actually have a bot flag.

Use force to push the new content even if there’s an edit conflict or the page was deleted/recreated between getting our edit token and editing our page. Be careful with this!

exists[source]

Whether or not the page exists.

This will be a number equal to one of self.PAGE_INVALID, self.PAGE_MISSING, or self.PAGE_EXISTS; compare it against those constants rather than relying on the raw value.

Makes an API query only if we haven’t already made one.

get()[source]

Return page content, which is cached if you try to call get again.

Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively.

get_creator()[source]

Return the User object for the first person to edit the page.

Makes an API query only if we haven’t already made one. Normally, we can get the creator along with everything else (except content) in _load_attributes(). However, due to a limitation in the API (can’t get the editor of one revision and the content of another at both ends of the history), if our other attributes were only loaded through get(), we’ll have to do another API query.

Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively.

get_redirect_target()[source]

If the page is a redirect, return its destination.

Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively. Raises RedirectError if the page is not a redirect.

is_redirect[source]

True if the page is a redirect, otherwise False.

Makes an API query only if we haven’t already made one.

We will return False even if the page does not exist or is invalid.

is_talkpage[source]

True if the page is a talkpage, otherwise False.

Like title(), this won’t do any API queries on its own. If the API was never queried for this page, we will attempt to determine whether it is a talkpage ourselves based on its namespace.

lastrevid[source]

The ID of the page’s most recent revision.

Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively.

namespace[source]

The page’s namespace ID (an integer).

Like title(), this won’t do any API queries on its own. If the API was never queried for this page, we will attempt to determine the namespace ourselves based on the title.

pageid[source]

An integer ID representing the page.

Makes an API query only if we haven’t already made one and the pageid parameter to __init__() was left as None, which should be true for all cases except when pages are returned by an SQL generator (like category.get_members()).

Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively.

parse()[source]

Parse the page content for templates, links, etc.

Actual parsing is handled by mwparserfromhell. Raises InvalidPageError or PageNotFoundError if the page name is invalid or the page does not exist, respectively.

protection[source]

The page’s current protection status.

Makes an API query only if we haven’t already made one.

Raises InvalidPageError if the page name is invalid. Won’t raise an error if the page is missing because those can still be create-protected.

reload()[source]

Forcibly reload the page’s attributes.

Emphasis on reload: this is only necessary if there is reason to believe they have changed.

site[source]

The page’s corresponding Site object.

title[source]

The page’s title, or “pagename”.

This won’t do any API queries on its own. Any other attributes or methods that do API queries will reload the title, however, like exists and get(), potentially “normalizing” it or following redirects if self._follow_redirects is True.

toggle_talk(follow_redirects=None)[source]

Return a content page’s talk page, or vice versa.

The title of the new page is determined by namespace logic, not API queries. We won’t make any API queries on our own.

If follow_redirects is anything other than None (the default), it will be passed to the new Page object’s __init__(). Otherwise, we’ll use the value passed to our own __init__().

Will raise InvalidPageError if we try to get the talk page of a special page (in the Special: or Media: namespaces), but we won’t raise an exception if our page is otherwise missing or invalid.
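The namespace logic behind toggling can be sketched in isolation, assuming the standard MediaWiki convention that subject namespaces have even IDs and their talk namespaces are one higher (the function name here is hypothetical):

```python
def toggle_talk_ns(ns_id):
    """Return the talk-namespace ID for a subject namespace, or vice
    versa, per the MediaWiki even/odd convention. Special (-1) and
    Media (-2) pages have no talk pages."""
    if ns_id < 0:
        raise ValueError("special pages have no talk pages")
    if ns_id % 2 == 0:
        return ns_id + 1  # subject -> talk
    return ns_id - 1      # talk -> subject

print(toggle_talk_ns(0))  # 1  (mainspace -> Talk:)
print(toggle_talk_ns(3))  # 2  (User talk: -> User:)
```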

url[source]

The page’s URL.

Like title(), this won’t do any API queries on its own. If the API was never queried for this page, we will attempt to determine the URL ourselves based on the title.
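Determining a URL from a title typically follows MediaWiki’s article-path substitution; this standalone sketch (the helper name and the "/wiki/$1" path are illustrative assumptions, not EarwigBot internals) shows the general idea:

```python
from urllib.parse import quote

def page_url(base_url, article_path, title):
    """Build a page URL MediaWiki-style: substitute the title (spaces
    become underscores, then percent-encoded) into the article path."""
    slug = quote(title.replace(" ", "_"), safe="/:()")
    return base_url + article_path.replace("$1", slug)

print(page_url("https://en.wikipedia.org", "/wiki/$1",
               "Python (programming language)"))
# https://en.wikipedia.org/wiki/Python_(programming_language)
```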

site Module

class earwigbot.wiki.site.Site(name=None, project=None, lang=None, base_url=None, article_path=None, script_path=None, sql=None, namespaces=None, login=(None, None), cookiejar=None, user_agent=None, use_https=True, assert_edit=None, maxlag=None, wait_between_queries=2, logger=None, search_config=None)[source]

EarwigBot: Wiki Toolset: Site

Represents a site, with support for API queries and for returning Page, User, and Category objects. The constructor takes a bunch of arguments, but you probably won’t need to call it directly; wiki.get_site() for returning Site instances, wiki.add_site() for adding new ones to our database, and wiki.remove_site() for removing old ones should suffice.

Attributes:

  • name: the site’s name (or “wikiid”), like "enwiki"
  • project: the site’s project name, like "wikipedia"
  • lang: the site’s language code, like "en"
  • domain: the site’s web domain, like "en.wikipedia.org"
  • url: the site’s URL, like "https://en.wikipedia.org"

Public methods:

  • api_query(): does an API query with the given kwargs as parameters
  • sql_query(): does an SQL query and yields its results
  • get_maxlag(): returns the internal database replication lag
  • get_replag(): returns the estimated external database replication lag
  • get_token(): returns a token for a data-modifying API action
  • namespace_id_to_name(): returns the name(s) associated with a namespace ID
  • namespace_name_to_id(): returns the ID associated with a namespace name
  • get_page(): returns a Page object for the given title
  • get_category(): returns a Category object for the given title
  • get_user(): returns a User object for the given username
  • delegate(): controls when the API or SQL is used

SERVICE_API = 1
SERVICE_SQL = 2
SPECIAL_TOKENS = ['deleteglobalaccount', 'patrol', 'rollback', 'setglobalaccountstatus', 'userrights', 'watch']
api_query(**kwargs)[source]

Do an API query with kwargs as the parameters.

This will first attempt to construct an API url from self._base_url and self._script_path. We need both of these, or else we’ll raise APIError. If self._base_url is protocol-relative (introduced in MediaWiki 1.18), we’ll choose HTTPS only if self._use_https is True, otherwise HTTP.

We’ll encode the given params, adding format=json along the way, as well as &assert= and &maxlag= based on self._assert_edit and _maxlag respectively. Additionally, we’ll sleep a bit if the last query was made fewer than self._wait_between_queries seconds ago. The request is made through self._opener, which has cookie support (self._cookiejar), a User-Agent (earwigbot.wiki.constants.USER_AGENT), and Accept-Encoding set to "gzip".

Assuming everything went well, we’ll gunzip the data (if compressed), load it as a JSON object, and return it.

If our request failed for some reason, we’ll raise APIError with details. If that reason was due to maxlag, we’ll sleep for a bit and then repeat the query until we exceed self._max_retries.

There is helpful MediaWiki API documentation at MediaWiki.org.
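The URL-construction part of the process can be sketched on its own (a simplified standalone illustration; the real method sends the request through self._opener, and the function name here is hypothetical):

```python
from urllib.parse import urlencode

def build_api_query(base_url, script_path, assert_edit=None, maxlag=None,
                    **kwargs):
    """Construct an api.php URL with format=json always set, plus
    assert= and maxlag= when configured, mirroring the docs above."""
    params = dict(kwargs, format="json")
    if assert_edit:
        params["assert"] = assert_edit
    if maxlag is not None:
        params["maxlag"] = maxlag
    return "%s%s/api.php?%s" % (base_url, script_path,
                                urlencode(sorted(params.items())))

print(build_api_query("https://en.wikipedia.org", "/w",
                      action="query", titles="Earwig", maxlag=5))
# https://en.wikipedia.org/w/api.php?action=query&format=json&maxlag=5&titles=Earwig
```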

delegate(services, args=None, kwargs=None)[source]

Delegate a task to either the API or SQL depending on conditions.

services should be a dictionary in which the key is the service name (self.SERVICE_API or self.SERVICE_SQL) and the value is the function to call for that service. All functions will be passed the same arguments: the tuple args and the dict kwargs, both empty by default. The service order is determined by _get_service_order().

Not every service needs an entry in the dictionary. Will raise NoServiceError if an appropriate service cannot be found.
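A minimal sketch of the delegation pattern (simplified: the real method derives the order from _get_service_order() based on lag, whereas this standalone version takes the order explicitly):

```python
SERVICE_API, SERVICE_SQL = 1, 2

def delegate(services, order, args=(), kwargs=None):
    """Call the first function in `order` that has an entry in
    `services`; raise if none is available (cf. NoServiceError)."""
    kwargs = kwargs or {}
    for service in order:
        if service in services:
            return services[service](*args, **kwargs)
    raise RuntimeError("no appropriate service found")

result = delegate(
    {SERVICE_SQL: lambda title: "sql:" + title},  # only SQL is offered
    order=[SERVICE_API, SERVICE_SQL],             # API preferred but absent
    args=("Earwig",),
)
print(result)  # sql:Earwig
```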

domain[source]

The Site’s web domain, like "en.wikipedia.org".

get_category(catname, follow_redirects=False, pageid=None)[source]

Return a Category object for the given category name.

catname should be given without a namespace prefix. This method is really just shorthand for get_page("Category:" + catname).

get_maxlag(showall=False)[source]

Return the internal database replication lag in seconds.

In a typical setup, this function returns the replication lag within the WMF’s cluster, not external replication lag affecting the Toolserver (see get_replag() for that). This is useful when combined with the maxlag API query parameter (added by config): queries will be halted and retried if the lag is too high, usually above five seconds.

With showall, will return a list of the lag for all servers in the cluster, not just the one with the highest lag.

get_page(title, follow_redirects=False, pageid=None)[source]

Return a Page object for the given title.

follow_redirects is passed directly to Page’s constructor. Also, this will return a Category object instead if the given title is in the category namespace. As Category is a subclass of Page, this should not cause problems.

Note that this doesn’t do any direct checks for existence or redirect-following: Page’s methods provide that.

get_replag()[source]

Return the estimated external database replication lag in seconds.

Requires SQL access. This function only makes sense on a replicated database (e.g. the Wikimedia Toolserver) and on a wiki that receives a large number of edits (ideally, at least one per second), or the result may be larger than expected, since it works by subtracting the current time from the timestamp of the latest recent changes event.

This may raise SQLError or one of oursql’s exceptions (oursql.ProgrammingError, oursql.InterfaceError, ...) if there were problems.

get_token(action=None, force=False)[source]

Return a token for a data-modifying API action.

In general, this will be a CSRF token, unless action is in a special list of non-CSRF tokens. Tokens are cached for the session (until _login() is called again); set force to True to force a new token to be fetched.

Raises APIError if there was an API issue.

get_user(username=None)[source]

Return a User object for the given username.

If username is left as None, then a User object representing the currently logged-in (or anonymous!) user is returned.

lang[source]

The Site’s language code, like "en" or "es".

name[source]

The Site’s name (or “wikiid” in the API), like "enwiki".

namespace_id_to_name(ns_id, all=False)[source]

Given a namespace ID, returns associated namespace names.

If all is False (default), we’ll return the first name in the list, which is usually the localized version. Otherwise, we’ll return the entire list, which includes the canonical name. For example, on enwiki this returns u"Wikipedia" if ns_id is 4 and all is False, and [u"Wikipedia", u"Project", u"WP"] if all is True.

Raises NamespaceNotFoundError if the ID is not found.
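A standalone sketch of the lookup (the namespace table here is a hand-built sample of enwiki data, not what the Site actually stores, and LookupError stands in for NamespaceNotFoundError):

```python
NAMESPACES = {4: ["Wikipedia", "Project", "WP"]}  # sample enwiki data

def namespace_id_to_name(ns_id, all=False):
    """Return the first (localized) name for an ID, or the whole list.

    `all` shadows the builtin to mirror the documented signature."""
    try:
        names = NAMESPACES[ns_id]
    except KeyError:
        raise LookupError("namespace ID %d not found" % ns_id)
    return names if all else names[0]

print(namespace_id_to_name(4))            # Wikipedia
print(namespace_id_to_name(4, all=True))  # ['Wikipedia', 'Project', 'WP']
```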

namespace_name_to_id(name)[source]

Given a namespace name, returns the associated ID.

Like namespace_id_to_name(), but reversed. Case is ignored, because namespaces are assumed to be case-insensitive.

Raises NamespaceNotFoundError if the name is not found.
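The reverse, case-insensitive lookup can be sketched the same way (again with a hand-built sample table and LookupError standing in for NamespaceNotFoundError):

```python
NAMESPACES = {4: ["Wikipedia", "Project", "WP"]}  # sample enwiki data

def namespace_name_to_id(name):
    """Reverse lookup across all known names, ignoring case."""
    lowered = name.lower()
    for ns_id, names in NAMESPACES.items():
        if lowered in (n.lower() for n in names):
            return ns_id
    raise LookupError("namespace %r not found" % name)

print(namespace_name_to_id("wp"))         # 4
print(namespace_name_to_id("WIKIPEDIA"))  # 4
```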

project[source]

The Site’s project name in lowercase, like "wikipedia".

sql_query(query, params=(), plain_query=False, dict_cursor=False, cursor_class=None, show_table=False, buffsize=1024)[source]

Do an SQL query and yield its results.

If plain_query is True, we will force an unparameterized query. Specifying both params and plain_query will cause an error. If dict_cursor is True, we will use oursql.DictCursor as our cursor, otherwise the default oursql.Cursor. If cursor_class is given, it will override this option. If show_table is True, the name of the table will be prepended to the name of the column. This will mainly affect a DictCursor.

buffsize is the size of each memory-buffered group of results, to reduce the number of conversations with the database; it is passed to cursor.fetchmany(). If set to 0, all results will be buffered in memory at once (this uses fetchall()). If set to 1, it is equivalent to using fetchone().

Example usage:

>>> query = "SELECT user_id, user_registration FROM user WHERE user_name = ?"
>>> params = ("The Earwig",)
>>> result1 = site.sql_query(query, params)
>>> result2 = site.sql_query(query, params, dict_cursor=True)
>>> for row in result1: print row
(7418060L, '20080703215134')
>>> for row in result2: print row
{'user_id': 7418060L, 'user_registration': '20080703215134'}

This may raise SQLError or one of oursql’s exceptions (oursql.ProgrammingError, oursql.InterfaceError, ...) if there were problems with the query.

See _sql_connect() for information on how a connection is acquired. Also relevant is oursql’s documentation for details on that package.

url[source]

The Site’s full base URL, like "https://en.wikipedia.org".

sitesdb Module

class earwigbot.wiki.sitesdb.SitesDB(bot)[source]

EarwigBot: Wiki Toolset: Sites Database Manager

This class controls the sites.db file, which stores information about all wiki sites known to the bot. Three public methods act as bridges between the bot’s config files and Site objects: get_site(), add_site(), and remove_site().

There’s usually no need to use this class directly. All public methods here are available as bot.wiki.get_site(), bot.wiki.add_site(), and bot.wiki.remove_site(), which use a sites.db file located in the same directory as our config.yml file. Lower-level access can be achieved by importing the manager class (from earwigbot.wiki import SitesDB).

add_site(project=None, lang=None, base_url=None, script_path='/w', sql=None)[source]

Add a site to the sitesdb so it can be retrieved with get_site().

If only a project and a lang are given, we’ll guess the base_url as "//{lang}.{project}.org" (which is protocol-relative, becoming "https" if useHTTPS is True in config, otherwise "http"). If this is wrong, provide the correct base_url as an argument (in which case project and lang are ignored). Most wikis use "/w" as the script path (meaning the API is located at "{base_url}{script_path}/api.php" -> "//{lang}.{project}.org/w/api.php"), so this is the default. If your wiki is different, provide the script_path as an argument. SQL connection settings are guessed automatically using config’s template value. If this is wrong or unspecified, provide a dict of kwargs as sql; Site will pass it to oursql.connect(**sql), allowing you to make queries with site.sql_query().

Returns True if the site was added successfully or False if the site is already in our sitesdb (this can be done purposefully to update old site info). Raises SiteNotFoundError if not enough information has been provided to identify the site (e.g. a project but not a lang).
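The base_url guess and scheme resolution described above amount to a simple string substitution; a standalone sketch (the function name is hypothetical):

```python
def guess_base_url(project, lang, use_https=True):
    """Guess a protocol-relative base URL from project and lang, then
    resolve the scheme per the useHTTPS config flag."""
    base = "//{lang}.{project}.org".format(lang=lang, project=project)
    return ("https:" if use_https else "http:") + base

print(guess_base_url("wikipedia", "en"))                   # https://en.wikipedia.org
print(guess_base_url("wiktionary", "fr", use_https=False)) # http://fr.wiktionary.org
```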

get_site(name=None, project=None, lang=None)[source]

Return a Site instance based on information from the sitesdb.

With no arguments, return the default site as specified by our config file. This is config.wiki["defaultSite"].

With name specified, return the site with that name. This is equivalent to the site’s wikiid in the API, like enwiki.

With project and lang specified, return the site whose project and language match these values. If there are multiple sites with the same values (unlikely), this is not a reliable way of loading a site. Call the function with an explicit name in that case.

We will attempt to login to the site automatically using config.wiki["username"] and config.wiki["password"] if both are defined.

Specifying a project without a lang, or a lang without a project, will raise TypeError. If all three args are specified, name will be tried first, then project and lang if name doesn’t work. If a site cannot be found in the sitesdb, SiteNotFoundError will be raised. An empty sitesdb will be created if none is found.

remove_site(name=None, project=None, lang=None)[source]

Remove a site from the sitesdb.

Returns True if the site was removed successfully or False if the site was not in our sitesdb originally. If all three args (name, project, and lang) are given, we’ll first try name and then try the latter two if name wasn’t found in the database. Raises TypeError if a project was given but not a language, or vice versa. Will create an empty sitesdb if none was found.

user Module

class earwigbot.wiki.user.User(site, name, logger=None)[source]

EarwigBot: Wiki Toolset: User

Represents a user on a given Site. Has methods for getting a bunch of information about the user, such as editcount and user rights, methods for returning the user’s userpage and talkpage, etc.

Attributes:

  • site: the user’s corresponding Site object
  • name: the user’s username
  • exists: True if the user exists, else False
  • userid: an integer ID representing the user
  • blockinfo: information about any current blocks on the user
  • groups: a list of the user’s groups
  • rights: a list of the user’s rights
  • editcount: the number of edits made by the user
  • registration: the time the user registered
  • emailable: True if you can email the user, or False
  • gender: the user’s gender (“male”/”female”/”unknown”)
  • is_ip: True if this is an IP address, or False

Public methods:

  • reload(): forcibly reloads the user’s attributes
  • get_userpage(): returns a Page object representing the user’s userpage
  • get_talkpage(): returns a Page object representing the user’s talkpage
blockinfo[source]

Information about any current blocks on the user.

If the user is not blocked, returns False. If they are, returns a dict with three keys: "by" is the blocker’s username, "reason" is the reason why they were blocked, and "expiry" is when the block expires.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

editcount[source]

Returns the number of edits made by the user.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

emailable[source]

True if the user can be emailed, or False if they cannot.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

exists[source]

True if the user exists, or False if they do not.

Makes an API query only if we haven’t made one already.

gender[source]

The user’s gender.

Can return "male", "female", or "unknown" if the user did not specify it.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

get_talkpage()[source]

Return a Page object representing the user’s talkpage.

No checks are made to see if it exists or not. Proper site namespace conventions are followed.

get_userpage()[source]

Return a Page object representing the user’s userpage.

No checks are made to see if it exists or not. Proper site namespace conventions are followed.

groups[source]

A list of groups this user is in, including "*".

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

is_ip[source]

True if the user is an IP address, or False otherwise.

This tests for IPv4 and IPv6 using socket.inet_pton() on the username. No API queries are made.
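The socket.inet_pton() test described above is easy to demonstrate standalone (a sketch of the technique, not EarwigBot’s exact code):

```python
import socket

def is_ip(username):
    """Return True if `username` parses as an IPv4 or IPv6 address."""
    for family in (socket.AF_INET, socket.AF_INET6):
        try:
            socket.inet_pton(family, username)
            return True
        except OSError:  # not a valid address in this family
            continue
    return False

print(is_ip("127.0.0.1"))   # True
print(is_ip("::1"))         # True
print(is_ip("The Earwig"))  # False
```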

name[source]

The user’s username.

This will never make an API query on its own, but if one has already been made by the time this is retrieved, the username may have been “normalized” from the original input to the constructor, converted into a Unicode object, with underscores removed, etc.

registration[source]

The time the user registered as a time.struct_time.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.
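For reference, a MediaWiki API timestamp parses into a time.struct_time like so (the ISO 8601 format string is an assumption about the API’s output; this is not EarwigBot code):

```python
import time

stamp = "2008-07-03T21:51:34Z"  # typical API timestamp format
registered = time.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
print(registered.tm_year, registered.tm_mon, registered.tm_mday)  # 2008 7 3
```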

reload()[source]

Forcibly reload the user’s attributes.

Emphasis on reload: this is only necessary if there is reason to believe they have changed.

rights[source]

A list of this user’s rights.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.

site[source]

The user’s corresponding Site object.

userid[source]

An integer ID used by MediaWiki to represent the user.

Raises UserNotFoundError if the user does not exist. Makes an API query only if we haven’t made one already.
