mw.xml_dump – XML dump processing

This module is a collection of utilities for efficiently processing MediaWiki's XML database dumps. It addresses two important concerns: performance and the complexity of streaming XML parsing.

Performance is a serious concern when processing large XML database dumps. Regrettably, the Global Interpreter Lock prevents us from running threads on multiple CPUs. This library provides map(), a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work over multiple CPUs.
Streaming XML parsing is gross. XML dumps are (1) some site metadata, (2) a collection of pages that contain (3) collections of revisions. This module allows you to think about dump files in this way and ignore the fact that you're streaming XML. An Iterator contains site metadata and an iterator of Pages. A Page contains page metadata and an iterator of Revisions. A Revision contains revision metadata, including a Contributor (if a contributor was specified in the XML).
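For contrast, here is a minimal sketch of the raw streaming parsing that this module hides, using the standard library's iterparse over a small export-like document (the XML below is illustrative, not a real dump):

```python
import io
import xml.etree.ElementTree as ET

# A tiny export-like document (illustrative only, not a real dump)
XML = """<mediawiki>
  <page>
    <title>Foo</title>
    <revision><id>1</id></revision>
    <revision><id>2</id></revision>
  </page>
</mediawiki>"""

titles, revision_ids = [], []
for _event, elem in ET.iterparse(io.StringIO(XML), events=("end",)):
    if elem.tag == "title":
        titles.append(elem.text)
    elif elem.tag == "revision":
        revision_ids.append(int(elem.find("id").text))
        elem.clear()  # release the finished subtree to keep memory bounded
```

The Iterator/Page/Revision classes wrap exactly this kind of event loop so that user code never has to touch the event stream directly.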

mw.xml_dump.map(paths, process_dump, handle_error=re_raise, threads=4, output_buffer=100)

Maps a function across a set of dump files and returns an (order not guaranteed) iterator over the output.

The process_dump function must return an iterable object (such as a generator). If your process_dump function does not need to produce output, make it return an empty iterable upon completion (like an empty list).
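Conceptually (ignoring the process pool), map() flattens the iterables returned by each process_dump call; a file whose call returns an empty iterable simply contributes nothing. A single-process sketch of that contract follows; the function names here are made up for illustration, and the real process_dump also receives an Iterator, not just the path:

```python
from itertools import chain

def serial_map(paths, process_dump):
    # Single-process stand-in for xml_dump.map(): flatten the iterable
    # returned by each call; empty iterables contribute no output.
    return chain.from_iterable(process_dump(path) for path in paths)

def titles_only(path):
    if path.endswith("2.xml"):
        return []  # nothing to report for this file: return an empty iterable
    return [(path, "Example page")]

out = list(serial_map(["dump.xml", "dump2.xml"], titles_only))
```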

paths : iter( str )

a list of paths to dump files to process

process_dump : function( dump : Iterator, path : str)

a function to run on every Iterator

threads : int

the number of individual processing threads to spool up

output_buffer : int

the maximum number of output values to buffer.


Returns an iterator over the values yielded by calls to process_dump().

from mw import xml_dump

files = ["examples/dump.xml", "examples/dump2.xml"]

def page_info(dump, path):
    for page in dump:
        yield page.id, page.namespace, page.title

for page_id, page_namespace, page_title in xml_dump.map(files, page_info):
    print(" ".join([str(page_id), str(page_namespace), page_title]))


class mw.xml_dump.Iterator(site_name=None, base=None, generator=None, case=None, namespaces=None, pages=None)

XML Dump Iterator. Dump file meta data and a Page iterator. Instances of this class can be called as an iterator directly. E.g.:

from mw.xml_dump import Iterator

# Construct dump file iterator
dump = Iterator.from_file(open("example/dump.xml"))

# Iterate through pages
for page in dump:
    # Iterate through a page's revisions
    for revision in page:
        print(revision.id)

site_name : str | None (if not specified in the XML)

The name of the site


base : str | None (if not specified in the XML)

The base URL of the site


generator : str | None (if not specified in the XML)

The software that generated the dump (e.g. "MediaWiki 1.22")


case : str | None (if not specified in the XML)

The case-sensitivity rule for page titles (e.g. "first-letter")


namespaces : list( mw.Namespace ) | None (if not specified in the XML)

The namespaces defined for the site

class mw.xml_dump.Page(id, title, namespace, redirect, restrictions, revisions=None)

Page meta data and a Revision iterator. Instances of this class can be called as iterators directly. E.g.

page = mw.xml_dump.Page( ... )

for revision in page:
    print("{0} {1}".format(revision.id, page.id))

id : int

Page ID


title : str

Page title (namespace excluded)


namespace : int

Namespace ID


redirect : Redirect | None

The page's redirect, if the page is currently a redirect


restrictions : list( str )

A list of page editing restrictions (empty unless restrictions are specified)
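As a small example of using these attributes, filtering a dump down to main-namespace pages might look like the following; the SimpleNamespace objects below are stand-ins for real Page instances, used here only for illustration:

```python
from types import SimpleNamespace

# Stand-ins for Page objects exposing the documented attributes
pages = [
    SimpleNamespace(id=1, namespace=0, title="Astronomy"),
    SimpleNamespace(id=2, namespace=1, title="Astronomy"),  # a talk page
]

# Keep only pages in the main namespace (namespace ID 0)
articles = [(p.id, p.title) for p in pages if p.namespace == 0]
```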

class mw.xml_dump.Redirect(*args, **kwargs)

Represents a redirect tag.

Full page name that this page is redirected to : str
class mw.xml_dump.Revision(id, timestamp, contributor=None, minor=None, comment=None, text=None, bytes=None, sha1=None, parent_id=None, model=None, format=None, beginningofpage=False)

Revision meta data.


id : int

Revision ID


timestamp : mw.Timestamp

Revision timestamp


contributor : Contributor | None

Contributor metadata


minor : bool

Is the revision a minor change?


comment : Comment (behaves like str, with additional members)

Comment left with the revision


text : Text (behaves like str, with additional members)

Content of the revision's text


bytes : str

Number of bytes of content


sha1 : str

sha1 hash of the content


parent_id : int | None

Revision ID of the preceding revision


model : str

The content model of the revision (e.g. "wikitext")


format : str

The content format (MIME type) of the revision (e.g. "text/x-wiki")


beginningofpage : bool

Is this the first revision of a page? Used to identify the first revision of a page when using Wikihadoop revision pairs; otherwise, always set to False. Do not expect to use this when processing an XML dump directly.
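As a small example of working with these attributes, the size change each revision introduces can be computed from consecutive bytes values. The objects below are illustrative stand-ins for Revision instances; note that a real dump may report bytes as a string, so convert with int() first:

```python
from types import SimpleNamespace

# Stand-ins for Revision objects with the documented id/bytes attributes
revisions = [
    SimpleNamespace(id=10, bytes=120),
    SimpleNamespace(id=11, bytes=180),
    SimpleNamespace(id=12, bytes=150),
]

# Bytes added (or removed, if negative) by each revision vs. its predecessor
deltas = [(b.id, b.bytes - a.bytes) for a, b in zip(revisions, revisions[1:])]
```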

class mw.xml_dump.Comment

A revision comment. This class behaves identically to str except that it takes and stores an additional parameter recording whether the comment was deleted or not.

>>> from mw.xml_dump import Comment
>>> c = Comment("foo")
>>> c == "foo"
True
>>> c.deleted
False

deleted : bool

Was the comment deleted?
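The "str plus a flag" behavior can be imitated with a small str subclass. This is an illustrative re-implementation of the pattern, not the library's own code:

```python
class FlaggedStr(str):
    """Sketch of the Comment pattern: a str subclass carrying a deleted flag."""

    def __new__(cls, content="", deleted=False):
        # str is immutable, so the content must be set in __new__
        inst = super().__new__(cls, content)
        inst.deleted = deleted
        return inst

c = FlaggedStr("fixed a typo")
```

Because the subclass is still a str, all ordinary string operations (comparison, upper(), slicing) keep working while the extra attribute rides along.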
class mw.xml_dump.Contributor(id, user_text)

Contributor meta data.


id : int | None (if not specified in the XML)

User ID of the user, if the contributor was signed into an account while making the contribution; None when the contributor was not signed in.


user_text : str | None (if not specified in the XML)

User name or IP address. If the user was logged in, this reflects the user's account name; if not, this is usually the contributor's IPv4 or IPv6 address as recorded in the XML.
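A common use of these attributes is distinguishing anonymous from registered contributors. A short sketch follows; the helper name and sample values are made up for illustration:

```python
def contributor_kind(user_id, user_text):
    # Per the attribute docs above: a None id means the contributor was not
    # signed in, and user_text is then usually an IPv4/IPv6 address.
    return "anonymous" if user_id is None else "registered"

kinds = [
    contributor_kind(12345, "ExampleUser"),  # registered account
    contributor_kind(None, "192.0.2.1"),     # IP (anonymous) edit
]
```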

class mw.xml_dump.Text

Revision text content. This class behaves identically to str except that it takes and stores an additional set of parameters.

deleted : bool

Was the text deleted?


xml_space : str

What to do with extra whitespace (the xml:space attribute, typically "preserve")


TODO: ??? : int | None

TODO: ??? : int | None
>>> from mw.xml_dump import Text
>>> t = Text("foo")
>>> t == "foo"
True
>>> t.deleted
False
>>> t.xml_space
'preserve'


class mw.xml_dump.errors.FileTypeError

Thrown when an XML dump file is not of an expected type.

class mw.xml_dump.errors.MalformedXML

Thrown when an XML dump file is not formatted as expected.
