mw.xml_dump – XML dump processing

This module is a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: performance and the complexity of streaming XML parsing.

Performance
Performance is a serious concern when processing large XML database dumps. Unfortunately, the Global Interpreter Lock prevents Python threads from running on multiple CPUs at once. This library provides map(), a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work over multiple CPUs.
Complexity
Streaming XML parsing is gross. XML dumps are (1) some site meta data, (2) a collection of pages that contain (3) collections of revisions. The module allows you to think about dump files in this way and ignore the fact that you're streaming XML. An Iterator contains site meta data and an iterator of Pages. A Page contains page meta data and an iterator of Revisions. A Revision contains revision meta data, including a Contributor (if a contributor was specified in the XML).
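
To give a sense of what the module hides, here is a minimal sketch of streaming the same pages/revisions structure with only the standard library. It is not how this module is implemented. SAMPLE is a tiny hand-written document (not a real export), revision_ids_by_page is a hypothetical helper name, and real dumps carry an XML namespace on every tag that this sketch ignores.

```python
import io
import xml.etree.ElementTree as ET

# A tiny hand-written document loosely shaped like a dump (assumption: no XML namespace).
SAMPLE = b"""<mediawiki>
  <siteinfo><sitename>Example</sitename></siteinfo>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""


def revision_ids_by_page(stream):
    """Yield (page title, [revision ids]) while streaming through the XML."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            rev_ids = [int(rev.findtext("id")) for rev in elem.findall("revision")]
            yield title, rev_ids
            elem.clear()  # release the subtree that was just processed


result = list(revision_ids_by_page(io.BytesIO(SAMPLE)))
print(result)
```

Note that this sketch buffers one whole page at a time, which is simpler than streaming revision by revision but uses more memory on pages with long histories.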

The map() function

mw.xml_dump.map(paths, process_dump, handle_error=re_raise, threads=4, output_buffer=100)

Maps a function across a set of dump files and returns an iterator over the output. The order of the output is not guaranteed.

The process_dump function must return an iterable object (such as a generator). If your process_dump function does not need to produce output, make it return an empty iterable upon completion (like an empty list).
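
As an illustration of the "no output" case, the sketch below writes page titles to a side file and returns an empty list. The names write_titles and OUT_DIR are hypothetical, and fake_dump is a stand-in for a real Iterator (any iterable of objects with a title attribute), so the block runs without a dump file. Writing to a file is used here because work done in a separate process cannot simply append to a list held by the caller.

```python
import os
import tempfile
from types import SimpleNamespace

OUT_DIR = tempfile.mkdtemp()  # hypothetical output location for this sketch


def write_titles(dump, path):
    """Write one page title per line to a side file; produce no output values."""
    out_path = os.path.join(OUT_DIR, os.path.basename(path) + ".titles.txt")
    with open(out_path, "w", encoding="utf-8") as out:
        for page in dump:
            out.write(page.title + "\n")
    return []  # empty iterable: nothing is passed back to the caller


# Stand-in for a dump: any iterable of objects with a .title attribute.
fake_dump = [SimpleNamespace(title="Foo"), SimpleNamespace(title="Bar")]
output = list(write_titles(fake_dump, "examples/dump.xml"))
```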

Parameters:
paths : iter( str )

a list of paths to dump files to process

process_dump : function( dump : Iterator, path : str)

a function to run on every Iterator

threads : int

the number of worker processes to start (despite the parameter name, the work is distributed using multiprocessing)

output_buffer : int

the maximum number of output values to buffer.

Returns:

An iterator over values yielded by calls to process_dump()

Example:
from mw import xml_dump

files = ["examples/dump.xml", "examples/dump2.xml"]

def page_info(dump, path):
    for page in dump:
        yield page.id, page.namespace, page.title

for page_id, page_namespace, page_title in xml_dump.map(files, page_info):
    print(" ".join([str(page_id), str(page_namespace), page_title]))
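
Because the result is an ordinary iterator over whatever process_dump yields, it can be aggregated with standard tools. The sketch below counts pages per namespace. The results variable is a hand-written stand-in for xml_dump.map(files, page_info), so the block runs without dump files, and the tuples in it are made up for illustration.

```python
from collections import Counter

# Stand-in for xml_dump.map(files, page_info): (page_id, namespace, title) tuples.
results = iter([(1, 0, "Foo"), (2, 1, "Foo"), (3, 0, "Bar")])

# Count how many pages fall into each namespace ID.
pages_per_namespace = Counter(namespace for _, namespace, _ in results)

for namespace, count in sorted(pages_per_namespace.items()):
    print(namespace, count)
```

Since the output order is not guaranteed, order-independent aggregations like this one are a natural fit.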

Iteration

class mw.xml_dump.Iterator(site_name=None, base=None, generator=None, case=None, namespaces=None, pages=None)

XML dump iterator. Holds dump file meta data and a Page iterator. Instances of this class can be iterated over directly. E.g.:

from mw.xml_dump import Iterator

# Construct dump file iterator
dump = Iterator.from_file(open("example/dump.xml"))

# Iterate through pages
for page in dump:
    # Iterate through a page's revisions
    for revision in page:
        print(revision.id)

site_name

The name of the site. : str | None (if not specified in the XML)

base

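The base URL of the wiki, as given by the base element of the dump's site info (typically the URL of the main page) : str | None (if not specified in the XML)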

generator

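The software that generated the dump, as given by the generator element of the dump's site info (typically a MediaWiki version string) : str | None (if not specified in the XML)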

case

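The title capitalization rule of the wiki, as given by the case element of the dump's site info (for example, "first-letter") : str | None (if not specified in the XML)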

namespaces

A list of mw.Namespace | None (if not specified in the XML)

class mw.xml_dump.Page(id, title, namespace, redirect, restrictions, revisions=None)

Page meta data and a Revision iterator. Instances of this class can be iterated over directly. E.g.:

page = mw.xml_dump.Page( ... )

for revision in page:
    print("{0} {1}".format(revision.id, page.id))

id

Page ID : int

title

Page title (namespace excluded) : str

namespace

Namespace ID : int

redirect

Is the page currently a redirect? : Redirect | None

restrictions

A list of page editing restrictions (empty unless restrictions are specified) : list( str )
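
The attributes above are enough to filter pages while iterating. The following sketch keeps only non-redirect pages in namespace 0 (the main namespace on MediaWiki wikis). The helper name article_pages is hypothetical, and fake_dump uses simple stand-in objects rather than real Page instances so that the block runs on its own.

```python
from types import SimpleNamespace


def article_pages(dump):
    """Yield pages in namespace 0 that are not redirects."""
    for page in dump:
        if page.namespace == 0 and page.redirect is None:
            yield page


# Stand-ins for Page objects, carrying only the attributes used above.
fake_dump = [
    SimpleNamespace(id=1, title="Foo", namespace=0, redirect=None),
    SimpleNamespace(id=2, title="Foo", namespace=1, redirect=None),
    SimpleNamespace(id=3, title="Bar", namespace=0, redirect=SimpleNamespace(title="Foo")),
]

article_ids = [page.id for page in article_pages(fake_dump)]
print(article_ids)
```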

class mw.xml_dump.Redirect(*args, **kwargs)

Represents a redirect tag.

title

Full page name that this page is redirected to : str

class mw.xml_dump.Revision(id, timestamp, contributor=None, minor=None, comment=None, text=None, bytes=None, sha1=None, parent_id=None, model=None, format=None, beginningofpage=False)

Revision meta data.

id

Revision ID : int

timestamp

Revision timestamp : mw.Timestamp

contributor

Contributor meta data : Contributor | None

minor

Is revision a minor change? : bool

comment

Comment left with revision : Comment (behaves like str, with additional members)

text

Text content of the revision : Text (behaves like str, with additional members)

bytes

Number of bytes of content : str

sha1

sha1 hash of the content : str

parent_id

Revision ID of preceding revision : int | None

model

Content model of the revision (for example, "wikitext") : str

format

Serialization format of the content (for example, "text/x-wiki") : str

beginningofpage

Is the first revision of a page : bool

Used to identify the first revision of a page when using Wikihadoop revision pairs. Otherwise, it is always set to False. Do not expect to use this when processing an XML dump directly.

class mw.xml_dump.Comment

A revision comment. This class behaves identically to str except that it takes and stores an additional parameter recording whether the comment was deleted or not.

>>> from mw.xml_dump import Comment
>>>
>>> c = Comment("foo")
>>> c == "foo"
True
>>> c.deleted
False

deleted

Was the comment deleted? : bool

class mw.xml_dump.Contributor(id, user_text)

Contributor meta data.

id

User ID : int | None (if not specified in the XML)

User ID of the contributor if they were signed into an account while making the contribution; None if the contributor was not signed in.

user_text

User name or IP address : str | None (if not specified in the XML)

If a user is logged in, this will reflect the user's account name. If the user is not logged in, this will usually be recorded as their IPv4 or IPv6 address in the XML.
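
As a rough illustration of how these two attributes combine with Revision.contributor, the sketch below tallies revisions by whether the contributor was signed in. The function name count_by_login_state and the category labels are hypothetical, and the revisions list holds stand-in objects with made-up values rather than real Revision instances.

```python
from collections import Counter
from types import SimpleNamespace


def count_by_login_state(revisions):
    """Tally revisions by whether the contributor was signed in."""
    counts = Counter()
    for revision in revisions:
        contributor = revision.contributor
        if contributor is None:
            # No contributor was specified in the XML.
            counts["no contributor"] += 1
        elif contributor.id is None:
            # No user ID: the contributor was not signed in.
            counts["not signed in"] += 1
        else:
            counts["signed in"] += 1
    return counts


# Stand-ins for Revision objects, carrying only the attributes used above.
revisions = [
    SimpleNamespace(id=10, contributor=SimpleNamespace(id=42, user_text="ExampleUser")),
    SimpleNamespace(id=11, contributor=SimpleNamespace(id=None, user_text="192.0.2.1")),
    SimpleNamespace(id=12, contributor=None),
]

counts = count_by_login_state(revisions)
print(dict(counts))
```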

class mw.xml_dump.Text

Revision text content. This class behaves identically to str except that it takes and stores an additional set of parameters.

deleted

Was the text deleted? : bool

xml_space

What to do with extra whitespace? : str

id

TODO: ??? : int | None

bytes

TODO: ??? : int | None

>>> from mw.xml_dump import Text
>>>
>>> t = Text("foo")
>>> t == "foo"
True
>>> t.deleted
False
>>> t.xml_space
'preserve'
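
Both Comment and Text are described as behaving like str while storing extra parameters. One common way to get that behavior in Python is to subclass str and set the extra attribute in __new__. The sketch below shows the pattern with a hypothetical class named FlaggedStr. It is an illustration of the technique, not necessarily how this library implements Comment or Text.

```python
class FlaggedStr(str):
    """Hypothetical str subclass that also stores a 'deleted' flag."""

    def __new__(cls, value="", deleted=False):
        # str is immutable, so the string value has to be set in __new__.
        instance = super().__new__(cls, value)
        instance.deleted = bool(deleted)
        return instance


s = FlaggedStr("foo")
print(s == "foo", s.deleted, s.upper())
```

Note that in this sketch, string methods such as upper() return plain str objects, so the flag is not carried over to derived strings.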

Errors

class mw.xml_dump.errors.FileTypeError

Raised when an XML dump file is not of an expected type.

class mw.xml_dump.errors.MalformedXML

Raised when an XML dump file is not formatted as expected.
