MediaWiki XML Processing

This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: performance and the complexity of streaming XML parsing.

Complexity: Streaming XML parsing is gross. XML dumps consist of (1) some site metadata, (2) a collection of pages that contain (3) collections of revisions. This module lets you think about dump files in exactly those terms and ignore the fact that you're streaming XML. A mwxml.Dump contains a mwxml.SiteInfo and an iterator of mwxml.Page and/or mwxml.LogItem objects. For dumps that contain <page> tags, a mwxml.Page contains page metadata and an iterator of mwxml.Revision objects; a mwxml.Revision contains revision metadata and text. For dumps that contain <logitem> tags, a mwxml.LogItem contains the log entry's metadata.
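To make the "streaming XML parsing is gross" point concrete, here is a minimal sketch of the bookkeeping you would do by hand without this library, using only the standard library's xml.etree.ElementTree.iterparse. The dump layout shown is a heavily simplified, hypothetical stand-in (real dumps are namespaced and much richer):

```python
# Sketch: raw streaming XML parsing over a simplified, hypothetical dump.
import io
import xml.etree.ElementTree as ET

DUMP = b"""<mediawiki>
  <siteinfo><dbname>enwiki</dbname></siteinfo>
  <page>
    <title>Foo</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
</mediawiki>"""

def iter_revisions(f):
    """Yield (page_title, revision_id) pairs from a dump stream."""
    title = None
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag == "title":
            title = elem.text          # remember which page we are in
        elif elem.tag == "revision":
            yield title, int(elem.find("id").text)
            elem.clear()               # free memory as we stream

revisions = list(iter_revisions(io.BytesIO(DUMP)))
print(revisions)  # [('Foo', 1), ('Foo', 2)]
```

Tracking "which page am I inside?" by hand is exactly the state management that the Dump → Page → Revision iterator hierarchy hides.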
Performance: Performance is a serious concern when processing large XML database dumps. Regretfully, Python's Global Interpreter Lock prevents us from running threads on multiple CPUs. This library provides a map() function that applies a dump-processing function over a set of dump files, using multiprocessing to distribute the work across multiple CPUs.
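The underlying pattern can be sketched with the standard library alone. Note that process_dump and the file list below are illustrative stand-ins, not this library's actual API; the point is only that per-file work is farmed out to worker processes, sidestepping the GIL:

```python
# Sketch: distributing per-file dump processing across CPUs with
# multiprocessing. process_dump and paths are illustrative placeholders.
import multiprocessing as mp

def process_dump(path):
    # In real use this would open the dump at `path` and extract data;
    # here we just return a dummy (path, length) pair.
    return path, len(path)

paths = ["a.xml", "bb.xml", "ccc.xml"]

# The "fork" start method (POSIX-only) lets this sketch skip the usual
# `if __name__ == "__main__":` guard.
ctx = mp.get_context("fork")
with ctx.Pool(processes=2) as pool:
    # Each path is handled in a worker process; results keep input order.
    results = pool.map(process_dump, paths)
print(results)  # [('a.xml', 5), ('bb.xml', 6), ('ccc.xml', 7)]
```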

Basic example

>>> import mwxml
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>> for page in dump:
...     for revision in page:
...        print(revision.id)


Pull requests welcome @
