MediaWiki XML Processing

This library contains a collection of utilities for efficiently processing MediaWiki's XML database dumps. The module is designed to address two important concerns: the complexity of streaming XML parsing and performance.

Complexity: Streaming XML parsing is gross. XML dumps are (1) some site metadata, (2) a collection of pages that contain (3) collections of revisions. This module lets you think about dump files in that way and ignore the fact that you're streaming XML. A mwxml.Dump contains a mwxml.SiteInfo and an iterator of mwxml.Page and/or mwxml.LogItem objects. For dumps that contain <page> tags, a mwxml.Page contains page metadata and an iterator of mwxml.Revision objects, and a mwxml.Revision contains revision metadata and text. For dumps that contain <logitem> tags, a mwxml.LogItem contains log entry metadata.
Performance: Performance is a serious concern when processing large database XML dumps. Regretfully, Python's Global Interpreter Lock prevents us from running threads on multiple CPUs. This library provides mwxml.map(), a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work over multiple CPUs (see the sketch below).
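
A minimal sketch of that pattern follows. It assumes a (dump, path) processor signature and that mwxml.map() yields each value the processor yields; the glob pattern is a placeholder. Check both assumptions against the library's reference documentation.

>>> import glob
>>> import mwxml
>>>
>>> paths = glob.glob("dumps/*.xml*")  # placeholder; point this at real dump files
>>>
>>> def process_dump(dump, path):
...     # Yield one value per revision; mwxml.map() collects the yields from every worker.
...     for page in dump:
...         for revision in page:
...             yield revision.id
...
>>> for rev_id in mwxml.map(process_dump, paths):
...     print(rev_id)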

Basic example

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
...     for revision in page:
...        print(revision.id)
...
1
2
3
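
The basic example above iterates over pages; dumps that contain <logitem> entries yield mwxml.LogItem objects instead of (or alongside) pages. The isinstance() check below is one way to handle both kinds of item. It is a sketch, and the item.id attribute on log items is an assumption to verify against the mwxml.LogItem reference.

>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("logging-dump.xml"))  # hypothetical dump containing <logitem> entries
>>> for item in dump:
...     if isinstance(item, mwxml.LogItem):
...         print("log item", item.id)    # attribute name assumed; see the LogItem docs
...     else:
...         for revision in item:         # a mwxml.Page yields its revisions
...             print("revision", revision.id)
...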

Authors

Pull requests welcome @ https://github.com/halfak/python-mwxml
