MediaWiki XML Processing¶
This library contains a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: performance and the complexity of streaming XML parsing.
Complexity: Streaming XML parsing is gross. XML dumps are (1) some site metadata and (2) a collection of pages that contain (3) collections of revisions. This module lets you think about dump files in those terms and ignore the fact that you're streaming XML. A mwxml.Dump contains a mwxml.SiteInfo and an iterator of mwxml.Page's and/or mwxml.LogItem's. For dumps that contain <page> tags, a mwxml.Page contains page metadata and an iterator of mwxml.Revision's. A mwxml.Revision contains revision metadata and text. For dumps that contain <logitem> tags, a mwxml.LogItem contains log-entry metadata.
Performance: Performance is a serious concern when processing large XML database dumps. Regretfully, Python's Global Interpreter Lock prevents threads from running on multiple CPUs at once. This library provides mwxml.map(), a function that maps a dump-processing function over a set of dump files and uses multiprocessing to distribute the work across multiple CPUs.
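The sketch below shows one way mwxml.map() might be used. It assumes a user-defined processing function that receives a dump and its path and yields values back to the caller; the process_dump name and the glob pattern are placeholders, not part of the library.

    import glob
    import mwxml

    # Placeholder paths to a set of dump files; adjust the pattern for your setup.
    paths = glob.glob("dumps/enwiki-*-pages-meta-history*.xml*.bz2")

    def process_dump(dump, path):
        # Runs against one dump file in a worker process; yielded values are
        # collected by mwxml.map() into a single iterator for the caller.
        for page in dump:
            for revision in page:
                yield revision.id

    for rev_id in mwxml.map(process_dump, paths):
        print(rev_id)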
Basic example¶
>>> import mwxml
>>>
>>> dump = mwxml.Dump.from_file(open("dump.xml"))
>>> print(dump.site_info.name, dump.site_info.dbname)
Wikipedia enwiki
>>>
>>> for page in dump:
... for revision in page:
... print(revision.id)
...
1
2
3
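The example above assumes a dump of <page> entries. Since iterating a mwxml.Dump can yield mwxml.LogItem objects as well as mwxml.Page objects, one way to handle a mixed dump is a type check; the isinstance-based dispatch below is a rough sketch, not a documented idiom of the library.

    import mwxml

    dump = mwxml.Dump.from_file(open("dump.xml"))
    for item in dump:
        if isinstance(item, mwxml.Page):
            # Pages iterate over their revisions.
            for revision in item:
                print("revision", revision.id)
        elif isinstance(item, mwxml.LogItem):
            # Log items carry log-entry metadata only.
            print("log item", item.id)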
Authors¶
- Aaron Halfaker – https://github.com/halfak
Pull requests welcome @ https://github.com/halfak/python-mwxml