This module is a collection of utilities for efficiently processing MediaWiki’s XML database dumps. It addresses two concerns: performance and the complexity of streaming XML parsing.
Maps a function across a set of dump files and returns an iterator over the output. The order of the results is not guaranteed.
The process_dump function must return an iterable object (such as a generator). If your process_dump function does not need to produce output, make it return an empty iterable upon completion (like an empty list).
Parameters: | |
---|---|
Returns: | An iterator over values yielded by calls to process_dump() |
Example: | from mw import xml_dump
files = ["examples/dump.xml", "examples/dump2.xml"]
def page_info(dump, path):
    for page in dump:
        yield page.id, page.namespace, page.title
for page_id, page_namespace, page_title in xml_dump.map(files, page_info):
    print(" ".join([str(page_id), str(page_namespace), page_title]))
|
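The requirement that process_dump return an iterable, even when it has nothing to report, can be illustrated without the library. The following is a minimal sketch under stated assumptions: `sequential_map` and `FAKE_DUMPS` are hypothetical stand-ins invented for this illustration, not the implementation of xml_dump.map (which may process files in parallel and return results in any order).

```python
from itertools import chain

# Hypothetical stand-in data: each "dump" is just a list of page titles.
FAKE_DUMPS = {"a.xml": ["Foo", "Bar"], "b.xml": ["Baz"]}

page_counts = {}

def count_pages(dump, path):
    """Side-effect-only processor: records a count and produces no output."""
    page_counts[path] = sum(1 for _ in dump)
    return []  # empty iterable, so the caller can still iterate the result

def titles(dump, path):
    """Generator processor: yields one value per page."""
    for title in dump:
        yield path, title

def sequential_map(paths, process_dump):
    """Hypothetical single-process stand-in for a map over dump files."""
    return chain.from_iterable(process_dump(FAKE_DUMPS[p], p) for p in paths)

no_output = list(sequential_map(["a.xml", "b.xml"], count_pages))
with_output = list(sequential_map(["a.xml", "b.xml"], titles))
```

If `count_pages` returned None instead of an empty list, the chaining step would fail when it tried to iterate the result, which is presumably why the empty iterable is required.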
XML Dump Iterator. Dump file meta data and a Page iterator. Instances of this class can be iterated over directly. E.g.:
from mw.xml_dump import Iterator
# Construct dump file iterator
dump = Iterator.from_file(open("example/dump.xml"))
# Iterate through pages
for page in dump:
    # Iterate through a page's revisions
    for revision in page:
        print(revision.id)
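To show the kind of streaming XML handling that the Iterator class hides, here is a minimal sketch using only the standard library. It is not the library's implementation: `iter_pages` and the `SAMPLE` document are hypothetical, the sample omits the XML namespace that real MediaWiki dumps declare, and unlike the class above it buffers all of a page's revisions before yielding.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified dump (real dumps declare an XML namespace).
SAMPLE = b"""<mediawiki>
  <page>
    <title>Foo</title>
    <id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title>
    <id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""

def iter_pages(file_obj):
    """Yield (page_id, title, revision_ids) one page at a time."""
    for _event, elem in ET.iterparse(file_obj, events=("end",)):
        if elem.tag == "page":
            page_id = int(elem.findtext("id"))  # direct child only
            title = elem.findtext("title")
            rev_ids = [int(r.findtext("id")) for r in elem.findall("revision")]
            yield page_id, title, rev_ids
            elem.clear()  # drop the finished page's subtree to limit memory

pages = list(iter_pages(io.BytesIO(SAMPLE)))
```

Clearing each page element after it has been consumed is what keeps memory use roughly proportional to one page rather than to the whole file.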
The name of the site. : str | None (if not specified in the XML)
TODO: ??? : str | None (if not specified in the XML)
TODO: ??? : str | None (if not specified in the XML)
TODO: ??? : str | None (if not specified in the XML)
A list of mw.Namespace | None (if not specified in the XML)
Page meta data and a Revision iterator. Instances of this class can be iterated over directly. E.g.:
page = mw.xml_dump.Page( ... )
for revision in page:
    print("{0} {1}".format(revision.id, page.id))
Page ID : int
Page title (namespace excluded) : str
Namespace ID : int
A list of page editing restrictions (empty unless restrictions are specified) : list( str )
Represents a redirect tag.
Revision meta data.
Revision ID : int
Revision timestamp : mw.Timestamp
Contributor meta data : Contributor | None
Is revision a minor change? : bool
Number of bytes of content : str
sha1 hash of the content : str
Revision ID of preceding revision : int | None
TODO: ??? : str
TODO: ??? : str
Is the first revision of a page : bool
Used to identify the first revision of a page when using Wikihadoop revision pairs. Otherwise, it is always set to False. Do not expect to use this when processing an XML dump directly.
A revision comment. This class behaves identically to str except that it takes and stores an additional parameter recording whether the comment was deleted or not.
>>> from mw.xml_dump import Comment
>>>
>>> c = Comment("foo")
>>> c == "foo"
True
>>> c.deleted
False
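One way a class can behave identically to str while storing an extra flag is to subclass str and set the attribute in `__new__`. The sketch below is illustrative only: `FlaggedStr` is a hypothetical name, and nothing here is taken from the actual source of Comment.

```python
class FlaggedStr(str):
    """A str subclass carrying an extra `deleted` flag (illustrative only)."""

    def __new__(cls, value="", deleted=False):
        # str is immutable, so the value must be set in __new__, not __init__.
        instance = super().__new__(cls, value)
        instance.deleted = bool(deleted)
        return instance

c = FlaggedStr("foo")
d = FlaggedStr("", deleted=True)
```

Because the instance is a real str, comparisons and string methods work unchanged, although methods such as `upper()` return a plain str without the flag.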
Contributor meta data.
User ID : int | None (if not specified in the XML)
User ID of the user if the contributor was signed into an account while making the contribution, and None if the contributor was not signed in.
User name or IP address : str | None (if not specified in the XML)
If a user is logged in, this will reflect the user's account name. If the user is not logged in, this will usually be recorded as the IPv4 or IPv6 address in the XML.
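Since the user text may hold either an account name or an IP address, downstream code sometimes needs to tell the two apart. A minimal sketch, assuming the value is a plain string or None: `looks_like_ip` is a hypothetical helper, not part of this module, and relies on the standard library's ipaddress parser.

```python
import ipaddress

def looks_like_ip(user_text):
    """Return True if user_text parses as an IPv4 or IPv6 address."""
    if user_text is None:
        return False
    try:
        ipaddress.ip_address(user_text)
    except ValueError:
        return False
    return True

checks = [looks_like_ip(v) for v in ("192.0.2.1", "2001:db8::1", "ExampleUser", None)]
```

Checking whether the user ID is None is the more direct test for an anonymous contributor. A check like this is only a fallback for cases where just the text is available.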
Revision text content. This class behaves identically to str except that it takes and stores an additional set of parameters.
>>> from mw.xml_dump import Text
>>>
>>> t = Text("foo")
>>> t == "foo"
True
>>> t.deleted
False
>>> t.xml_space
'preserve'