mw.xml_dump – XML dump processing

This module is a collection of utilities for efficiently processing MediaWiki’s XML database dumps. There are two important concerns that this module intends to address: performance and the complexity of streaming XML parsing.

Performance: Performance is a serious concern when processing large XML database dumps. Unfortunately, the Global Interpreter Lock prevents Python threads from running on multiple CPUs at once. This library provides map(), a function that maps a dump-processing function over a set of dump files, using multiprocessing to distribute the work over multiple CPUs.
Complexity: Streaming XML parsing is gross. XML dumps consist of (1) some site meta data, (2) a collection of pages that contain (3) collections of revisions. The module allows you to think about dump files in this way and ignore the fact that you’re streaming XML. An Iterator contains site meta data and an iterator of Pages. A Page contains page meta data and an iterator of Revisions. A Revision contains revision meta data, including a Contributor (if a contributor was specified in the XML).
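To make the "streaming XML" point concrete, here is a minimal sketch of what page-by-page parsing looks like with only the standard library. It is not how mw.xml_dump is implemented; the sample XML is made up and heavily simplified (real dumps declare an XML namespace and carry many more fields), and iter_pages is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET

# A tiny, made-up dump fragment. Real dumps declare an XML namespace and
# carry many more fields; both are left out here for brevity.
SAMPLE_DUMP = """<mediawiki>
  <siteinfo><sitename>ExampleWiki</sitename></siteinfo>
  <page>
    <title>Foo</title><id>1</id>
    <revision><id>10</id></revision>
    <revision><id>11</id></revision>
  </page>
  <page>
    <title>Bar</title><id>2</id>
    <revision><id>20</id></revision>
  </page>
</mediawiki>"""


def iter_pages(xml_file):
    """Yield (page_id, title, [revision ids]) one page at a time."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if elem.tag == "page":
            # findtext() only looks at direct children, so this is the
            # page's own <id>, not a revision's.
            page_id = int(elem.findtext("id"))
            title = elem.findtext("title")
            rev_ids = [int(rev.findtext("id")) for rev in elem.findall("revision")]
            yield page_id, title, rev_ids
            elem.clear()  # release the finished page's subtree


pages = list(iter_pages(io.BytesIO(SAMPLE_DUMP.encode("utf-8"))))
for page_id, title, rev_ids in pages:
    print(page_id, title, rev_ids)
```

Note that this sketch holds all of a page's revisions in memory before yielding, which would be costly for pages with long histories. Hiding that kind of bookkeeping behind nested iterators is the convenience this module aims to provide.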

The map() function

mw.xml_dump.map(paths, process_dump, handle_error=re_raise, threads=4, output_buffer=100)

Maps a function across a set of dump files and returns an iterator over the output. The order of the output is not guaranteed.

The process_dump function must return an iterable object (such as a generator). If your process_dump function does not need to produce output, make it return an empty iterable upon completion (like an empty list).
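As an illustration of the "no output" case, the sketch below shows a process_dump-style function that only writes a file as a side effect and returns an empty list. FakePage and write_titles are hypothetical names, and a plain list stands in for the dump so the snippet runs without any dump files; with the real library the function would be passed to map() instead of being called directly.

```python
import os
import tempfile
from collections import namedtuple

# Hypothetical stand-in for the page objects a real dump would produce.
FakePage = namedtuple("FakePage", ["id", "namespace", "title"])


def write_titles(dump, path):
    """Write one page title per line to a file next to the dump."""
    out_path = path + ".titles.txt"
    with open(out_path, "w", encoding="utf-8") as out:
        for page in dump:
            out.write(page.title + "\n")
    return []  # nothing to hand back, so return an empty iterable


fake_dump = [FakePage(1, 0, "Foo"), FakePage(2, 0, "Bar")]
with tempfile.TemporaryDirectory() as tmp:
    dump_path = os.path.join(tmp, "dump.xml")
    output = list(write_titles(fake_dump, dump_path))
    with open(dump_path + ".titles.txt", encoding="utf-8") as f:
        titles = f.read().splitlines()

print(output, titles)
```

Writing to a file is used here because in-memory side effects (such as updating a module-level dictionary) made inside a worker process would generally not be visible to the calling process.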

Parameters:
    paths : iter( str )
        a list of paths to dump files to process
    process_dump : function( dump : `Iterator`, path : str )
        a function to run on every `Iterator`
    threads : int
        the number of individual processing threads to spool up
    output_buffer : int
        the maximum number of output values to buffer
Returns:
    An iterator over values yielded by calls to process_dump()
Example:
    from mw import xml_dump

    files = ["examples/dump.xml", "examples/dump2.xml"]

    def page_info(dump, path):
        for page in dump:
            yield page.id, page.namespace, page.title

    for page_id, page_namespace, page_title in xml_dump.map(files, page_info):
        print(" ".join([str(page_id), str(page_namespace), page_title]))

Iteration

class mw.xml_dump.Iterator(site_name=None, base=None, generator=None, case=None, namespaces=None, pages=None)

XML Dump Iterator. Dump file meta data and a Page iterator. Instances of this class can be used as iterators directly. E.g.:

from mw.xml_dump import Iterator

# Construct dump file iterator
dump = Iterator.from_file(open("example/dump.xml"))

# Iterate through pages
for page in dump:

    # Iterate through a page's revisions
    for revision in page:

        print(revision.id)

site_name: The name of the site. : str | None (if not specified in the XML)

base: TODO: ??? : str | None (if not specified in the XML)

generator: TODO: ??? : str | None (if not specified in the XML)

case: TODO: ??? : str | None (if not specified in the XML)

namespaces: A list of mw.Namespace | None (if not specified in the XML)

class mw.xml_dump.Page(id, title, namespace, redirect, restrictions, revisions=None)

Page meta data and a Revision iterator. Instances of this class can be used as iterators directly. E.g.:

import mw.xml_dump

page = mw.xml_dump.Page( ... )

# Print each revision ID alongside the ID of the page it belongs to
for revision in page:
    print("{0} {1}".format(revision.id, page.id))

id: Page ID : int

title: Page title (namespace excluded) : str

namespace: Namespace ID : int

redirect: Is the page currently a redirect? : Redirect | None

restrictions: A list of page editing restrictions (empty unless restrictions are specified) : list( str )

class mw.xml_dump.Redirect(*args, **kwargs)

Represents a redirect tag.

title: Full page name that this page is redirected to : str

class mw.xml_dump.Revision(id, timestamp, contributor=None, minor=None, comment=None, text=None, bytes=None, sha1=None, parent_id=None, model=None, format=None, beginningofpage=False)

Revision meta data.

id: Revision ID : int

timestamp: Revision timestamp : mw.Timestamp

contributor: Contributor meta data : Contributor | None

minor: Is revision a minor change? : bool

comment: Comment left with revision : Comment (behaves like str, with additional members)

text: Content of text : Text (behaves like str, with additional members)

bytes: Number of bytes of content : str

sha1: sha1 hash of the content : str

parent_id: Revision ID of preceding revision : int | None

model: TODO: ??? : str

format: TODO: ??? : str

beginningofpage: Is this the first revision of a page? : bool

Used to identify the first revision of a page when using Wikihadoop revision pairs. Otherwise, this is always set to False. Do not expect to use this when processing an XML dump directly.

class mw.xml_dump.Comment

A revision comment. This class behaves identically to str except that it takes and stores an additional parameter recording whether the comment was deleted or not.

>>> from mw.xml_dump import Comment
>>>
>>> c = Comment("foo")
>>> c == "foo"
True
>>> c.deleted
False

deleted: Was the comment deleted? : bool
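For readers curious how a class can behave like str while carrying an extra flag, the following is a minimal sketch of the general technique (subclassing str and attaching the attribute in __new__). FlaggedStr is a hypothetical name and this is not the library's actual implementation of Comment.

```python
class FlaggedStr(str):
    """A str that also remembers whether its content was deleted."""

    def __new__(cls, value="", deleted=False):
        # str is immutable, so the text itself has to be set in __new__.
        instance = super().__new__(cls, value)
        instance.deleted = bool(deleted)
        return instance


c = FlaggedStr("foo")
d = FlaggedStr("", deleted=True)
print(c == "foo", c.deleted, d.deleted)

# Inherited str methods return plain str objects, so the flag does not
# carry over to derived values.
print(type(c.upper()) is str)
```

One consequence of this design is that the extra attribute lives only on the original object: slicing, concatenating, or otherwise transforming the value yields an ordinary str without the flag.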

class mw.xml_dump.Contributor(id, user_text)

Contributor meta data.

id

User ID : int | None (if not specified in the XML)

The user ID if the contributor was signed into an account while making the contribution, and None if the contributor was not signed in.

user_text

User name or IP address : str | None (if not specified in the XML)

If a user is logged in, this will reflect the user's account name. If the user is not logged in, this will usually be recorded as the IPv4 or IPv6 address in the XML.
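Since user_text may hold either an account name or an IP address, it can be handy to tell the two apart. The sketch below uses the standard library's ipaddress module; is_ip_address is a hypothetical helper, not part of mw.xml_dump, and checking whether the contributor's id is None remains the signal described above for a contributor who was not signed in.

```python
import ipaddress


def is_ip_address(user_text):
    """Return True if user_text parses as an IPv4 or IPv6 address."""
    if user_text is None:
        return False
    try:
        ipaddress.ip_address(user_text)
    except ValueError:
        return False
    return True


examples = ["192.0.2.17", "2001:db8::1", "ExampleUser", None]
results = [is_ip_address(text) for text in examples]
print(results)
```

Because the text is only "usually" an IP address for anonymous contributors, this check is best treated as a heuristic rather than a definitive test.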

class mw.xml_dump.Text

Revision text content. This class behaves identically to str except that it takes and stores an additional set of parameters.

deleted: Was the text deleted? : bool
xml_space: What to do with extra whitespace? : str
id: TODO: ??? : int | None
bytes: TODO: ??? : int | None

>>> from mw.xml_dump import Text
>>>
>>> t = Text("foo")
>>> t == "foo"
True
>>> t.deleted
False
>>> t.xml_space
'preserve'

Errors

class mw.xml_dump.errors.FileTypeError: Raised when an XML dump file is not of an expected type.

class mw.xml_dump.errors.MalformedXML: Raised when an XML dump file is not formatted as expected.
