Distributed processing

The map function provides a means to perform distributed processing of XML dump files. This is particularly useful for processing multi-file XML dumps, but it can also be used to process XML dumps for many wikis at the same time.

mwxml.map(process, paths, threads=None)[source]

Implements a distributed stategy for processing XML files. This function constructs a set of py:mod:multiprocessing threads (spread over multiple cores) and uses an internal queue to aggregate outputs. To use this function, implement a process() function that takes two arguments – a mwxml.Dump and the path the dump was loaded from. Anything that this function ``yield``s will be yielded in turn from the mwxml.map() function.

Parameters:
paths
: iterable ( str | file )

a list of paths to dump files to process

process
: func

A function that takes a Dump and the path the dump was loaded from and yields

threads
: int

the number of individual processing threads to spool up

Example:
>>> import mwxml
>>> files = ["examples/dump.xml", "examples/dump2.xml"]
>>>
>>> def page_info(dump, path):
...     for page in dump:
...         yield page.id, page.namespace, page.title
...
>>> for id, namespace, title in mwxml.map(page_info, files):
...     print(id, namespace, title)
...