Distributed processing¶
The map function provides a means to perform distributed processing of XML dump files. This is particularly useful for processing multi-file XML dumps, but it can also be used to process XML dumps for many wikis at the same time.
-
mwxml.
map
(process, paths, threads=None)[source]¶ Implements a distributed stategy for processing XML files. This function constructs a set of py:mod:multiprocessing threads (spread over multiple cores) and uses an internal queue to aggregate outputs. To use this function, implement a process() function that takes two arguments – a
mwxml.Dump
and the path the dump was loaded from. Anything that this function ``yield``s will be yielded in turn from themwxml.map()
function.Parameters: - paths : iterable ( str | file )
a list of paths to dump files to process
- process : func
A function that takes a
Dump
and the path the dump was loaded from and yields- threads : int
the number of individual processing threads to spool up
Example: >>> import mwxml >>> files = ["examples/dump.xml", "examples/dump2.xml"] >>> >>> def page_info(dump, path): ... for page in dump: ... yield page.id, page.namespace, page.title ... >>> for id, namespace, title in mwxml.map(page_info, files): ... print(id, namespace, title) ...