Processing functions

mwdiffs.utilities.dump2diffs(dump, *args, **kwargs)

Generates a sequence of revision JSON documents containing a ‘diff’ field that represents the change to the text between revisions.

Parameters:
dump : mwxml.Dump

An XML dump to process

diff_engine : deltas.DiffEngine

A configured diff engine for comparing revisions

namespaces : set ( int )

A set of namespace IDs that will be processed. If left unspecified, all namespaces will be processed.

timeout : float

The maximum time, in seconds, that a single diff operation is allowed to consume. This guards against the extremely computationally complex diffs that occur from time to time. When a diff takes longer than this many seconds, a trivial diff is reported (remove all the tokens and add them back) and the ‘timedout’ field is set to True.

verbose : bool

Print progress output (dots and similar markers) to stderr
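
The timeout fallback described above can be illustrated with a small standalone sketch. This is not the mwdiffs implementation: `diff_with_timeout`, `trivial_ops`, `slow_diff`, and the tuple-based operation format are hypothetical names chosen for illustration, and a worker thread is only one possible way to enforce a time limit.

```python
import concurrent.futures
import time


def trivial_ops(old_tokens, new_tokens):
    # Hypothetical operation format: (name, start, end).
    # Remove every old token, then add every new token back.
    return [("delete", 0, len(old_tokens)), ("insert", 0, len(new_tokens))]


def diff_with_timeout(diff_fn, old_tokens, new_tokens, timeout=None):
    # Run diff_fn in a worker thread so the wait can be bounded.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(diff_fn, old_tokens, new_tokens)
    try:
        # timeout=None waits for as long as the diff takes.
        ops = future.result(timeout=timeout)
        timedout = False
    except concurrent.futures.TimeoutError:
        # Too slow: fall back to the trivial diff and flag it.
        ops = trivial_ops(old_tokens, new_tokens)
        timedout = True
    finally:
        # Do not block on a worker that is still running.
        executor.shutdown(wait=False)
    return {"ops": ops, "timedout": timedout}


def slow_diff(old_tokens, new_tokens):
    # Stand-in for a pathologically expensive comparison.
    time.sleep(0.3)
    return []


result = diff_with_timeout(slow_diff, ["a", "b"], ["a", "c", "d"], timeout=0.05)
```

In this sketch the slow comparison exceeds the 0.05 second limit, so `result` carries the trivial operations and a `timedout` flag set to True. A consumer of the real output would check the ‘timedout’ field in the same way before trusting a diff.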

mwdiffs.utilities.revdocs2diffs(rev_docs, diff_engine, namespaces=None, timeout=None, verbose=False)

Generates a sequence of revision JSON documents containing a ‘diff’ field that represents the change to the text between revisions.

Parameters:
rev_docs : iterable ( dict )

A page-partitioned sequence of JSON revision documents

diff_engine : deltas.DiffEngine

A configured diff engine for comparing revisions

namespaces : set ( int )

A set of namespace IDs that will be processed. If left unspecified, all namespaces will be processed.

timeout : float

The maximum time, in seconds, that a single diff operation is allowed to consume. This guards against the extremely computationally complex diffs that occur from time to time. When a diff takes longer than this many seconds, a trivial diff is reported (remove all the tokens and add them back) and the ‘timedout’ field is set to True.

verbose : bool

Print progress output (dots and similar markers) to stderr
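
To make "page-partitioned" and the namespace filter concrete, here is a minimal standalone sketch of the general shape of such a generator. It does not call mwdiffs: `sketch_revdocs2diffs` is a hypothetical name, the document fields (`page.id`, `page.namespace`, `text`) are assumptions about the revision document layout, whitespace splitting stands in for real tokenization, and the standard library's `difflib.SequenceMatcher` stands in for a configured `deltas.DiffEngine`.

```python
import difflib
from itertools import groupby


def sketch_revdocs2diffs(rev_docs, namespaces=None):
    # Page-partitioned input: all revisions of a page arrive together,
    # so consecutive documents can be grouped by page ID.
    for _page_id, page_revs in groupby(rev_docs, key=lambda d: d["page"]["id"]):
        last_tokens = []  # each page starts from empty text
        for doc in page_revs:
            # namespaces=None means every namespace is processed.
            if namespaces is not None and doc["page"]["namespace"] not in namespaces:
                continue
            tokens = (doc.get("text") or "").split()
            matcher = difflib.SequenceMatcher(None, last_tokens, tokens, autojunk=False)
            # Copy the document and attach a 'diff' field (assumed layout).
            yield dict(doc, diff={"ops": matcher.get_opcodes()})
            last_tokens = tokens


docs = [
    {"id": 1, "page": {"id": 10, "namespace": 0}, "text": "hello world"},
    {"id": 2, "page": {"id": 10, "namespace": 0}, "text": "hello there world"},
    {"id": 3, "page": {"id": 11, "namespace": 1}, "text": "talk page text"},
]
diff_docs = list(sketch_revdocs2diffs(docs, namespaces={0}))
```

Here only the two revisions in namespace 0 are emitted, and each is compared against the previous revision of the same page. Resetting the previous text at every page boundary is why the input must be page-partitioned: interleaved pages would be diffed against the wrong revision.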
