Command-line utilities

This module implements a set of utilities for generating diffs from the command-line. When the mwdiffs python package is installed, an mwdiffs utility should be available from the command-line. Run mwdiffs -h for more information:

mwdiffs dump2diffs

$ mwdiffs dump2diffs -h

Computes diffs from an XML dump.

Usage:
    dump2diffs (-h|--help)
    dump2diffs [<input-file>...] --config=<path> [--namespaces=<ids>]
               [--timeout=<secs>] [--keep-text] [--threads=<num>]
               [--output=<path>] [--compress=<type>] [--verbose] [--debug]

Options:
    -h|--help           Print this documentation
    <input-file>        The path to a MediaWiki XML Dump file
                        [default: <stdin>]
    --config=<path>     The path to a deltas DiffEngine configuration
    --namespaces=<ids>  A comma separated list of namespace IDs to be
                        considered [default: <all>]
    --timeout=<secs>    The maximum number of seconds that a diff will be
                        able to run before being stopped [default: 10]
    --keep-text         If set, the 'text' field will not be dropped after
                        diffs are computed.
    --threads=<num>     If a collection of files are provided, how many
                        processor threads? [default: <cpu_count>]
    --output=<path>     Write output to a directory with one output file
                        per input path.  [default: <stdout>]
    --compress=<type>   If set, output written to the output-dir will be
                        compressed in this format. [default: bz2]
    --verbose           Print progress information to stderr.
    --debug             Prints debug logs to stderr.
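
The options above can be combined into a single invocation. The following is a minimal sketch, not part of mwdiffs itself, of assembling such a command line from Python. The helper name build_dump2diffs_command and the file names are hypothetical; only flags listed in the help text above are used.

```python
import shlex


def build_dump2diffs_command(input_files, config, namespaces=None,
                             timeout=None, keep_text=False, output=None):
    """Assemble an argv list for `mwdiffs dump2diffs` (hypothetical helper)."""
    # --config is the only required option in the usage pattern.
    command = ["mwdiffs", "dump2diffs", *input_files, f"--config={config}"]
    if namespaces is not None:
        # --namespaces takes a comma separated list of namespace IDs.
        command.append("--namespaces=" + ",".join(str(ns) for ns in namespaces))
    if timeout is not None:
        command.append(f"--timeout={timeout}")
    if keep_text:
        command.append("--keep-text")
    if output is not None:
        command.append(f"--output={output}")
    return command


# Hypothetical file names, for illustration only.
command = build_dump2diffs_command(
    ["dump-part1.xml", "dump-part2.xml"],
    config="diff_config.yaml",
    namespaces=[0, 1],
    output="diffs",
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the mwdiffs package is installed. Building a list rather than a single string avoids shell quoting problems with unusual file names.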

mwdiffs revdocs2diffs

$ mwdiffs revdocs2diffs -h

Computes diffs from a page-partitioned sequence of JSON revision documents.

Usage:
    revdocs2diffs (-h|--help)
    revdocs2diffs [<input-file>...] --config=<path> [--namespaces=<ids>]
                  [--timeout=<secs>] [--keep-text] [--threads=<num>]
                  [--output=<path>] [--compress=<type>] [--verbose]
                  [--debug]

Options:
    -h|--help           Print this documentation
    <input-file>        The path to file containing a page-partitioned
                        sequence of JSON revision documents
                        [default: <stdin>]
    --config=<path>     The path to a deltas DiffEngine configuration
    --namespaces=<ids>  A comma separated list of namespace IDs to be
                        considered [default: <all>]
    --timeout=<secs>    The maximum number of seconds that a diff will be
                        able to run before being stopped [default: 10]
    --keep-text         If set, the 'text' field will be populated in the
                        output JSON.
    --threads=<num>     If a collection of files are provided, how many
                        processor threads? [default: <cpu_count>]
    --output=<path>     Write output to a directory with one output file
                        per input path.  [default: <stdout>]
    --compress=<type>   If set, output written to the output-dir will be
                        compressed in this format. [default: bz2]
    --verbose           Print progress information to stderr.
    --debug             Prints debug logs to stderr.
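
revdocs2diffs expects its input to be page-partitioned. The sketch below shows one way to check that property before running the utility. It assumes that "page-partitioned" means all revisions of a page appear in one contiguous run, that the input has one JSON document per line, and that each document stores its page ID under doc["page"]["id"]. None of these are confirmed by the help text above, and iter_pages is a hypothetical helper.

```python
import json
from itertools import groupby


def iter_pages(lines):
    """Yield (page_id, revisions) for each contiguous run of one page's revisions."""
    docs = (json.loads(line) for line in lines if line.strip())
    seen = set()
    # ASSUMPTION: the page ID is stored under doc["page"]["id"].
    for page_id, revisions in groupby(docs, key=lambda doc: doc["page"]["id"]):
        if page_id in seen:
            # The same page showed up in two separate runs: not partitioned.
            raise ValueError(f"page {page_id} appears in more than one run")
        seen.add(page_id)
        yield page_id, list(revisions)


# Made-up sample documents, for illustration only.
sample = [
    '{"id": 10, "page": {"id": 1}}',
    '{"id": 11, "page": {"id": 1}}',
    '{"id": 20, "page": {"id": 2}}',
]
pages = [(page_id, [rev["id"] for rev in revs])
         for page_id, revs in iter_pages(sample)]
print(pages)  # [(1, [10, 11]), (2, [20])]
```

Because groupby only merges adjacent items, a page whose revisions are split across the file produces two groups and triggers the ValueError.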
