Command-line utilities

This module implements a set of utilities for generating content persistence data and statistics from the command line. When the mwpersistence Python package is installed, an mwpersistence utility should be available from the command line. Run mwpersistence -h for more information:

mwpersistence diffs2persistence

$ mwpersistence diffs2persistence -h

Generates token persistence information from JSON revision documents
annotated with diff information (see `mwdiffs dump2diffs|revdocs2diffs`).

This utility expects to be fed revision documents as a page-partitioned,
chronological sequence so that diffs can be processed in order.

This utility uses a processing 'window' to limit memory usage.  New
revisions enter the head of the window and old revisions fall off the tail.
Stats are generated at the tail of the window.

::
                           window
                      .------+------.

    revisions ========[=============]=============>

                    /                \
                [tail]              [head]


Usage:
    diffs2persistence (-h|--help)
    diffs2persistence [<input-file>...] --sunset=<date>
                      [--window=<revs>] [--revert-radius=<revs>]
                      [--keep-diff] [--threads=<num>] [--output=<path>]
                      [--compress=<type>] [--verbose] [--debug]

Options:
    -h|--help               Prints this documentation
    <input-file>            The path to a file containing page-partitioned
                            JSON revision documents with a 'diff' field to
                            process.
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --keep-diff             Do not drop 'diff' field data from the JSON
                            blobs.
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
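
The processing window described in the help text above can be pictured with a short sketch. This is not the mwpersistence implementation: the function name `slide_window` and the string revision labels are hypothetical, and the sketch only shows how a fixed-size window lets each revision "see" a bounded number of later revisions before it falls off the tail.

```python
from collections import deque


def slide_window(revisions, window_size):
    """Yield (revision, later_revisions_seen) as items leave the window tail.

    Hypothetical helper for illustration only.
    """
    window = deque()
    for revision in revisions:
        window.append(revision)  # a new revision enters at the head
        if len(window) > window_size:
            oldest = window.popleft()  # the oldest revision falls off the tail
            yield oldest, len(window)  # it was followed by len(window) revisions
    # The sequence is exhausted: flush whatever is still inside the window.
    while window:
        oldest = window.popleft()
        yield oldest, len(window)


emitted = list(slide_window(["r1", "r2", "r3", "r4", "r5"], window_size=3))
print(emitted)
# [('r1', 3), ('r2', 3), ('r3', 2), ('r4', 1), ('r5', 0)]
```

Memory use is bounded by the window size, and revisions near the end of a page are emitted after fewer subsequent revisions than the window would otherwise allow.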

mwpersistence persistence2stats

$ mwpersistence persistence2stats -h

Generates revision-level statistics from a sequence of revision documents
infused with token persistence information.

Usage:
    persistence2stats (-h | --help)
    persistence2stats [<input-file>...] [--min-persisted=<num>]
                      [--min-visible=<days>] [--include=<regex>]
                      [--exclude=<regex>] [--keep-tokens] [--threads=<num>]
                      [--output=<path>] [--compress=<type>] [--verbose]
                      [--debug]

Options:
    -h --help              Print this documentation
    <input-file>           The path to a file containing persistence data.
                           [default: <stdin>]
    --min-persisted=<num>  The minimum number of revisions a token must
                           survive before being considered "persisted"
                           [default: 5]
    --min-visible=<days>   The minimum amount of time a token must survive
                           before being considered "persisted" (in days)
                           [default: 14]
    --include=<regex>      A regex matching tokens to include (case
                           insensitive) [default: <all>]
    --exclude=<regex>      A regex matching tokens to exclude (case
                           insensitive) [default: <none>]
    --keep-tokens          Do not drop 'tokens' field data from the JSON
                           document.
    --threads=<num>        If a collection of files is provided, how many
                           processor threads should be prepared?
                           [default: <cpu_count>]
    --output=<path>        Write output to a directory with one output file
                           per input path.  [default: <stdout>]
    --compress=<type>      If set, output written to the output-dir will be
                           compressed in this format. [default: bz2]
    --verbose              Print out progress information
    --debug                Print debug logging to stderr.
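
As a rough illustration of how the thresholds and token filters above might interact, here is a minimal sketch. Several things are assumptions rather than documented behavior: the help text does not say whether `--min-persisted` and `--min-visible` are combined with "or" or "and" (the sketch assumes either one suffices), it does not say whether the regexes must match the whole token (the sketch uses a search), and the names `is_persisted` and `token_filter` are hypothetical.

```python
import re

SECONDS_PER_DAY = 86400


def is_persisted(revisions_survived, seconds_visible,
                 min_persisted=5, min_visible_days=14):
    """ASSUMPTION: meeting either threshold marks a token as persisted."""
    return (revisions_survived >= min_persisted or
            seconds_visible >= min_visible_days * SECONDS_PER_DAY)


def token_filter(include=None, exclude=None):
    """Build a predicate from optional case-insensitive include/exclude regexes."""
    include_re = re.compile(include, re.IGNORECASE) if include else None
    exclude_re = re.compile(exclude, re.IGNORECASE) if exclude else None

    def keep(token_text):
        if include_re is not None and not include_re.search(token_text):
            return False
        if exclude_re is not None and exclude_re.search(token_text):
            return False
        return True

    return keep


keep = token_filter(include=r"\w+", exclude=r"^the$")
print(keep("Wiki"), keep("THE"), keep("[["))
# True False False
print(is_persisted(5, 0), is_persisted(4, 13 * SECONDS_PER_DAY))
# True False
```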

mwpersistence dump2stats

$ mwpersistence dump2stats -h

Full pipeline from MediaWiki XML dumps to content persistence statistics.

Usage:
    dump2stats (-h|--help)
    dump2stats [<input-file>...] --config=<path> --sunset=<date>
               [--namespaces=<ids>] [--timeout=<secs>]
               [--window=<revs>] [--revert-radius=<revs>]
               [--min-persisted=<num>] [--min-visible=<days>]
               [--include=<regex>] [--exclude=<regex>]
               [--keep-text] [--keep-diff] [--keep-tokens]
               [--threads=<num>] [--output=<path>] [--compress=<type>]
               [--verbose] [--debug]

Options:
    -h|--help               Print this documentation
    <input-file>            The path to a MediaWiki XML Dump file
                            [default: <stdin>]
    --config=<path>         The path to a deltas DiffEngine configuration
    --namespaces=<ids>      A comma separated list of namespace IDs to be
                            considered [default: <all>]
    --timeout=<secs>        The maximum number of seconds that a diff will
                            be allowed to run before being stopped
                            [default: 10]
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --min-persisted=<num>   The minimum number of revisions a token must
                            survive before being considered "persisted"
                            [default: 5]
    --min-visible=<days>    The minimum amount of time a token must survive
                            before being considered "persisted" (in days)
                            [default: 14]
    --include=<regex>       A regex matching tokens to include
                            [default: <all>]
    --exclude=<regex>       A regex matching tokens to exclude
                            [default: <none>]
    --keep-text             If set, the 'text' field will be populated in
                            the output JSON.
    --keep-diff             If set, the 'diff' field will be populated in
                            the output JSON.
    --keep-tokens           If set, the 'tokens' field will be populated in
                            the output JSON.
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
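
The `--sunset` format above is a standard `strptime` pattern, so a value can be checked before a long run is started. The sketch below parses timestamps in that format and computes a "time visible" figure in seconds. The helper names are hypothetical, and treating time visible as the difference between the sunset and the moment a token was added is an assumption about how the statistic is applied.

```python
from datetime import datetime, timezone

SUNSET_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # the format quoted in the help text


def parse_timestamp(text):
    """Parse a timestamp such as 2015-01-15T00:00:00Z as UTC."""
    return datetime.strptime(text, SUNSET_FORMAT).replace(tzinfo=timezone.utc)


def seconds_visible(added, sunset):
    """ASSUMPTION: a surviving token is visible from `added` until `sunset`."""
    delta = parse_timestamp(sunset) - parse_timestamp(added)
    return int(delta.total_seconds())


print(seconds_visible("2015-01-01T00:00:00Z", "2015-01-15T00:00:00Z"))
# 1209600  (14 days)
```

A malformed value, for example one missing the trailing `Z`, raises `ValueError` in `parse_timestamp`.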

mwpersistence revdocs2stats

$ mwpersistence revdocs2stats -h

Full pipeline from JSON revision documents to content persistence
statistics.

Usage:
    revdocs2stats (-h|--help)
    revdocs2stats [<input-file>...] --config=<path> --sunset=<date>
                  [--namespaces=<ids>] [--timeout=<secs>]
                  [--window=<revs>] [--revert-radius=<revs>]
                  [--min-persisted=<num>] [--min-visible=<days>]
                  [--include=<regex>] [--exclude=<regex>]
                  [--keep-text] [--keep-diff] [--keep-tokens]
                  [--threads=<num>] [--output=<path>] [--compress=<type>]
                  [--verbose] [--debug]

Options:
    -h|--help               Print this documentation
    <input-file>            The path to a file of page-partitioned JSON
                            revision documents. [default: <stdin>]
    --config=<path>         The path to a deltas DiffEngine configuration
    --namespaces=<ids>      A comma separated list of namespace IDs to be
                            considered [default: <all>]
    --timeout=<secs>        The maximum number of seconds that a diff will
                            be allowed to run before being stopped
                            [default: 10]
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --min-persisted=<num>   The minimum number of revisions a token must
                            survive before being considered "persisted"
                            [default: 5]
    --min-visible=<days>    The minimum amount of time a token must survive
                            before being considered "persisted" (in days)
                            [default: 14]
    --include=<regex>       A regex matching tokens to include
                            [default: <all>]
    --exclude=<regex>       A regex matching tokens to exclude
                            [default: <none>]
    --keep-text             If set, the 'text' field will be populated in
                            the output JSON.
    --keep-diff             If set, the 'diff' field will be populated in
                            the output JSON.
    --keep-tokens           If set, the 'tokens' field will be populated in
                            the output JSON.
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
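
To make "page-partitioned JSON revision documents" and the `--namespaces` option more concrete, here is a minimal sketch that reads one JSON document per line, groups consecutive documents by page, and keeps only the requested namespaces. The field names (`page.id`, `page.namespace`) and the helper names are assumptions chosen for illustration; check them against the revision document schema you are actually using.

```python
import io
import json
from itertools import groupby


def parse_namespaces(text):
    """Turn a comma-separated list such as "0,1" into a set of ints."""
    return {int(ns) for ns in text.split(",")}


def read_pages(lines, namespaces=None):
    """Yield (page_id, revision_docs) for each run of same-page documents.

    groupby only merges adjacent items, which is why the input has to be
    page-partitioned.
    """
    docs = (json.loads(line) for line in lines if line.strip())
    for page_id, page_docs in groupby(docs, key=lambda d: d["page"]["id"]):
        page_docs = list(page_docs)
        if (namespaces is not None and
                page_docs[0]["page"]["namespace"] not in namespaces):
            continue  # skip pages outside the requested namespaces
        yield page_id, page_docs


sample = io.StringIO(
    '{"id": 1, "page": {"id": 10, "namespace": 0}}\n'
    '{"id": 2, "page": {"id": 10, "namespace": 0}}\n'
    '{"id": 3, "page": {"id": 11, "namespace": 1}}\n'
)
pages = [(page_id, [doc["id"] for doc in docs])
         for page_id, docs in read_pages(sample, parse_namespaces("0"))]
print(pages)
# [(10, [1, 2])]
```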