Command-line utilities

This module implements a set of utilities for generating diffs and content persistence statistics from the command line. When the mwpersistence Python package is installed, an mwpersistence utility should be available from the command line. Run mwpersistence -h for more information:

mwpersistence diffs2persistence

$ mwpersistence diffs2persistence -h

Generates token persistence information from JSON revision documents
annotated with diff information (see `mwdiffs dump2diffs|revdocs2diffs`).

This utility expects to be fed revision documents as a page-partitioned,
chronological sequence so that diffs can be processed in order.

This utility uses a processing 'window' to limit memory usage.  New
revisions enter the head of the window and old revisions fall off the tail.
Stats are generated at the tail of the window.


    revisions ========[=============]=============>

                    /                \
                [tail]              [head]

    diffs2persistence (-h|--help)
    diffs2persistence [<input-file>...] --sunset=<date>
                      [--window=<revs>] [--revert-radius=<revs>]
                      [--keep-diff] [--threads=<num>] [--output=<path>]
                      [--compress=<type>] [--verbose] [--debug]

    -h|--help               Prints this documentation
    <input-file>            The path to a file containing page-partitioned
                            JSON revision documents with a 'diff' field.
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --keep-diff             Do not drop 'diff' field data from the JSON
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
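
The processing window described above can be pictured with a short Python sketch. This is not the code mwpersistence uses; it is a minimal illustration that assumes the window is a plain queue. The function name slide and the integer "revisions" are made up for the example.

```python
from collections import deque


def slide(revisions, window_size=50):
    """Yield (oldest, later) pairs as revisions fall off the tail.

    `later` holds the revisions still inside the window, i.e. the ones
    that came after `oldest`. Hypothetical helper for illustration only.
    """
    window = deque()
    for revision in revisions:
        window.append(revision)  # new revision enters at the head
        if len(window) > window_size:
            oldest = window.popleft()  # old revision falls off the tail
            yield oldest, list(window)
    # End of input: flush the remaining revisions, oldest first.
    while window:
        oldest = window.popleft()
        yield oldest, list(window)


for oldest, later in slide([1, 2, 3, 4, 5], window_size=2):
    print(oldest, later)
```

Because a revision is only reported once it leaves the tail, memory use is bounded by the window size, and each reported revision has been compared against up to window_size later revisions.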

mwpersistence persistence2stats

$ mwpersistence persistence2stats -h

Generates revision-level statistics from a sequence of revision documents
infused with token persistence information.

    persistence2stats (-h | --help)
    persistence2stats [<input-file>...] [--min-persisted=<num>]
                      [--min-visible=<days>] [--include=<regex>]
                      [--exclude=<regex>] [--keep-tokens] [--threads=<num>]
                      [--output=<path>] [--compress=<type>] [--verbose]

    -h --help              Print this documentation
    <input-file>           The path to a file containing persistence data.
                           [default: <stdin>]
    --min-persisted=<num>  The minimum number of revisions a token must
                           survive before being considered "persisted"
                           [default: 5]
    --min-visible=<days>   The minimum amount of time a token must survive
                           before being considered "persisted" (in days)
                           [default: 14]
    --include=<regex>      A regex matching tokens to include (case
                           insensitive) [default: <all>]
    --exclude=<regex>      A regex matching tokens to exclude (case
                           insensitive) [default: <none>]
    --keep-tokens          Do not drop 'tokens' field data from the JSON
    --threads=<num>        If a collection of files is provided, how many
                           processor threads should be prepared?
                           [default: <cpu_count>]
    --output=<path>        Write output to a directory with one output file
                           per input path.  [default: <stdout>]
    --compress=<type>      If set, output written to the output-dir will be
                           compressed in this format. [default: bz2]
    --verbose              Print out progress information
    --debug                Print debug logging to stderr.
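
To make the filtering and threshold options above more concrete, here is a rough Python sketch. It is an illustration under assumptions, not the package's implementation: the token field names (text, revisions_survived, seconds_visible) are hypothetical, the regexes are applied with re.search, and the two thresholds are reported separately because the help text does not say how the tool combines them.

```python
import re

SECONDS_PER_DAY = 24 * 60 * 60


def select_tokens(tokens, include=None, exclude=None):
    """Yield tokens whose text passes the include/exclude regexes.

    Matching is case insensitive, as the option help describes.
    Using re.search (rather than a full match) is an assumption.
    """
    inc = re.compile(include, re.IGNORECASE) if include else None
    exc = re.compile(exclude, re.IGNORECASE) if exclude else None
    for token in tokens:
        text = token["text"]
        if inc is not None and not inc.search(text):
            continue
        if exc is not None and exc.search(text):
            continue
        yield token


def threshold_flags(token, min_persisted=5, min_visible_days=14):
    """Report each threshold separately (field names are hypothetical)."""
    min_visible_seconds = min_visible_days * SECONDS_PER_DAY
    return {
        "met_min_persisted": token["revisions_survived"] >= min_persisted,
        "met_min_visible": token["seconds_visible"] >= min_visible_seconds,
    }


tokens = [
    {"text": "Hello", "revisions_survived": 6,
     "seconds_visible": 20 * SECONDS_PER_DAY},
    {"text": "[[", "revisions_survived": 2,
     "seconds_visible": 1 * SECONDS_PER_DAY},
]
words_only = list(select_tokens(tokens, include=r"\w+"))
```

Here words_only keeps only the "Hello" token, since "[[" contains no word characters.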

mwpersistence dump2stats

$ mwpersistence dump2stats -h

Full pipeline from MediaWiki XML dumps to content persistence statistics.

    dump2stats (-h|--help)
    dump2stats [<input-file>...] --config=<path> --sunset=<date>
               [--namespaces=<ids>] [--timeout=<secs>]
               [--window=<revs>] [--revert-radius=<revs>]
               [--min-persisted=<num>] [--min-visible=<days>]
               [--include=<regex>] [--exclude=<regex>]
               [--keep-text] [--keep-diff] [--keep-tokens]
               [--threads=<num>] [--output=<path>] [--compress=<type>]
               [--verbose] [--debug]

    -h|--help               Print this documentation
    <input-file>            The path to a MediaWiki XML Dump file
                            [default: <stdin>]
    --config=<path>         The path to a deltas DiffEngine configuration
    --namespaces=<ids>      A comma separated list of namespace IDs to be
                            considered [default: <all>]
    --timeout=<secs>        The maximum number of seconds that a diff will
                            be allowed to run before being stopped
                            [default: 10]
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --min-persisted=<num>   The minimum number of revisions a token must
                            survive before being considered "persisted"
                            [default: 5]
    --min-visible=<days>    The minimum amount of time a token must survive
                            before being considered "persisted" (in days)
                            [default: 14]
    --include=<regex>       A regex matching tokens to include
                            [default: <all>]
    --exclude=<regex>       A regex matching tokens to exclude
                            [default: <none>]
    --keep-text             If set, the 'text' field will be populated in
                            the output JSON.
    --keep-diff             If set, the 'diff' field will be populated in
                            the output JSON.
    --keep-tokens           If set, the 'tokens' field will be populated in
                            the output JSON.
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
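
The --sunset value uses the timestamp format %Y-%m-%dT%H:%M:%SZ. The following Python sketch parses such a value and measures how long something added at a given time has been visible up to the sunset. Treating 'time visible' as a simple difference between the two timestamps is an assumption made for illustration, and both function names are hypothetical.

```python
from datetime import datetime, timezone

SUNSET_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a timestamp such as 2015-01-15T00:00:00Z as UTC."""
    return datetime.strptime(value, SUNSET_FORMAT).replace(tzinfo=timezone.utc)


def seconds_visible_until_sunset(added_at, sunset):
    """Seconds between the time content was added and the sunset date."""
    delta = parse_timestamp(sunset) - parse_timestamp(added_at)
    return delta.total_seconds()


visible = seconds_visible_until_sunset(
    "2015-01-01T00:00:00Z", "2015-01-15T00:00:00Z"
)
print(visible / (24 * 60 * 60))  # days between the two timestamps
```

A value that does not follow the format, for example one missing the trailing Z, raises ValueError in strptime.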

mwpersistence revdocs2stats

$ mwpersistence revdocs2stats -h

Full pipeline from JSON revision documents to content persistence statistics.

    revdocs2stats (-h|--help)
    revdocs2stats [<input-file>...] --config=<path> --sunset=<date>
                  [--namespaces=<ids>] [--timeout=<secs>]
                  [--window=<revs>] [--revert-radius=<revs>]
                  [--min-persisted=<num>] [--min-visible=<days>]
                  [--include=<regex>] [--exclude=<regex>]
                  [--keep-text] [--keep-diff] [--keep-tokens]
                  [--threads=<num>] [--output=<path>] [--compress=<type>]
                  [--verbose] [--debug]

    -h|--help               Print this documentation
    <input-file>            The path to a file of page-partitioned JSON
                            revision documents. [default: <stdin>]
    --config=<path>         The path to a deltas DiffEngine configuration
    --namespaces=<ids>      A comma separated list of namespace IDs to be
                            considered [default: <all>]
    --timeout=<secs>        The maximum number of seconds that a diff will
                            be allowed to run before being stopped
                            [default: 10]
    --sunset=<date>         The date of the database dump we are generating
                            from.  This is used to apply a 'time visible'
                            statistic.  Expects "%Y-%m-%dT%H:%M:%SZ".
                            [default: <now>]
    --window=<revs>         The size of the window of revisions from which
                            persistence data will be generated.
                            [default: 50]
    --revert-radius=<revs>  The number of revisions back that a revert can
                            reference. [default: 15]
    --min-persisted=<num>   The minimum number of revisions a token must
                            survive before being considered "persisted"
                            [default: 5]
    --min-visible=<days>    The minimum amount of time a token must survive
                            before being considered "persisted" (in days)
                            [default: 14]
    --include=<regex>       A regex matching tokens to include
                            [default: <all>]
    --exclude=<regex>       A regex matching tokens to exclude
                            [default: <none>]
    --keep-text             If set, the 'text' field will be populated in
                            the output JSON.
    --keep-diff             If set, the 'diff' field will be populated in
                            the output JSON.
    --keep-tokens           If set, the 'tokens' field will be populated in
                            the output JSON.
    --threads=<num>         If a collection of files is provided, how many
                            processor threads should be prepared?
                            [default: <cpu_count>]
    --output=<path>         Write output to a directory with one output
                            file per input path.  [default: <stdout>]
    --compress=<type>       If set, output written to the output-dir will
                            be compressed in this format. [default: bz2]
    --verbose               Print progress information to stderr.
    --debug                 Print debug logging to stderr.
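
When the pipeline is driven from a script, the command line can be assembled from the options documented above. The Python sketch below builds an argument list for revdocs2stats. The helper name and the file names (revdocs-1.json, diffengine.yaml, stats/) are hypothetical placeholders; only the flags come from the usage text.

```python
import shlex


def revdocs2stats_argv(input_files, config, sunset, threads=None,
                       output=None, compress=None, verbose=False):
    """Build an argv list for `mwpersistence revdocs2stats`.

    Only flags listed in the help text are used. Optional flags are
    left out so the tool falls back to its documented defaults.
    """
    argv = ["mwpersistence", "revdocs2stats", *input_files,
            "--config=" + config, "--sunset=" + sunset]
    if threads is not None:
        argv.append("--threads={}".format(threads))
    if output is not None:
        argv.append("--output=" + output)
    if compress is not None:
        argv.append("--compress=" + compress)
    if verbose:
        argv.append("--verbose")
    return argv


argv = revdocs2stats_argv(
    ["revdocs-1.json"],            # hypothetical input file
    config="diffengine.yaml",      # hypothetical DiffEngine configuration
    sunset="2015-07-01T00:00:00Z",
    threads=4,
    output="stats/",
)
print(shlex.join(argv))  # shell-quoted form of the command
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where the mwpersistence utility is installed.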