Command-line utilities
This module implements a set of utilities for generating diffs, content persistence data, and statistics from the command line. When the mwpersistence Python package is installed, an mwpersistence utility should be available from the command line. Run mwpersistence -h for more information:
mwpersistence diffs2persistence
$ mwpersistence diffs2persistence -h
Generates token persistence information from JSON revision documents
annotated with diff information (see `mwdiffs dump2diffs|revdocs2diffs`).
This utility expects revision documents as a page-partitioned,
chronological sequence so that diffs can be processed in order.
This utility uses a processing 'window' to limit memory usage. New
revisions enter the head of the window and old revisions fall off the tail.
Stats are generated at the tail of the window.
::
                          window
                      .------+------.
    revisions ========[=============]=============>
                     /               \
                 [tail]             [head]
Usage:
diffs2persistence (-h|--help)
diffs2persistence [<input-file>...] --sunset=<date>
[--window=<revs>] [--revert-radius=<revs>]
[--keep-diff] [--threads=<num>] [--output=<path>]
[--compress=<type>] [--verbose] [--debug]
Options:
-h|--help Prints this documentation
<input-file> The path to a file containing page-partitioned
JSON revision documents with a 'diff' field to
process.
--sunset=<date> The date of the database dump we are generating
from. This is used to apply a 'time visible'
statistic. Expects "%Y-%m-%dT%H:%M:%SZ".
[default: <now>]
--window=<revs> The size of the window of revisions from which
persistence data will be generated.
[default: 50]
--revert-radius=<revs> The number of revisions back that a revert can
reference. [default: 15]
--keep-diff Do not drop 'diff' field data from the JSON
blobs.
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> Write output to a directory with one output
file per input path. [default: <stdout>]
--compress=<type> If set, output written to the output-dir will
be compressed in this format. [default: bz2]
--verbose Print progress information (dots) to stderr.
--debug Print debug logging to stderr.
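The processing window described in this help text can be pictured with a short Python sketch. It is not mwpersistence's implementation: the `windowed` helper, the use of plain integers in place of revision documents, and the choice to flush the remaining revisions once the input ends are all assumptions made for illustration.

```python
from collections import deque


def windowed(revisions, window_size=50):
    """Yield (revision, later_revisions_seen) as revisions leave the tail.

    Hypothetical helper: a sketch of the 'window' idea, not the tool's code.
    """
    window = deque()
    for revision in revisions:
        window.append(revision)  # a new revision enters at the head
        if len(window) > window_size:
            oldest = window.popleft()  # the oldest revision falls off the tail
            yield oldest, len(window)
    # Assumption: when the input ends, the rest of the window is flushed.
    while window:
        oldest = window.popleft()
        yield oldest, len(window)


# Integers stand in for revision documents of a single page.
emitted = list(windowed(range(1, 8), window_size=3))
print(emitted)
```

With a window of 3, every revision emitted while input is still arriving has been followed by 3 later revisions. Only the revisions flushed at the end have seen fewer, which is why a larger window gives more complete persistence data at the cost of memory.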
mwpersistence persistence2stats
$ mwpersistence persistence2stats -h
Generates revision-level statistics from a sequence of revision documents
annotated with token persistence information.
Usage:
persistence2stats (-h | --help)
persistence2stats [<input-file>...] [--min-persisted=<num>]
[--min-visible=<days>] [--include=<regex>]
[--exclude=<regex>] [--keep-tokens] [--threads=<num>]
[--output=<path>] [--compress=<type>] [--verbose]
[--debug]
Options:
-h --help Print this documentation
<input-file> The path to a file containing persistence data.
[default: <stdin>]
--min-persisted=<num> The minimum number of revisions a token must
survive before being considered "persisted"
[default: 5]
--min-visible=<days> The minimum amount of time a token must survive
before being considered "persisted" (in days)
[default: 14]
--include=<regex> A regex matching tokens to include (case
insensitive) [default: <all>]
--exclude=<regex> A regex matching tokens to exclude (case
insensitive) [default: <none>]
--keep-tokens Do not drop 'tokens' field data from the JSON
document.
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> Write output to a directory with one output file
per input path. [default: <stdout>]
--compress=<type> If set, output written to the output-dir will be
compressed in this format. [default: bz2]
--verbose Print out progress information
--debug Print debug logging to stderr.
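To make the --min-persisted, --min-visible, --include and --exclude options concrete, here is a minimal Python sketch of how such thresholds and filters could be applied to tokens. The field names (`text`, `revisions_survived`, `seconds_visible`), both helper functions, and the rule that meeting either threshold is enough are assumptions; the help text does not say how the two thresholds combine or whether the patterns are matched or searched.

```python
import re

SECONDS_PER_DAY = 24 * 60 * 60


def is_counted(text, include=None, exclude=None):
    """Apply case-insensitive include/exclude patterns to a token's text."""
    if include is not None and not re.search(include, text, re.IGNORECASE):
        return False
    if exclude is not None and re.search(exclude, text, re.IGNORECASE):
        return False
    return True


def is_persisted(revisions_survived, seconds_visible,
                 min_persisted=5, min_visible_days=14):
    """Assumption: a token counts as persisted if EITHER threshold is met."""
    return (revisions_survived >= min_persisted
            or seconds_visible >= min_visible_days * SECONDS_PER_DAY)


# Hypothetical token records; the real documents may use other field names.
tokens = [
    {"text": "Persistence", "revisions_survived": 7, "seconds_visible": 3600},
    {"text": "typo", "revisions_survived": 1,
     "seconds_visible": 20 * SECONDS_PER_DAY},
    {"text": "vandalism", "revisions_survived": 0, "seconds_visible": 60},
    {"text": "[[", "revisions_survived": 9,
     "seconds_visible": 9 * SECONDS_PER_DAY},
]

# Keep only tokens containing a word character, then apply the thresholds.
counted = [t for t in tokens if is_counted(t["text"], include=r"\w+")]
persisted = [t["text"] for t in counted
             if is_persisted(t["revisions_survived"], t["seconds_visible"])]
print(persisted)
```

In this sketch the markup token `[[` is dropped by the include pattern before the thresholds are considered, so it never contributes to the statistics even though it survived many revisions.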
mwpersistence dump2stats
$ mwpersistence dump2stats -h
Full pipeline from MediaWiki XML dumps to content persistence statistics.
Usage:
dump2stats (-h|--help)
dump2stats [<input-file>...] --config=<path> --sunset=<date>
[--namespaces=<ids>] [--timeout=<secs>]
[--window=<revs>] [--revert-radius=<revs>]
[--min-persisted=<num>] [--min-visible=<days>]
[--include=<regex>] [--exclude=<regex>]
[--keep-text] [--keep-diff] [--keep-tokens]
[--threads=<num>] [--output=<path>] [--compress=<type>]
[--verbose] [--debug]
Options:
-h|--help Print this documentation
<input-file> The path to a MediaWiki XML Dump file
[default: <stdin>]
--config=<path> The path to a deltas DiffEngine configuration
--namespaces=<ids> A comma separated list of namespace IDs to be
considered [default: <all>]
--timeout=<secs> The maximum number of seconds that a diff will
be allowed to run before being stopped
[default: 10]
--sunset=<date> The date of the database dump we are generating
from. This is used to apply a 'time visible'
statistic. Expects "%Y-%m-%dT%H:%M:%SZ".
[default: <now>]
--window=<revs> The size of the window of revisions from which
persistence data will be generated.
[default: 50]
--revert-radius=<revs> The number of revisions back that a revert can
reference. [default: 15]
--min-persisted=<num> The minimum number of revisions a token must
survive before being considered "persisted"
[default: 5]
--min-visible=<days> The minimum amount of time a token must survive
before being considered "persisted" (in days)
[default: 14]
--include=<regex> A regex matching tokens to include
[default: <all>]
--exclude=<regex> A regex matching tokens to exclude
[default: <none>]
--keep-text If set, the 'text' field will be populated in
the output JSON.
--keep-diff If set, the 'diff' field will be populated in
the output JSON.
--keep-tokens If set, the 'tokens' field will be populated in
the output JSON.
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> Write output to a directory with one output
file per input path. [default: <stdout>]
--compress=<type> If set, output written to the output-dir will
be compressed in this format. [default: bz2]
--verbose Print progress information to stderr.
--debug Print debug logging to stderr.
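The --sunset and --namespaces values follow simple textual formats, which the Python sketch below parses. The timestamp format string is taken from the help text; everything else is an assumption for illustration, including the helper names and the idea that a token still present at sunset is treated as visible until the sunset date.

```python
from datetime import datetime, timezone

SUNSET_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # format quoted in the help text


def parse_timestamp(value):
    """Parse a timestamp such as '2015-06-01T00:00:00Z' as UTC."""
    return datetime.strptime(value, SUNSET_FORMAT).replace(tzinfo=timezone.utc)


def seconds_visible(added, removed, sunset):
    """Seconds a token was visible, capped at the sunset date.

    Assumption: a token that was never removed counts as visible until sunset.
    """
    end = sunset if removed is None else min(removed, sunset)
    return (end - added).total_seconds()


def parse_namespaces(value):
    """Turn a comma-separated list such as '0,1,10' into a set of ints."""
    return {int(ns) for ns in value.split(",")}


sunset = parse_timestamp("2015-06-01T00:00:00Z")
added = parse_timestamp("2015-05-01T00:00:00Z")
removed = parse_timestamp("2015-05-02T12:00:00Z")

never_removed = seconds_visible(added, None, sunset)
removed_early = seconds_visible(added, removed, sunset)
namespaces = parse_namespaces("0,1,10")
print(never_removed, removed_early, sorted(namespaces))
```

Under these assumptions, choosing a sunset later than the dump date would overstate how long surviving tokens were visible, which is one reason to pass the dump date explicitly.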
mwpersistence revdocs2stats
$ mwpersistence revdocs2stats -h
Full pipeline from JSON revision documents to content persistence
statistics.
Usage:
revdocs2stats (-h|--help)
revdocs2stats [<input-file>...] --config=<path> --sunset=<date>
[--namespaces=<ids>] [--timeout=<secs>]
[--window=<revs>] [--revert-radius=<revs>]
[--min-persisted=<num>] [--min-visible=<days>]
[--include=<regex>] [--exclude=<regex>]
[--keep-text] [--keep-diff] [--keep-tokens]
[--threads=<num>] [--output=<path>] [--compress=<type>]
[--verbose] [--debug]
Options:
-h|--help Print this documentation
<input-file> The path to a file of page-partitioned JSON
revision documents. [default: <stdin>]
--config=<path> The path to a deltas DiffEngine configuration
--namespaces=<ids> A comma separated list of namespace IDs to be
considered [default: <all>]
--timeout=<secs> The maximum number of seconds that a diff will
be allowed to run before being stopped
[default: 10]
--sunset=<date> The date of the database dump we are generating
from. This is used to apply a 'time visible'
statistic. Expects "%Y-%m-%dT%H:%M:%SZ".
[default: <now>]
--window=<revs> The size of the window of revisions from which
persistence data will be generated.
[default: 50]
--revert-radius=<revs> The number of revisions back that a revert can
reference. [default: 15]
--min-persisted=<num> The minimum number of revisions a token must
survive before being considered "persisted"
[default: 5]
--min-visible=<days> The minimum amount of time a token must survive
before being considered "persisted" (in days)
[default: 14]
--include=<regex> A regex matching tokens to include
[default: <all>]
--exclude=<regex> A regex matching tokens to exclude
[default: <none>]
--keep-text If set, the 'text' field will be populated in
the output JSON.
--keep-diff If set, the 'diff' field will be populated in
the output JSON.
--keep-tokens If set, the 'tokens' field will be populated in
the output JSON.
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> Write output to a directory with one output
file per input path. [default: <stdout>]
--compress=<type> If set, output written to the output-dir will
be compressed in this format. [default: bz2]
--verbose Print progress information to stderr.
--debug Print debug logging to stderr.
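"Page-partitioned" input means that all revisions of one page arrive together. The Python sketch below shows one way to read such a stream and group it by page. It assumes one JSON document per line and a `page.id` field on each document; both are assumptions about the input layout, and `read_pages` is a hypothetical helper rather than part of mwpersistence.

```python
import io
import json
from itertools import groupby


def read_pages(lines):
    """Group a page-partitioned stream of JSON revision documents by page.

    Assumes one JSON document per line, each with doc['page']['id'].
    """
    docs = (json.loads(line) for line in lines if line.strip())
    for page_id, revisions in groupby(docs, key=lambda doc: doc["page"]["id"]):
        yield page_id, list(revisions)


# A tiny in-memory stand-in for an input file.
sample = io.StringIO(
    '{"id": 10, "timestamp": "2015-01-01T00:00:00Z", "page": {"id": 1}}\n'
    '{"id": 11, "timestamp": "2015-01-02T00:00:00Z", "page": {"id": 1}}\n'
    '{"id": 20, "timestamp": "2015-01-01T12:00:00Z", "page": {"id": 2}}\n'
)

summary = [(page_id, [rev["id"] for rev in revisions])
           for page_id, revisions in read_pages(sample)]
print(summary)
```

Because `groupby` only merges adjacent items, this approach works only when the input really is partitioned by page. If revisions of the same page were interleaved with others, the page would be split into several groups.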