This module implements a set of utilities for extracting labeling events, text, and features from the command line. When the wikiclass Python package is installed, a wikiclass utility should be available from the command line. Run wikiclass -h for more information:
$ wikiclass extract_features -h
Extracts features from a labeling doc containing text and a label and writes a
TSV of <feature>[TAB]<feature>[TAB]...<label> that is compatible with
`revscoring`'s train_test utility.
Input: { ... "label": ..., "text": ..., ... }
Output: <feature>[TAB]<feature>[TAB]...<label>
Usage:
extract_features -h | --help
extract_features <features> [--labelings=<path>]
[--value-labels=<path>]
[--verbose]
Options:
-h --help Print this documentation
<features> Classpath to a list/tuple of features
--language=<classpath> Classpath to a Language
--labelings=<path> Path to a file containing labeling docs
[default: <stdin>]
--value-labels=<path> Path to a file to write feature value-labels to
[default: <stdout>]
--verbose Print logging information
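The input and output formats described above can be illustrated with a minimal Python sketch. It is not wikiclass's implementation: the two feature functions (`char_count`, `heading_count`) are hypothetical stand-ins for the feature list that `<features>` would point to, and the one-JSON-document-per-line input layout is an assumption.

```python
import io
import json


def char_count(text):
    """Hypothetical feature: number of characters in the text."""
    return len(text)


def heading_count(text):
    """Hypothetical feature: number of lines that start with '=='."""
    return sum(1 for line in text.splitlines() if line.startswith("=="))


# Stand-in for the list/tuple of features that <features> would name.
FEATURES = [char_count, heading_count]


def extract_rows(labelings, output):
    """Read one JSON labeling doc per line; write one TSV row per doc."""
    for line in labelings:
        if not line.strip():
            continue  # skip blank lines
        doc = json.loads(line)
        values = [str(feature(doc["text"])) for feature in FEATURES]
        # Feature values first, label last, all separated by tabs.
        output.write("\t".join(values + [str(doc["label"])]) + "\n")


# Assumed input layout: one labeling doc per line.
labelings = io.StringIO(
    json.dumps({"label": "stub", "text": "Short."}) + "\n"
    + json.dumps({"label": "b", "text": "== History ==\nMore text."}) + "\n"
)
output = io.StringIO()
extract_rows(labelings, output)
print(output.getvalue(), end="")
```

Each output row carries the feature values in the order of the feature list, with the label in the last column.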
$ wikiclass extract_labelings -h
Extracts labels from an XML dump and writes out labeled observations for
each change in assessment class. Will match extraction method to the dump.
Usage:
extract_labelings <dump-file>... [--extractor=<name>] [--threads=<num>]
[--output=<path>] [--verbose]
extract_labelings -h | --help
Options:
-h --help Show this screen.
<dump-file> An XML dump file to process
--extractor=<name> The dbname of the wiki extractor to use (e.g. 'enwiki')
[default: <match>]
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--output=<path> The path to a file to dump observations to
[default: <stdout>]
--verbose Prints dots to <stderr>
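The idea of writing one observation for "each change in assessment class" can be sketched in Python as follows. This is an illustration only: the `(rev_id, assessment_class)` input pairs and the output keys are hypothetical, and the real extractors work on wiki-specific markup read from the XML dump.

```python
def label_changes(revisions):
    """Yield an observation each time the assessment class changes.

    `revisions` is an iterable of (rev_id, assessment_class) pairs in
    chronological order; assessment_class is None when the revision
    carries no assessment. Input shape and output keys are hypothetical.
    """
    last_class = None
    for rev_id, assessment_class in revisions:
        # Emit only when a class is present and differs from the last one seen.
        if assessment_class is not None and assessment_class != last_class:
            yield {"rev_id": rev_id, "label": assessment_class}
            last_class = assessment_class


history = [(101, None), (102, "stub"), (103, "stub"), (104, "start"), (105, "c")]
observations = list(label_changes(history))
for observation in observations:
    print(observation)
```

In this sketch, revisions that repeat the previous class (103 above) or that have no assessment (101) produce no observation.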
$ wikiclass extract_text -h
Extracts text & metadata for labelings using XML dumps.
Usage:
extract_text <dump-file>... [--labelings=<path>] [--output=<path>]
[--threads=<num>] [--verbose]
extract_text -h | --help
Options:
-h --help Show this screen.
<dump-file> An XML dump file to process
--labelings=<path> The path to a file containing labeling events.
[default: <stdin>]
--output=<path> The path to a file to dump observations to
[default: <stdout>]
--threads=<num> If a collection of files is provided, how many
processor threads should be prepared?
[default: <cpu_count>]
--verbose Prints dots to <stderr>
$ wikiclass fetch_text -h
Fetches text & metadata for labelings using a MediaWiki API.
Usage:
fetch_text --api=<url> [--labelings=<path>] [--output=<path>] [--verbose]
fetch_text -h | --help
Options:
-h --help Show this documentation.
--api-host=<url> The url of a MediaWiki API e.g.
"https://en.wikipedia.org/w/api.php"
--labelings=<path> Path to a file containing observations with extracted
labels. [default: <stdin>]
--output=<path> Path to a file to write new observations
(with text) out to. [default: <stdout>]
--verbose Prints dots and stuff to stderr
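Fetching revision text from a MediaWiki API generally comes down to an `action=query` request with `prop=revisions`. The Python sketch below only builds such a request URL; it does not show how fetch_text itself talks to the API. The helper name `build_revision_query` and the idea that each labeling carries a revision ID are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_revision_query(api_url, rev_ids):
    """Build a MediaWiki API URL that requests content for the given revisions.

    Hypothetical helper: the parameters are standard MediaWiki query
    parameters, but wikiclass may issue its requests differently.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "revids": "|".join(str(rev_id) for rev_id in rev_ids),
        "rvprop": "ids|content",
        "format": "json",
    }
    # urlencode percent-encodes the "|" separators as %7C.
    return api_url + "?" + urlencode(params)


url = build_revision_query("https://en.wikipedia.org/w/api.php", [123, 456])
print(url)
```

Batching several revision IDs into one request, as above, keeps the number of API calls low when many labelings need text.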
$ wikiclass score -h
Applies a scoring model to a chunk of text.
Usage:
score <model-file> [<text>]
score -h | --help
Options:
-h --help Prints this documentation
<model-file> The path to a scorer_model file to use
<text> The path to a file containing text to score
[default: <stdin>]
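The `[default: <stdin>]` convention used by score and the other utilities can be sketched in Python as below. Everything here is hypothetical: `read_text` and `score` are illustrative helpers, and `toy_model` is a stand-in callable, since the format of a scorer_model file is not described above.

```python
import io
import sys


def read_text(path=None, stdin=sys.stdin):
    """Return text from `path`, or from `stdin` when no path is given."""
    if path is None:
        return stdin.read()
    with open(path, encoding="utf-8") as f:
        return f.read()


def score(model, text):
    """Apply a model (here: any callable) to a chunk of text."""
    return model(text)


def toy_model(text):
    """Hypothetical stand-in; a real scorer_model file would be loaded instead."""
    return {"length": len(text)}


# Simulate piped input by passing an in-memory stream as stdin.
result = score(toy_model, read_text(stdin=io.StringIO("Some article text")))
print(result)
```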