Utilities

This module implements a set of utilities for extracting labeling events, text, and features from the command line. When the wikiclass Python package is installed, a wikiclass utility should be available from the command line. Run wikiclass -h for more information:

wikiclass extract_features

$ wikiclass extract_features -h

Extracts features from a labeling doc containing text and a label and writes a
TSV of <feature>[TAB]<feature>[TAB]...<label> that is compatible with
`revscoring`'s train_test utility.

Input: { ... "label": ..., "text": ..., ... }

Output: <feature>[TAB]<feature>[TAB]...<label>


Usage:
    extract_features -h | --help
    extract_features <features> [--labelings=<path>]
                                [--value-labels=<path>]
                                [--verbose]

Options:
    -h --help                Print this documentation
    <features>               Classpath to a list/tuple of features
    --language=<classpath>   Classpath to a Language
    --labelings=<path>       Path to a file containing labeling docs
                             [default: <stdin>]
    --value-labels=<path>    Path to a file to write feature value-labels to
                             [default: <stdout>]
    --verbose                Print logging information
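
The output layout described above can be read back with a few lines of Python. This is a minimal sketch, not part of wikiclass: the function name parse_value_label_line and the sample row are made up for illustration, and it assumes only what the help text states, namely that every column except the last is a feature value and the last column is the label. Values are left as strings because the help text does not say how they are typed.

```python
def parse_value_label_line(line):
    """Split one TSV row into (feature_values, label).

    Assumes the layout <feature>[TAB]<feature>[TAB]...<label>:
    every column but the last is a feature value, the last is the label.
    """
    *values, label = line.rstrip("\n").split("\t")
    return values, label


# Made-up example row; real rows come from extract_features.
row = "12.0\t3\tTrue\tstub\n"
features, label = parse_value_label_line(row)
print(features, label)
```

Keeping the parsing in one small function makes it easy to add type conversion later, once the value types of the chosen feature list are known.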

wikiclass extract_labelings

$ wikiclass extract_labelings -h

Extracts labels from an XML dump and writes out labeled observations for
each change in assessment class.  Will match extraction method to the dump.

Usage:
    extract_labelings <dump-file>... [--extractor=<name>] [--threads=<num>]
                                     [--output=<path>] [--verbose]
    extract_labelings -h | --help

Options:
    -h --help           Show this screen.
    <dump-file>         An XML dump file to process
    --extractor=<name>  The dbname of the wiki extractor to use (e.g. 'enwiki')
                        [default: <match>]
    --threads=<num>     If a collection of files is provided, how many
                        processor threads should be prepared?
                        [default: <cpu_count>]
    --output=<path>     The path to a file to dump observations to
                        [default: <stdout>]
    --verbose           Prints dots to <stderr>

wikiclass extract_text

$ wikiclass extract_text -h

Extracts text & metadata for labelings using XML dumps.

Usage:
    extract_text <dump-file>... [--labelings=<path>] [--output=<path>]
                                [--threads=<num>] [--verbose]
    extract_text -h | --help

Options:
    -h --help           Show this screen.
    <dump-file>         An XML dump file to process
    --labelings=<path>  The path to a file containing labeling events.
                        [default: <stdin>]
    --output=<path>     The path to a file to dump observations to
                        [default: <stdout>]
    --threads=<num>     If a collection of files is provided, how many
                        processor threads should be prepared?
                        [default: <cpu_count>]
    --verbose           Prints dots to <stderr>
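
Because these utilities default to <stdin> and <stdout>, they can plausibly be chained with pipes. The sketch below only builds the shell command line as a string and does not run anything. It assumes that the output of extract_labelings is accepted as labelings by extract_text, and that the output of extract_text is accepted by extract_features; the help text suggests this but does not state it outright. The helper build_pipeline, the dump file name, and the features classpath are hypothetical.

```python
import shlex


def build_pipeline(dump_files, features_classpath, threads=None):
    """Return a shell pipeline string chaining the three extract utilities.

    Only flags shown in the help text above are used. Nothing is executed.
    """
    extract_labelings = ["wikiclass", "extract_labelings", *dump_files]
    extract_text = ["wikiclass", "extract_text", *dump_files]
    if threads is not None:
        # Both dump-processing utilities document a --threads option.
        extract_labelings.append(f"--threads={threads}")
        extract_text.append(f"--threads={threads}")
    extract_features = ["wikiclass", "extract_features", features_classpath]

    stages = [extract_labelings, extract_text, extract_features]
    # shlex.join quotes each argument so paths with spaces stay intact.
    return " | ".join(shlex.join(stage) for stage in stages)


# Hypothetical dump file and features classpath, for illustration only.
print(build_pipeline(["enwiki-dump.xml"], "my_package.features", threads=2))
```

Building the command as argument lists first, and quoting only at the end, avoids hand-escaping file names that contain spaces or shell metacharacters.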

wikiclass fetch_text

$ wikiclass fetch_text -h

Fetches text & metadata for labelings using a MediaWiki API.

Usage:
    fetch_text --api=<url> [--labelings=<path>] [--output=<path>] [--verbose]
    fetch_text -h | --help

Options:
    -h --help           Show this documentation.
    --api-host=<url>    The url of a MediaWiki API e.g.
                        "https://en.wikipedia.org/w/api.php"
    --labelings=<path>  Path to a file containing observations with extracted
                        labels. [default: <stdin>]
    --output=<path>     Path to a file to write new observations
                        (with text) out to. [default: <stdout>]
    --verbose           Prints dots and stuff to stderr

wikiclass score

$ wikiclass score -h

Applies a scoring model to a chunk of text.

Usage:
    score <model-file> [<text>]
    score -h | --help

Options:
    -h --help     Prints this documentation
    <model-file>  The path to a scorer_model file to use
    <text>        The path to a file containing text to score
                  [default: <stdin>]
