Extractors

This module provides a set of wikiclass.Extractor objects, each of which implements a strategy for identifying historical article quality labeling events. These labelings are used as training data for building prediction models.

Supported wikis

wikiclass.extractors.enwiki

This extractor looks for instances of templates that contain “class=<some class>” on article talk pages (namespace = 1) and parses the template name to obtain a project.
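As a minimal sketch of this rule, the function below maps one already-parsed template to a (project, label) pair. The function name, the (name, params) representation of a template, and the "WikiProject" prefix normalization are all assumptions made for illustration; they are not taken from the wikiclass source.

```python
import re


def from_class_template(name, params):
    """Return (project, label) for a template with a class parameter, else None.

    `name` is the template name and `params` maps parameter names to values;
    both are hypothetical stand-ins for a parsed template object.
    """
    label = params.get("class", "").strip()
    if not label:
        return None
    # Assumption: the project is the template name without a leading
    # "WikiProject", lowercased. The real normalization may differ.
    project = re.sub(r"^wikiproject\s*", "", name.strip(), flags=re.IGNORECASE)
    return project.lower(), label


pair = from_class_template("WikiProject Physics", {"class": "B", "importance": "high"})
skipped = from_class_template("Talk header", {})
```

Here `pair` holds the project and label for the rated template, while `skipped` is None because the second template carries no class parameter.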

wikiclass.extractors.frwiki

This extractor looks for instances of the “wikiprojet” template on article talk pages (namespace = 1) that have a parameter called “avancement”. Every project is hard-coded to “wikiprojet”.
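The “wikiprojet” rule can be sketched in the same way. The function name and the (name, params) template representation are hypothetical; only the template name, the “avancement” parameter, and the hard-coded project come from the description above.

```python
def from_wikiprojet_template(name, params):
    """Return ("wikiprojet", label) for a rated wikiprojet template, else None."""
    # Only the "wikiprojet" template is considered (case-insensitive here,
    # which is an assumption of this sketch).
    if name.strip().lower() != "wikiprojet":
        return None
    label = params.get("avancement", "").strip()
    if not label:
        return None
    # The project is always the same hard-coded string.
    return "wikiprojet", label


pair = from_wikiprojet_template("Wikiprojet", {"avancement": "B"})
other = from_wikiprojet_template("Infobox", {"avancement": "B"})
```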

Base classes

class wikiclass.Extractor(name, doc, namespaces)

Implements a labeling event extraction strategy.

Parameters:
name : str

A name for the extraction strategy

doc : str

Documentation describing the extraction strategy

namespaces : iterable(int)

A set of namespaces that will be considered when performing an extraction

extract(page, verbose=False)

Processes an mw.xml_dump.Page and returns a generator of first-observations of a project/label pair.

Parameters:
page : mw.xml_dump.Page

Page to process

verbose : bool

If True, print dots to stderr
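The "first-observation" behaviour can be illustrated with plain data structures. This is a sketch under stated assumptions, not the library's implementation: `Page`, `Revision`, `first_observations`, and `toy_labels` are hypothetical stand-ins, and the shape of the yielded tuples is chosen for the example only.

```python
from collections import namedtuple

# Hypothetical stand-ins for a dump page and its revisions.
Revision = namedtuple("Revision", ["timestamp", "text"])
Page = namedtuple("Page", ["namespace", "revisions"])


def first_observations(page, namespaces, extract_labels):
    """Yield (timestamp, project, label) the first time each pair appears."""
    # Pages outside the configured namespaces are skipped entirely.
    if page.namespace not in namespaces:
        return
    seen = set()
    # Revisions are assumed to be in chronological order.
    for revision in page.revisions:
        for pair in extract_labels(revision.text or ""):
            if pair not in seen:
                seen.add(pair)
                yield revision.timestamp, pair[0], pair[1]


def toy_labels(text):
    # Toy stand-in: each whitespace-separated "project=label" token is one pair.
    return [tuple(token.split("=", 1)) for token in text.split() if "=" in token]


page = Page(namespace=1, revisions=[
    Revision("2010-01-01", "physics=stub"),
    Revision("2011-01-01", "physics=stub math=c"),
    Revision("2012-01-01", "physics=b math=c"),
])
events = list(first_observations(page, {1}, toy_labels))
```

Each pair is reported once, at the revision where it first appears, so a label that persists unchanged across later revisions produces no further events.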

class wikiclass.TemplateExtractor(*args, from_template, **kwargs)

Implements a template-based extraction strategy built on a from_template function that takes a template and returns a (project, label) pair.

Parameters:
from_template : func

A function that takes a template and returns a (project, label) pair

extract(page, verbose=False)

Processes an mw.xml_dump.Page and returns a generator of first-observations of a project/label pair.

Parameters:
page : mw.xml_dump.Page

Page to process

verbose : bool

If True, print dots to stderr

extract_labels(text)

Extracts a set of labels for a version of text by parsing templates.

Parameters:
text : str

Wikitext markup to extract labels from

Returns:

An iterator over (project, label) pairs
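The interplay between template parsing and from_template can be sketched as follows. This is an illustration under stated assumptions: it finds templates with a simple regular expression that only handles non-nested templates, whereas a real implementation would rely on a full wikitext parser. The helper names, the (name, params) calling convention, and the use of None to skip a template are all hypothetical.

```python
import re

# Matches innermost {{...}} templates only; nested templates are not handled.
TEMPLATE_RE = re.compile(r"\{\{([^{}]+)\}\}")


def parse_template(body):
    """Split a template body into (name, params); positional arguments are ignored."""
    name, *parts = body.split("|")
    params = {}
    for part in parts:
        if "=" in part:
            key, value = part.split("=", 1)
            params[key.strip().lower()] = value.strip()
    return name.strip(), params


def extract_labels_sketch(text, from_template):
    """Yield the (project, label) pair of every template that from_template accepts."""
    for match in TEMPLATE_RE.finditer(text):
        pair = from_template(*parse_template(match.group(1)))
        if pair is not None:
            yield pair


def class_param(name, params):
    # Example callback: use the lowercased template name as the project.
    if "class" in params:
        return name.lower(), params["class"]
    return None


text = "{{Talk header}}{{WikiProject Physics|class=B|importance=high}}"
labels = list(extract_labels_sketch(text, class_param))
```

Keeping the wiki-specific rule in a single callback means the scanning logic can be shared across wikis, and only the mapping from a template to a (project, label) pair has to change.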
