Revision Scoring

This library contains a set of facilities for constructing and applying ScorerModels to MediaWiki revisions. It eases the training and testing of machine-learning-based scoring strategies.

Key Features

Scorer Models

ScorerModels are the core of the revscoring system. They provide a simple interface with complex internals. Most commonly, a revscoring.scorer_models.MLScorerModel (Machine Learned) is train()'d and test()'d on labeled data to provide a basis for scoring. We currently support Support Vector Classifier, Random Forest, and Naive Bayes type models. See revscoring.scorer_models.

Example:

>>> import mwapi
>>> from revscoring import ScorerModel
>>> from revscoring.extractors import api
>>>
>>> with open("models/enwiki.damaging.linear_svc.model") as f:
...     model = ScorerModel.load(f)
...
>>> extractor = api.Extractor(mwapi.Session(host="https://en.wikipedia.org",
...                                         user_agent="revscoring demo"))
>>> values = extractor.extract(123456789, model.features)
>>> print(model.score(values))
{'prediction': True,
 'probability': {False: 0.4694409344514984,
                 True: 0.5305590655485017}}
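
The train/test/score cycle described above can be illustrated without any revscoring dependency. The following is a minimal sketch only: ToyThresholdModel, its single numeric feature, and the sample data are hypothetical and are not part of revscoring. It assumes the positive class tends to have larger feature values.

```python
class ToyThresholdModel:
    """Hypothetical stand-in for a scorer model; NOT a revscoring class."""

    def __init__(self):
        self.threshold = None

    def train(self, labeled_values):
        # labeled_values: list of (feature_value, label) pairs, label is a bool.
        pos = [v for v, label in labeled_values if label]
        neg = [v for v, label in labeled_values if not label]
        # Place the decision threshold midway between the two class means.
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def score(self, value):
        # Predict True when the value lies above the learned threshold.
        return {"prediction": value > self.threshold}

    def test(self, labeled_values):
        # Fraction of held-out observations that are predicted correctly.
        hits = [self.score(v)["prediction"] == label
                for v, label in labeled_values]
        return sum(hits) / len(hits)


# Made-up observations, for illustration only.
train_set = [(1.0, False), (2.0, False), (8.0, True), (9.0, True)]
test_set = [(0.5, False), (7.0, True), (6.0, False)]

model = ToyThresholdModel()
model.train(train_set)
accuracy = model.test(test_set)
print(model.score(7.0), accuracy)
```

A real MLScorerModel works with a vector of extracted feature values and also reports class probabilities, as in the example output above; the sketch keeps only the shape of the workflow.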

Feature extraction

Revscoring provides a dependency-injection-based feature extraction framework that allows new features to be built on top of old ones. This provides a powerful means of expressing new features and a simple way to address efficiency concerns. See revscoring.features, revscoring.datasources, and revscoring.extractors.

Example:

>>> from mwapi import Session
>>> from revscoring.extractors import api
>>> from revscoring.features import temporal, wikitext
>>>
>>> session = Session("https://en.wikipedia.org", user_agent="test")
>>> api_extractor = api.Extractor(session)
>>>
>>> features = [temporal.revision.day_of_week,
...             temporal.revision.hour_of_day,
...             wikitext.revision.parent.headings_by_level(2)]
>>>
>>> values = api_extractor.extract(624577024, features)
>>> for feature, value in zip(features, values):
...     print("     {0}: {1}".format(feature, repr(value)))
...
    <temporal.revision.day_of_week>: 6
    <temporal.revision.hour_of_day>: 19
    <wikitext.revision.parent.headings_by_level(2)>: 5
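
To make the dependency-injection idea concrete, here is a minimal sketch of how features can be built on shared datasources and resolved through a cache. The Dependent class and the toy_solve function are hypothetical illustrations written for this page; they are not revscoring's implementation.

```python
class Dependent:
    """A named value computed from other Dependents (hypothetical toy class)."""

    def __init__(self, name, process=None, depends_on=()):
        self.name = name
        self.process = process
        self.depends_on = list(depends_on)

    def __repr__(self):
        return "<{0}>".format(self.name)


def toy_solve(dependent, cache):
    """Resolve `dependent`, reusing any value already stored in `cache`."""
    if dependent in cache:
        return cache[dependent]
    if dependent.process is None:
        raise RuntimeError("No value injected for {0!r}".format(dependent))
    # Resolve dependencies first, then compute and remember this value.
    args = [toy_solve(d, cache) for d in dependent.depends_on]
    cache[dependent] = dependent.process(*args)
    return cache[dependent]


calls = []  # records each time the shared datasource is actually computed


def split_words(text):
    calls.append("words")
    return text.split()


# One root datasource, one derived datasource, two features on top of it.
text = Dependent("revision.text")
words = Dependent("revision.words", split_words, [text])
word_count = Dependent("revision.word_count", len, [words])
longest_word = Dependent("revision.longest_word",
                         lambda ws: max(len(w) for w in ws), [words])

# Inject the root value through the cache, then solve both features.
cache = {text: "I think it is stupid."}
values = [toy_solve(f, cache) for f in (word_count, longest_word)]
print(values, calls)
```

Both features depend on the same words datasource, but the cache ensures it is computed only once. This is the sense in which building features on shared dependencies addresses efficiency concerns.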

Language support

Many features require language-specific utilities. To support this, we provide a collection of language feature sets that work like other features except that they are language-specific. Language-specific feature sets are available for the following languages: arabic, dutch, english, estonian, french, german, hebrew, indonesian, italian, persian, portuguese, spanish, turkish, ukrainian, and vietnamese. See revscoring.languages.

Example:

>>> from revscoring.datasources.revision_oriented import revision
>>> from revscoring.dependencies import solve
>>> from revscoring.languages import english, spanish
>>>
>>> features = [english.informals.revision.matches,
...             spanish.informals.revision.matches]
>>> values = solve(features, cache={revision.text: "I think it is stupid."})
>>>
>>> for feature, value in zip(features, values):
...     print("     {0}: {1}".format(feature, repr(value)))
...
    <len(<english.informals.revision.matches>)>: 2
    <len(<spanish.informals.revision.matches>)>: 0
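
As a rough illustration of why such features must be language-specific, the sketch below counts matches against per-language word lists. The word lists, the INFORMALS table, and count_matches are hypothetical and far smaller than anything a real language module would use, so the counts are not comparable with revscoring's output.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
INFORMALS = {
    "english": ["stupid", "lol", "haha"],
    "spanish": ["jaja", "tonto"],
}


def count_matches(language, text):
    """Count whole-word, case-insensitive matches for one language's list."""
    alternatives = "|".join(re.escape(w) for w in INFORMALS[language])
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return len(pattern.findall(text))


sample = "lol that is so stupid, jaja"
english_count = count_matches("english", sample)
spanish_count = count_matches("spanish", sample)
print(english_count, spanish_count)
```

The same text yields different counts depending on which language's list is applied, which is why each language ships its own feature set.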
