mw.lib.persistence – tracking content between revisions

class mw.lib.persistence.State(tokenize=wikitext_split, diff=sequence_matcher, revert_radius=15, revert_detector=None)

Represents the state of word persistence in a page. See https://meta.wikimedia.org/wiki/Research:Content_persistence

Parameters:
tokenize : function(str) -> list(str)

A tokenizing function

diff : function(list(str), list(str)) -> list(ops)

A function that computes the difference between two token lists

revert_radius : int

A positive integer indicating the maximum revision distance that a revert can span

revert_detector : mw.lib.reverts.Detector

A revert detector to start processing with

Example:
>>> from pprint import pprint
>>> from mw.lib import persistence
>>>
>>> state = persistence.State()
>>>
>>> pprint(state.process("Apples are red.", revision=1))
([Token(text='Apples', revisions=[1]),
  Token(text=' ', revisions=[1]),
  Token(text='are', revisions=[1]),
  Token(text=' ', revisions=[1]),
  Token(text='red', revisions=[1]),
  Token(text='.', revisions=[1])],
 [Token(text='Apples', revisions=[1]),
  Token(text=' ', revisions=[1]),
  Token(text='are', revisions=[1]),
  Token(text=' ', revisions=[1]),
  Token(text='red', revisions=[1]),
  Token(text='.', revisions=[1])],
 [])
>>> pprint(state.process("Apples are blue.", revision=2))
([Token(text='Apples', revisions=[1, 2]),
  Token(text=' ', revisions=[1, 2]),
  Token(text='are', revisions=[1, 2]),
  Token(text=' ', revisions=[1, 2]),
  Token(text='blue', revisions=[2]),
  Token(text='.', revisions=[1, 2])],
 [Token(text='blue', revisions=[2])],
 [Token(text='red', revisions=[1])])
>>> pprint(state.process("Apples are red.", revision=3)) # A revert!
([Token(text='Apples', revisions=[1, 2, 3]),
  Token(text=' ', revisions=[1, 2, 3]),
  Token(text='are', revisions=[1, 2, 3]),
  Token(text=' ', revisions=[1, 2, 3]),
  Token(text='red', revisions=[1, 3]),
  Token(text='.', revisions=[1, 2, 3])],
 [],
 [])
process(text, revision=None, checksum=None)

Modifies the internal state based on a change to the content and returns the sets of tokens added and removed.

Parameters:
text : str

The text content of a revision

revision : mixed

Revision metadata

checksum : str

A checksum hash of the text content (will be generated if not provided)

Returns:

A tuple of three Tokens lists

current_tokens : Tokens

The sequence of Token for the processed revision

tokens_added : Tokens

The sequence of Token inserted by the processed revision

tokens_removed : Tokens

The sequence of Token removed by the processed revision
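The core bookkeeping that process() performs can be sketched in plain Python on top of difflib. The Token class and process function below are simplified stand-ins for illustration, not the library's implementation: the sketch splits on whitespace instead of using wikitext_split and ignores revert detection (revert_radius, revert_detector), which is why it cannot reproduce the revert behavior shown in the example above.

```python
import difflib


class Token:
    """Simplified stand-in: a chunk of text plus the revisions it survived."""

    def __init__(self, text, revisions=None):
        self.text = text
        self.revisions = revisions if revisions is not None else []

    def __repr__(self):
        return "Token(text=%r, revisions=%r)" % (self.text, self.revisions)


def process(previous_tokens, text, revision):
    """Diff the new text against the previous token list, persisting
    revision metadata on tokens that survive the change."""
    old_texts = [t.text for t in previous_tokens]
    new_texts = text.split()  # crude whitespace tokenizer for this sketch
    matcher = difflib.SequenceMatcher(None, old_texts, new_texts)

    current, added, removed = [], [], []
    for op, a1, a2, b1, b2 in matcher.get_opcodes():
        if op == "equal":
            for token in previous_tokens[a1:a2]:
                token.revisions.append(revision)  # token persisted
                current.append(token)
        else:
            removed.extend(previous_tokens[a1:a2])  # dropped by this revision
            for text_chunk in new_texts[b1:b2]:
                token = Token(text_chunk, [revision])  # introduced here
                current.append(token)
                added.append(token)
    return current, added, removed


tokens, added, removed = process([], "Apples are red.", revision=1)
tokens, added, removed = process(tokens, "Apples are blue.", revision=2)
print([t.text for t in added], [t.text for t in removed])
# → ['blue.'] ['red.']
```

Tokens that survive a revision accumulate its identifier in their revisions list, which is how per-word authorship and persistence statistics are derived.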

Tokenization

class mw.lib.persistence.Tokens(*args, **kwargs)

Represents a list of Token with some useful helper functions.

Example:
>>> from mw.lib.persistence import Token, Tokens
>>>
>>> tokens = Tokens()
>>> tokens.append(Token("foo"))
>>> tokens.extend([Token(" "), Token("bar")])
>>>
>>> tokens[0]
Token(text='foo', revisions=[])
>>>
>>> "".join(tokens.texts())
'foo bar'
class mw.lib.persistence.Token(text, revisions=None)

Represents a chunk of text and the revisions of a page that it survived.

revisions

The metadata for the revisions in which the token has appeared.

text

The text of the token.

mw.lib.persistence.tokenization.wikitext_split(text)

Performs the simplest possible split of Latin-character-based languages and wikitext.

Parameters:
text : str

Text to split.
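A tokenizer of this kind can be written as a single regular expression that keeps words, runs of whitespace, and individual punctuation characters as separate tokens, matching the token stream shown in the State example above. The pattern below is an illustrative sketch, not the library's exact expression:

```python
import re

# Match a run of word characters, a run of whitespace, or any single
# other character (punctuation). Illustrative only; not wikitext_split's
# actual pattern.
TOKEN_RE = re.compile(r"\w+|\s+|.", re.UNICODE)


def simple_split(text):
    """Split text into word, whitespace, and punctuation tokens."""
    return TOKEN_RE.findall(text)


print(simple_split("Apples are red."))
# → ['Apples', ' ', 'are', ' ', 'red', '.']
```

Keeping whitespace and punctuation as tokens lets the diff step attribute every character of the page, not just the words, to a revision.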

Difference

mw.lib.persistence.difference.sequence_matcher(old, new)

Generates a sequence of operations using difflib.SequenceMatcher.

Parameters:
old : list( hashable )

Old tokens

new : list( hashable )

New tokens

Returns:
Minimal operations needed to convert old to new
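A function with this shape can be built directly on difflib.SequenceMatcher from the standard library; its get_opcodes() output already has the (op, a1, a2, b1, b2) form that the apply() function below consumes. This is a sketch of the idea, not necessarily the library's exact source:

```python
import difflib


def sequence_matcher(old, new):
    """Return (op, a1, a2, b1, b2) opcodes describing how to turn old into new."""
    return difflib.SequenceMatcher(None, old, new).get_opcodes()


ops = sequence_matcher(["Apples", "are", "red"], ["Apples", "are", "blue"])
print(ops)
# → [('equal', 0, 2, 0, 2), ('replace', 2, 3, 2, 3)]
```

Each opcode says that old[a1:a2] corresponds to new[b1:b2] under the named operation ('equal', 'replace', 'insert', or 'delete').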
mw.lib.persistence.difference.apply(ops, old, new)

Applies operations (delta) to copy items from old to new.

Parameters:
ops : list((op, a1, a2, b1, b2))

Operations to perform

old : list( hashable )

Old tokens

new : list( hashable )

New tokens

Returns:

An iterator over elements matching new but copied from old
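The apply step can be sketched as a generator that copies matching segments from old and takes inserted or replaced segments from new. This mirrors the documented behavior (elements matching new but copied from old) but is not the library's source:

```python
import difflib


def apply(ops, old, new):
    """Yield the elements of new, copying objects from old wherever
    the opcodes say the segments match."""
    for op, a1, a2, b1, b2 in ops:
        if op == "equal":
            for item in old[a1:a2]:  # unchanged: keep the old objects
                yield item
        elif op in ("insert", "replace"):
            for item in new[b1:b2]:  # changed: take the new objects
                yield item
        # "delete" yields nothing: those old items are gone


old = ["Apples", "are", "red"]
new = ["Apples", "are", "blue"]
ops = difflib.SequenceMatcher(None, old, new).get_opcodes()
print(list(apply(ops, old, new)))
# → ['Apples', 'are', 'blue']
```

Copying the old objects for matching segments is what lets persistence tracking work: the surviving Token objects, with their accumulated revision lists, carry over into the new state rather than being rebuilt from scratch.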

Constants

mw.lib.persistence.defaults.TOKENIZE(text)

The standard tokenizing function.

mw.lib.persistence.defaults.DIFF(old, new)

The standard diff function.
