Processing functions

mwpersistence.utilities.diffs2persistence(rev_docs, window_size=50, revert_radius=15, sunset=None, verbose=False)

Processes a sorted and page-partitioned sequence of revision documents and adds a ‘persistence’ field to each one, containing statistics about how each token “added” in the revision persisted through future revisions.

Parameters:
rev_docs : iterable ( dict )

JSON documents of revision data containing a ‘diff’ field as generated by dump2diffs. It’s assumed that rev_docs are partitioned by page and otherwise in chronological order.

window_size : int

The size of the window of revisions from which persistence data will be generated.

revert_radius : int

The number of revisions back that a revert can reference.

sunset : mwtypes.Timestamp

The date of the database dump being processed. This is used to compute a ‘time visible’ statistic. If not set, now() will be assumed.

keep_diff : bool

If True, the ‘diff’ field is kept in the revision document after processing is complete rather than being dropped.

verbose : bool

Prints progress dots and other status output to stderr.

Returns:

A generator of rev_docs with a ‘persistence’ field containing statistics about individual tokens.
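
The ordering assumption on rev_docs can be checked before processing. The following is a minimal sketch and is not part of mwpersistence: the helper name check_page_partitioned is hypothetical, and it assumes that each document stores its page identifier under doc['page']['id'] and carries a 'timestamp' string in a uniform ISO 8601 format that sorts lexicographically. Neither field name is specified in the text above.

```python
def check_page_partitioned(rev_docs):
    """Yield rev_docs unchanged; raise ValueError if the ordering is violated.

    Hypothetical helper. Assumes doc['page']['id'] and a uniformly
    formatted ISO 8601 doc['timestamp'] string (assumed field names).
    """
    seen_pages = set()     # pages whose partition has already started
    current_page = None
    last_timestamp = None

    for doc in rev_docs:
        page_id = doc['page']['id']

        if page_id != current_page:
            # A page that shows up again after another page began
            # means the input is not partitioned by page.
            if page_id in seen_pages:
                raise ValueError(
                    "page {0} is split across partitions".format(page_id))
            seen_pages.add(page_id)
            current_page = page_id
            last_timestamp = None

        # Within one page, revisions must be in chronological order.
        if last_timestamp is not None and doc['timestamp'] < last_timestamp:
            raise ValueError(
                "page {0} is not in chronological order".format(page_id))

        last_timestamp = doc['timestamp']
        yield doc
```

Because the helper is itself a generator, it could wrap rev_docs before they are handed to a processing function without loading the whole sequence into memory.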

mwpersistence.utilities.persistence2stats(rev_docs, min_persisted=5, min_visible=1209600, include=None, exclude=None, verbose=False)

Processes a sorted and page-partitioned sequence of revision documents and adds statistics to the ‘persistence’ field of each one, summarizing how the tokens “added” in the revision persisted through future revisions.

Parameters:
rev_docs : iterable ( dict )

JSON documents of revision data containing a ‘persistence’ field as generated by diffs2persistence. It’s assumed that rev_docs are partitioned by page and otherwise in chronological order.

window_size : int

The size of the window of revisions from which persistence data will be generated.

min_persisted : int

The minimum number of future revisions that a token must persist through in order to be considered “persistent”.

min_visible : int

The minimum number of seconds that a token must be visible in order to be considered “persistent”.

include : func

A function that returns True when a token should be included in statistical processing.

exclude : str | re.SRE_Pattern

A function that returns True when a token should not be included in statistical processing. Takes precedence over ‘include’.

verbose : bool

Prints progress dots and other status output to stderr.

Returns:

A generator of rev_docs with a ‘persistence’ field containing statistics about individual tokens.
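
How the two thresholds and the filters could interact is sketched below. This is an illustration under stated assumptions, not the library's implementation: the names is_persistent and count_persistent, the token fields 'text', 'persisted' and 'seconds_visible', the choice to treat include and exclude as callables on the token text, and the rule that meeting either threshold is sufficient are all assumptions. The one grounded fact is the arithmetic: the default min_visible of 1209600 seconds equals 14 days.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MIN_VISIBLE_DEFAULT = 14 * SECONDS_PER_DAY  # 1209600, the documented default


def is_persistent(token_doc, min_persisted=5, min_visible=MIN_VISIBLE_DEFAULT):
    """Assumed rule: meeting either threshold alone is enough."""
    return (token_doc['persisted'] >= min_persisted or
            token_doc['seconds_visible'] >= min_visible)


def count_persistent(token_docs, include=None, exclude=None, **thresholds):
    """Count persistent tokens; 'exclude' takes precedence over 'include'."""
    count = 0
    for token_doc in token_docs:
        text = token_doc['text']
        # Exclusion is checked first so that it wins over inclusion.
        if exclude is not None and exclude(text):
            continue
        if include is not None and not include(text):
            continue
        if is_persistent(token_doc, **thresholds):
            count += 1
    return count


# Hypothetical token documents for illustration only.
tokens = [
    {'text': 'foo', 'persisted': 6, 'seconds_visible': 100},
    {'text': ' ', 'persisted': 10, 'seconds_visible': 2000000},
    {'text': 'bar', 'persisted': 1, 'seconds_visible': 1209600},
    {'text': 'baz', 'persisted': 1, 'seconds_visible': 10},
]

print(count_persistent(tokens))                       # all tokens considered
print(count_persistent(tokens, exclude=str.isspace))  # whitespace ignored
```

Checking exclude before include is what gives it precedence: a token matched by both filters is skipped.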

mwpersistence.utilities.dump2stats(dump, *args, **kwargs)
mwpersistence.utilities.revdocs2stats(rev_docs, diff_engine, namespaces, timeout, window_size, revert_radius, sunset, min_persisted, min_visible, include, exclude, keep_text=False, keep_diff=False, keep_tokens=False, verbose=False)