Processing functions
mwpersistence.utilities.diffs2persistence(rev_docs, window_size=50, revert_radius=15, sunset=None, verbose=False)
Processes a sorted and page-partitioned sequence of revision documents and adds a ‘persistence’ field to each one, containing statistics about how each token “added” in the revision persisted through future revisions.
Parameters: - rev_docs : iterable ( dict )
JSON documents of revision data containing a ‘diff’ field as generated by dump2diffs. It’s assumed that rev_docs are partitioned by page and otherwise in chronological order.
- window_size : int
The size of the window of revisions from which persistence data will be generated.
- revert_radius : int
The number of revisions back that a revert can reference.
- sunset : mwtypes.Timestamp
The date of the database dump we are generating from. This is used to apply a ‘time visible’ statistic. If not set, now() will be assumed.
- keep_diff : bool
Do not drop the diff field from the revision document after processing is complete.
- verbose : bool
Prints progress indicators (dots) to stderr.
Returns: A generator of rev_docs with a ‘persistence’ field containing statistics about individual tokens.
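The ordering assumption on rev_docs (partitioned by page, chronological within each page) can be checked before processing. The following is a minimal sketch and not part of mwpersistence: the helper name check_partitioned_and_sorted, its page_key and time_key arguments, and the ‘page’ and ‘timestamp’ layout of the sample documents are all assumptions made for illustration.

```python
def check_partitioned_and_sorted(rev_docs, page_key, time_key):
    """Yield rev_docs unchanged; raise ValueError if the order is violated.

    Hypothetical helper for illustration, not part of mwpersistence.
    """
    seen_pages = set()
    current_page = None
    last_time = None
    first = True
    for doc in rev_docs:
        page = page_key(doc)
        if first or page != current_page:
            # A page that reappears after another page breaks partitioning.
            if page in seen_pages:
                raise ValueError("page %r is not contiguous" % (page,))
            seen_pages.add(page)
            current_page = page
            last_time = None
            first = False
        stamp = time_key(doc)
        # Within one page, timestamps must never go backwards.
        if last_time is not None and stamp < last_time:
            raise ValueError("page %r is not in chronological order" % (page,))
        last_time = stamp
        yield doc


# Assumed document layout: a page id plus an ISO 8601 UTC timestamp string,
# which sorts chronologically when compared as text.
sample_docs = [
    {"page": {"id": 1}, "timestamp": "2015-01-01T00:00:00Z"},
    {"page": {"id": 1}, "timestamp": "2015-01-02T00:00:00Z"},
    {"page": {"id": 2}, "timestamp": "2014-06-01T00:00:00Z"},
]

checked = list(check_partitioned_and_sorted(
    sample_docs,
    page_key=lambda doc: doc["page"]["id"],
    time_key=lambda doc: doc["timestamp"],
))
```

Because the helper is itself a generator, it could wrap the iterable passed as rev_docs without loading every document into memory.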
mwpersistence.utilities.persistence2stats(rev_docs, min_persisted=5, min_visible=1209600, include=None, exclude=None, verbose=False)
Processes a sorted and page-partitioned sequence of revision documents and adds statistics to the ‘persistence’ field describing how each token “added” in the revision persisted through future revisions.
Parameters: - rev_docs : iterable ( dict )
JSON documents of revision data containing a ‘diff’ field as generated by dump2diffs. It’s assumed that rev_docs are partitioned by page and otherwise in chronological order.
- window_size : int
The size of the window of revisions from which persistence data will be generated.
- min_persisted : int
The minimum number of future revisions through which a token must persist in order to be considered “persistent”.
- min_visible : int
The minimum number of seconds that a token must be visible in order to be considered “persistent”. The default of 1209600 seconds corresponds to 14 days.
- include : func
A function that returns True when a token should be included in statistical processing.
- exclude : func
A function that returns True when a token should not be included in statistical processing (Takes precedence over ‘include’)
- verbose : bool
Prints progress indicators (dots) to stderr.
Returns: A generator of rev_docs with a ‘persistence’ field containing statistics about individual tokens.
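The precedence rule for include and exclude described above can be sketched as a pair of predicates. This is an illustration only, not the library's implementation: make_token_filter, is_word and is_number are hypothetical names, and treating each token as a plain string is an assumption, since the exact value passed to the callbacks is not stated here.

```python
import re


def make_token_filter(include=None, exclude=None):
    """Combine two predicates so that exclude wins over include.

    Hypothetical helper for illustration, not part of mwpersistence.
    """
    # Assumed defaults: include everything, exclude nothing.
    if include is None:
        include = lambda token: True
    if exclude is None:
        exclude = lambda token: False

    def keep(token):
        # A token is dropped as soon as exclude matches, whatever include says.
        return (not exclude(token)) and include(token)

    return keep


def is_word(token):
    """True for tokens made up only of word characters."""
    return re.fullmatch(r"\w+", token) is not None


def is_number(token):
    """True for tokens made up only of digits."""
    return token.isdigit()


keep = make_token_filter(include=is_word, exclude=is_number)
kept = [token for token in ["Hello", "2015", ",", "world"] if keep(token)]
```

Here "2015" satisfies is_word but is still dropped because is_number matches it, which mirrors the statement that exclude takes precedence over include.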