Algorithms

The primary use-case of this library is to detect differences between two sequences of tokens. So far, two such algorithmic strategies are available:

sequence_matcher
implementes diff() that will compare two sequences of Token and return a set of operations.
segment_matcher
implementes diff() that uses a Segmenter to detect block moves

Both of these algorithms are supplimented with a deltas.DiffEngine for efficiently processing several revisions of the same text

Implemented Algorithms

Segment Matcher

Performs a diffs using a tree of matchable segments in order to remain robust to content moves. This module supports the use of a custom Segmenter.

deltas.algorithms.segment_matcher.diff(a, b, segmenter=None)

Performs a diff comparison between two sequences of tokens (a and b) using segmenter to cluster and match deltas.MatchableSegment.

Example:
>>> from deltas import segment_matcher, text_split
>>>
>>> a = text_split.tokenize("This is some text.  This is some other text.")
>>> b = text_split.tokenize("This is some other text.  This is some text.")
>>> operations = segment_matcher.diff(a, b)
>>>
>>> for op in operations:
...     print(op.name, repr(''.join(a[op.a1:op.a2])),
...           repr(''.join(b[op.b1:op.b2])))
...
equal 'This is some other text.' 'This is some other text.'
insert '' '  '
equal 'This is some text.' 'This is some text.'
delete '  ' ''
Parameters:
a : list`(:class:`deltas.tokenizers.Token)

Initial sequence

b : list`(:class:`deltas.tokenizers.Token)

Changed sequence

segmenter : deltas.Segmenter

A segmenter to use on the tokens.

Returns:

An iterable of operations.

deltas.algorithms.segment_matcher.diff_segments(a_segments, b_segments)

Performs a diff comparison between two pre-clustered deltas.Segment trees. In most cases, segmentation takes 100X more time than actually performing the diff.

Parameters:
a_segments : deltas.Segment

An initial sequence

b_segments : deltas.Segment

A changed sequence

Returns:

An iterable of operations.

deltas.algorithms.segment_matcher.process(texts, *args, **kwargs)

Processes a single sequence of texts with a SegmentMatcher.

Parameters:
texts : iterable`(`str)

sequence of texts

args : tuple

passed to SegmentMatcher‘s constructor

kwaths : dict

passed to SegmentMatcher‘s constructor

class deltas.SegmentMatcher(tokenizer=None, segmenter=None)

Constructs a segment matcher diff engine that preserves segmentation state and is able to process changes sequentially. When detecting changes across many versions of a text this strategy will be about twice as fast as calling diff() sequentially.

Example:
>>> from deltas import SegmentMatcher
>>> from deltas import text_split
>>>
>>> engine = SegmentMatcher(text_split)
>>>
>>> processor = engine.processor()
>>> ops, a, b = processor.process("This is a version.  It has some " +
...                               "text in it.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version.  It has some text in it.'
>>> ops, a, b = processor.process("This is a version.  However, it " +
...                               "has different.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version.  ' '' 'However, it' ' has ' '' 'different' '.'
>>> ops, a, b = processor.process("Switching it up here.  This is a " +
...                               "version.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'' 'Switching' ' it ' '' 'up' ' ' '' 'here' '.' '  ' 'This is a version.'
class Processor(tokenizer=None, segmenter=None, last_text=None, last_tokens=None, last_segments=None)

A processor used by the SegmentMatcher difference engine to track the history of a single text.

process(text, token_class=<class 'deltas.tokenizers.token.Token'>)

Processes a new version of a text and returns the delta.

Parameters:
text : str

The text to process

Returns:

A tuple of operations, a_tokens, b_tokens

SegmentMatcher.processor(*args, **kwargs)

Constructs and configures a processor to process versions of a text.

Sequence Matcher

Sequence Matcher

Performs a simple longest-common-substring diff. This module implements a simple wrapper around difflib.SequenceMatcher.

deltas.algorithms.sequence_matcher.diff(a, b)

Performs a longest common substring diff.

Parameters:
a : sequence of comparable

Initial sequence

b : sequence of comparable

Changed sequence

Returns:

An iterable of operations.

deltas.algorithms.sequence_matcher.process(texts, *args, **kwargs)
class deltas.SequenceMatcher(tokenizer=None)

Constructs a sequence matching diff engine that preserves verion state and is able to process changes sequentially. When detecting changes across many versions of a text this strategy will be about twice as fast as calling diff() sequentially.

Example:
>>> from deltas.algorithms import SequenceMatcher
>>> engine = SequenceMatcher()
>>>
>>> processor = engine.processor()
>>> ops, a, b = processor.process("This is a version.  It has some " +
...                               "text in it.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version.  It has some text in it.'
>>> ops, a, b = processor.process("This is a version.  However, it " +
...                               "has different.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version.  ' '' 'However, it' ' has ' '' 'different' '.'
>>> ops, a, b = processor.process("Switching it up here.  This is " +
...                               "a version.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'Switching it up here.  ' 'This is a version.' ''

Diff engine

A deltas.DiffEngine implements a streaming diff strategy that applies a diff algorithm to a sequence of text revisions. By maintaining an internal state, various efficiencies can be implemented.

class deltas.DiffEngine

Constructs a diff engine.

class Processor

Constructs a new diff processor for processing many versions of a single text.

classmethod DiffEngine.from_config(config, name, section_key='diff_engines')

Constructs a deltas.DiffEngine from a configuration doc.

DiffEngine.processor()

Configures and returns a new Processor