Algorithms¶
The primary use-case of this library is to detect differences between two sequences of tokens. So far, two such algorithmic strategies are available:

- sequence_matcher - implements a diff() that compares two sequences of Token and returns a set of operations.
- segment_matcher - implements a diff() that uses a Segmenter to detect block moves.

Both of these algorithms are supplemented with a deltas.DiffEngine for efficiently processing several revisions of the same text.
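A minimal sketch contrasting the two strategies on a block-move edit, reusing only interfaces shown on this page (text_split.tokenize(), diff(), and the op.name / op.a1 / op.a2 / op.b1 / op.b2 fields of an operation). The exact operations emitted may vary between versions, so the comments describe tendencies rather than guaranteed output.

  from deltas import segment_matcher, text_split
  from deltas.algorithms import sequence_matcher

  a = text_split.tokenize("This is some text. This is some other text.")
  b = text_split.tokenize("This is some other text. This is some text.")

  def show(operations):
      # Print each operation with the text it covers on both sides.
      for op in operations:
          print(op.name, repr(''.join(a[op.a1:op.a2])),
                repr(''.join(b[op.b1:op.b2])))

  # LCS-based diff: a moved sentence tends to appear as a delete on one
  # side and an insert on the other.
  show(sequence_matcher.diff(a, b))

  # Segment-based diff: the moved sentence can be matched as "equal" in
  # its new position.
  show(segment_matcher.diff(a, b))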
Implemented Algorithms¶
Segment Matcher¶
Performs a diff using a tree of matchable segments in order to remain robust
to content moves. This module supports the use of a custom Segmenter.
- deltas.algorithms.segment_matcher.diff(a, b, segmenter=None)¶
  Performs a diff comparison between two sequences of tokens (a and b) using segmenter to cluster and match deltas.MatchableSegment.

  Example:

  >>> from deltas import segment_matcher, text_split
  >>>
  >>> a = text_split.tokenize("This is some text. This is some other text.")
  >>> b = text_split.tokenize("This is some other text. This is some text.")
  >>> operations = segment_matcher.diff(a, b)
  >>>
  >>> for op in operations:
  ...     print(op.name, repr(''.join(a[op.a1:op.a2])),
  ...           repr(''.join(b[op.b1:op.b2])))
  ...
  equal 'This is some other text.' 'This is some other text.'
  insert '' ' '
  equal 'This is some text.' 'This is some text.'
  delete ' ' ''

  Parameters:
  - a : list(deltas.tokenizers.Token)
    Initial sequence
  - b : list(deltas.tokenizers.Token)
    Changed sequence
  - segmenter : deltas.Segmenter
    A segmenter to use on the tokens.

  Returns: An iterable of operations.
- deltas.algorithms.segment_matcher.diff_segments(a_segments, b_segments)¶
  Performs a diff comparison between two pre-clustered deltas.Segment trees. In most cases, segmentation takes 100X more time than actually performing the diff.

  Parameters:
  - a_segments : deltas.Segment
    An initial sequence
  - b_segments : deltas.Segment
    A changed sequence

  Returns: An iterable of operations.
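A hedged sketch of pre-segmenting token sequences before calling diff_segments(). It assumes the Segmenter interface exposes a segment(tokens) method that produces the Segment trees expected here; that method name is not confirmed on this page, so check the Segmenter documentation before relying on it.

  from deltas import segment_matcher, text_split

  # Assumption: `segmenter` is a configured deltas.Segmenter exposing a
  # segment(tokens) method that builds the trees diff_segments() takes.
  def diff_presegmented(segmenter, a_text, b_text):
      a_segments = segmenter.segment(text_split.tokenize(a_text))
      b_segments = segmenter.segment(text_split.tokenize(b_text))
      # Segmentation above is the expensive step; the diff itself is
      # comparatively cheap.
      return segment_matcher.diff_segments(a_segments, b_segments)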
- deltas.algorithms.segment_matcher.process(texts, *args, **kwargs)¶
  Processes a single sequence of texts with a SegmentMatcher.

  Parameters:
  - texts : iterable(str)
    sequence of texts
  - args : tuple
    passed to SegmentMatcher's constructor
  - kwargs : dict
    passed to SegmentMatcher's constructor
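A hedged sketch of batch-processing revisions with process(). It assumes the function yields one (operations, a_tokens, b_tokens) tuple per input text, mirroring Processor.process() documented below; this page does not state the return shape, so verify it before relying on this pattern.

  from deltas import segment_matcher

  revisions = [
      "This is a version. It has some text in it.",
      "This is a version. However, it has different.",
  ]

  # Assumption: process() yields (operations, a_tokens, b_tokens) per text.
  for ops, a, b in segment_matcher.process(revisions):
      print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))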
- class deltas.SegmentMatcher(tokenizer=None, segmenter=None)¶
  Constructs a segment matcher diff engine that preserves segmentation state and is able to process changes sequentially. When detecting changes across many versions of a text, this strategy will be about twice as fast as calling diff() sequentially.

  Example:

  >>> from deltas import SegmentMatcher
  >>> from deltas import text_split
  >>>
  >>> engine = SegmentMatcher(text_split)
  >>>
  >>> processor = engine.processor()
  >>> ops, a, b = processor.process("This is a version. It has some " +
  ...                               "text in it.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. It has some text in it.'
  >>> ops, a, b = processor.process("This is a version. However, it " +
  ...                               "has different.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
  >>> ops, a, b = processor.process("Switching it up here. This is a " +
  ...                               "version.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  '' 'Switching' ' it ' '' 'up' ' ' '' 'here' '.' ' ' 'This is a version.'
  - class Processor(tokenizer=None, segmenter=None, last_text=None, last_tokens=None, last_segments=None)¶
    A processor used by the SegmentMatcher difference engine to track the history of a single text.

    - process(text, token_class=<class 'deltas.tokenizers.token.Token'>)¶
      Processes a new version of a text and returns the delta.

      Parameters:
      - text : str
        The text to process

      Returns: A tuple of operations, a_tokens, b_tokens
  - SegmentMatcher.processor(*args, **kwargs)¶
    Constructs and configures a processor to process versions of a text.
Sequence Matcher¶
Performs a simple longest-common-substring diff. This module implements a
simple wrapper around difflib.SequenceMatcher.
- deltas.algorithms.sequence_matcher.diff(a, b)¶
  Performs a longest common substring diff.

  Parameters:
  - a : sequence of comparable
    Initial sequence
  - b : sequence of comparable
    Changed sequence

  Returns: An iterable of operations.
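A brief sketch, patterned on the segment_matcher.diff() example above, showing the same operation interface on a simple insertion edit; the exact operations produced are not asserted here.

  from deltas import text_split
  from deltas.algorithms import sequence_matcher

  a = text_split.tokenize("This is some text.")
  b = text_split.tokenize("This is some new text.")

  for op in sequence_matcher.diff(a, b):
      # Each operation exposes name, a1, a2, b1 and b2, just like the
      # segment_matcher operations above.
      print(op.name, repr(''.join(a[op.a1:op.a2])),
            repr(''.join(b[op.b1:op.b2])))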
- deltas.algorithms.sequence_matcher.process(texts, *args, **kwargs)¶
  Processes a single sequence of texts with a SequenceMatcher.
- class deltas.SequenceMatcher(tokenizer=None)¶
  Constructs a sequence matching diff engine that preserves version state and is able to process changes sequentially. When detecting changes across many versions of a text, this strategy will be about twice as fast as calling diff() sequentially.

  Example:

  >>> from deltas.algorithms import SequenceMatcher
  >>> engine = SequenceMatcher()
  >>>
  >>> processor = engine.processor()
  >>> ops, a, b = processor.process("This is a version. It has some " +
  ...                               "text in it.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. It has some text in it.'
  >>> ops, a, b = processor.process("This is a version. However, it " +
  ...                               "has different.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
  >>> ops, a, b = processor.process("Switching it up here. This is " +
  ...                               "a version.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'Switching it up here. ' 'This is a version.' ''
Diff engine¶
A deltas.DiffEngine
implements a streaming diff strategy that applies
a diff algorithm to a sequence of text revisions. By maintaining an internal
state, various efficiencies can be implemented.
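Both SegmentMatcher and SequenceMatcher above implement this interface. The sketch below shows the generic streaming loop, using only calls that appear in their examples (engine.processor() and processor.process()).

  from deltas import SegmentMatcher, text_split

  # Any DiffEngine works here; SegmentMatcher is used because its example
  # above demonstrates the same calls.
  engine = SegmentMatcher(text_split)
  processor = engine.processor()

  revisions = [
      "This is a version. It has some text in it.",
      "This is a version. However, it has different.",
  ]

  for text in revisions:
      # The processor keeps state between calls, which the library reports
      # is about twice as fast as diffing each pair of revisions from scratch.
      ops, a, b = processor.process(text)
      print([op.name for op in ops])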
- class deltas.DiffEngine¶
  Constructs a diff engine.

  - class Processor¶
    Constructs a new diff processor for processing many versions of a single text.
- classmethod DiffEngine.from_config(config, name, section_key='diff_engines')¶
  Constructs a deltas.DiffEngine from a configuration doc.
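The shape of the configuration doc is not documented on this page. The sketch below is a guess modeled on the section_key default and on common "class"-keyed configuration conventions; the nested keys are assumptions, not the library's confirmed schema.

  from deltas import DiffEngine

  # Hypothetical configuration doc; the "class" key and the nesting are
  # assumed, not confirmed by this page.
  config = {
      "diff_engines": {
          "segment_matcher": {
              "class": "deltas.SegmentMatcher",
          }
      }
  }

  engine = DiffEngine.from_config(config, "segment_matcher")
  processor = engine.processor()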