# Algorithms¶

The primary use case of this library is to detect differences between two sequences of tokens. So far, two such algorithmic strategies are available:

sequence_matcher
implements diff(), which compares two sequences of Token and returns a set of operations.
segment_matcher
implements diff(), which uses a Segmenter to detect block moves.

Both of these algorithms are supplemented by a deltas.DiffEngine for efficiently processing several revisions of the same text.

## Implemented Algorithms¶

### Segment Matcher¶

Performs a diff using a tree of matchable segments in order to remain robust to content moves. This module supports the use of a custom Segmenter.

deltas.algorithms.segment_matcher.diff(a, b, segmenter=None)

Performs a diff comparison between two sequences of tokens (a and b) using segmenter to cluster and match deltas.MatchableSegment.

Example:

```python
>>> from deltas import segment_matcher, text_split
>>>
>>> a = text_split.tokenize("This is some text. This is some other text.")
>>> b = text_split.tokenize("This is some other text. This is some text.")
>>> operations = segment_matcher.diff(a, b)
>>>
>>> for op in operations:
...     print(op.name, repr(''.join(a[op.a1:op.a2])),
...           repr(''.join(b[op.b1:op.b2])))
...
equal 'This is some other text.' 'This is some other text.'
insert '' ' '
equal 'This is some text.' 'This is some text.'
delete ' ' ''
```

Parameters:

- a (list of deltas.tokenizers.Token): Initial sequence
- b (list of deltas.tokenizers.Token): Changed sequence
- segmenter (deltas.Segmenter): A segmenter to use on the tokens

Returns: An iterable of operations.
deltas.algorithms.segment_matcher.diff_segments(a_segments, b_segments)

Performs a diff comparison between two pre-clustered deltas.Segment trees. In most cases, segmentation takes 100X more time than actually performing the diff.

Parameters:

- a_segments (deltas.Segment): An initial sequence
- b_segments (deltas.Segment): A changed sequence

Returns: An iterable of operations.
deltas.algorithms.segment_matcher.process(texts, *args, **kwargs)

Processes a single sequence of texts with a SegmentMatcher.

Parameters:

- texts (iterable of str): Sequence of texts
- args (tuple): Passed to the SegmentMatcher constructor
- kwargs (dict): Passed to the SegmentMatcher constructor
class deltas.SegmentMatcher(tokenizer=None, segmenter=None)

Constructs a segment matcher diff engine that preserves segmentation state and is able to process changes sequentially. When detecting changes across many versions of a text this strategy will be about twice as fast as calling diff() sequentially.

Example:

```python
>>> from deltas import SegmentMatcher
>>> from deltas import text_split
>>>
>>> engine = SegmentMatcher(text_split)
>>>
>>> processor = engine.processor()
>>> ops, a, b = processor.process("This is a version. It has some " +
...                               "text in it.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version. It has some text in it.'
>>> ops, a, b = processor.process("This is a version. However, it " +
...                               "has different.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
>>> ops, a, b = processor.process("Switching it up here. This is a " +
...                               "version.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'' 'Switching' ' it ' '' 'up' ' ' '' 'here' '.' ' ' 'This is a version.'
```
class Processor(tokenizer=None, segmenter=None, last_text=None, last_tokens=None, last_segments=None)

A processor used by the SegmentMatcher difference engine to track the history of a single text.

process(text, token_class=<class 'deltas.tokenizers.token.Token'>)

Processes a new version of a text and returns the delta.

Parameters:

- text (str): The text to process

Returns: A tuple of (operations, a_tokens, b_tokens)
SegmentMatcher.processor(*args, **kwargs)

Constructs and configures a processor to process versions of a text.

### Sequence Matcher¶


Performs a simple longest-common-substring diff. This module implements a simple wrapper around difflib.SequenceMatcher.

deltas.algorithms.sequence_matcher.diff(a, b)

Performs a longest common substring diff.

Parameters:

- a (sequence of comparable): Initial sequence
- b (sequence of comparable): Changed sequence

Returns: An iterable of operations.
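Since this module is described above as a simple wrapper around difflib.SequenceMatcher, the shape of the operations it produces can be illustrated with the standard library alone. The sketch below is illustrative only: the whitespace split() stands in for a real deltas tokenizer, and it shows why a longest-common-substring strategy reports a moved block as a delete plus an insert rather than detecting the move.

```python
from difflib import SequenceMatcher

# Two token sequences in which a sentence-sized block has moved.
a = "This is some text . This is some other text .".split()
b = "This is some other text . This is some text .".split()

# get_opcodes() yields (tag, a1, a2, b1, b2) tuples where the tag is
# one of 'equal', 'insert', 'delete', or 'replace'. The moved block
# appears as a delete on one side and an insert on the other.
matcher = SequenceMatcher(None, a, b)
for tag, a1, a2, b1, b2 in matcher.get_opcodes():
    print(tag, a[a1:a2], b[b1:b2])
```

Note that the opcodes partition both sequences: concatenating the b-side spans of every opcode reconstructs b exactly.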
deltas.algorithms.sequence_matcher.process(texts, *args, **kwargs)

Processes a single sequence of texts with a SequenceMatcher.
class deltas.SequenceMatcher(tokenizer=None)

Constructs a sequence matching diff engine that preserves version state and is able to process changes sequentially. When detecting changes across many versions of a text this strategy will be about twice as fast as calling diff() sequentially.

Example:

```python
>>> from deltas.algorithms import SequenceMatcher
>>> engine = SequenceMatcher()
>>>
>>> processor = engine.processor()
>>> ops, a, b = processor.process("This is a version. It has some " +
...                               "text in it.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version. It has some text in it.'
>>> ops, a, b = processor.process("This is a version. However, it " +
...                               "has different.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
>>> ops, a, b = processor.process("Switching it up here. This is " +
...                               "a version.")
>>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
'Switching it up here. ' 'This is a version.' ''
```

## Diff engine¶

A deltas.DiffEngine implements a streaming diff strategy that applies a diff algorithm to a sequence of text revisions. By maintaining an internal state, various efficiencies can be implemented.

class deltas.DiffEngine

Constructs a diff engine.

class Processor

Constructs a new diff processor for processing many versions of a single text.

classmethod DiffEngine.from_config(config, name, section_key='diff_engines')

Constructs a deltas.DiffEngine from a configuration doc.
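The shape of such a configuration doc can be sketched as follows. This is an assumption-laden example: the nesting under section_key mirrors common yamlconf-style configuration, the 'segment_matcher' name is hypothetical, and the exact keys accepted depend on the engine class named under 'class'.

```python
# Hypothetical configuration doc. The 'diff_engines' key matches the
# default section_key; 'class' names the DiffEngine subclass to build.
config = {
    'diff_engines': {
        'segment_matcher': {
            'class': 'deltas.SegmentMatcher',
        }
    }
}

# Given such a doc, an engine could then be constructed by name:
# engine = DiffEngine.from_config(config, 'segment_matcher')
```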

DiffEngine.processor()

Configures and returns a new Processor.