Algorithms¶
The primary use-case of this library is to detect differences between two sequences of tokens. So far, two such algorithmic strategies are available:

- sequence_matcher - implements a diff() that compares two sequences of Token and returns a set of operations.
- segment_matcher - implements a diff() that uses a Segmenter to detect block moves.

Both of these algorithms are supplemented with a deltas.DiffEngine for efficiently processing several revisions of the same text.
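A minimal sketch contrasting the two strategies on a block-move edit, reusing only interfaces shown on this page (text_split.tokenize(), diff(), and the op.name / op.a1 / op.a2 / op.b1 / op.b2 fields of an operation). The exact operations emitted may vary between versions, so the comments describe tendencies rather than guaranteed output.

  from deltas import segment_matcher, text_split
  from deltas.algorithms import sequence_matcher

  a = text_split.tokenize("This is some text. This is some other text.")
  b = text_split.tokenize("This is some other text. This is some text.")

  def show(operations):
      # Print each operation with the text it covers on both sides.
      for op in operations:
          print(op.name, repr(''.join(a[op.a1:op.a2])),
                repr(''.join(b[op.b1:op.b2])))

  # LCS-based diff: a moved sentence tends to appear as a delete on one
  # side and an insert on the other.
  show(sequence_matcher.diff(a, b))

  # Segment-based diff: the moved sentence can be matched as "equal" in
  # its new position.
  show(segment_matcher.diff(a, b))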
Implemented Algorithms¶
Segment Matcher¶
Performs a diff using a tree of matchable segments in order to remain robust
to content moves. This module supports the use of a custom Segmenter.
- deltas.algorithms.segment_matcher.diff(a, b, segmenter=None)¶
  Performs a diff comparison between two sequences of tokens (a and b) using segmenter to cluster and match deltas.MatchableSegment.

  Example:

  >>> from deltas import segment_matcher, text_split
  >>>
  >>> a = text_split.tokenize("This is some text. This is some other text.")
  >>> b = text_split.tokenize("This is some other text. This is some text.")
  >>> operations = segment_matcher.diff(a, b)
  >>>
  >>> for op in operations:
  ...     print(op.name, repr(''.join(a[op.a1:op.a2])),
  ...           repr(''.join(b[op.b1:op.b2])))
  ...
  equal 'This is some other text.' 'This is some other text.'
  insert '' ' '
  equal 'This is some text.' 'This is some text.'
  delete ' ' ''

  Parameters:
  - a : list(deltas.tokenizers.Token)
    Initial sequence
  - b : list(deltas.tokenizers.Token)
    Changed sequence
  - segmenter : deltas.Segmenter
    A segmenter to use on the tokens.

  Returns: An iterable of operations.
- deltas.algorithms.segment_matcher.diff_segments(a_segments, b_segments)¶
  Performs a diff comparison between two pre-clustered deltas.Segment trees. In most cases, segmentation takes 100X more time than actually performing the diff.

  Parameters:
  - a_segments : deltas.Segment
    An initial sequence
  - b_segments : deltas.Segment
    A changed sequence

  Returns: An iterable of operations.
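A hedged sketch of pre-segmenting token sequences before calling diff_segments(). It assumes the Segmenter interface exposes a segment(tokens) method that produces the Segment trees expected here; that method name is not confirmed on this page, so check the Segmenter documentation before relying on it.

  from deltas import segment_matcher, text_split

  # Assumption: `segmenter` is a configured deltas.Segmenter exposing a
  # segment(tokens) method that builds the trees diff_segments() takes.
  def diff_presegmented(segmenter, a_text, b_text):
      a_segments = segmenter.segment(text_split.tokenize(a_text))
      b_segments = segmenter.segment(text_split.tokenize(b_text))
      # Segmentation above is the expensive step; the diff itself is
      # comparatively cheap.
      return segment_matcher.diff_segments(a_segments, b_segments)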
- deltas.algorithms.segment_matcher.process(texts, *args, **kwargs)¶
  Processes a single sequence of texts with a SegmentMatcher.

  Parameters:
  - texts : iterable(str)
    sequence of texts
  - args : tuple
    passed to SegmentMatcher's constructor
  - kwargs : dict
    passed to SegmentMatcher's constructor
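A hedged sketch of batch-processing revisions with process(). It assumes the function yields one (operations, a_tokens, b_tokens) tuple per input text, mirroring Processor.process() documented below; this page does not state the return shape, so verify it before relying on this pattern.

  from deltas import segment_matcher

  revisions = [
      "This is a version. It has some text in it.",
      "This is a version. However, it has different.",
  ]

  # Assumption: process() yields (operations, a_tokens, b_tokens) per text.
  for ops, a, b in segment_matcher.process(revisions):
      print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))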
- class deltas.SegmentMatcher(tokenizer=None, segmenter=None)¶
  Constructs a segment matcher diff engine that preserves segmentation state and is able to process changes sequentially. When detecting changes across many versions of a text, this strategy will be about twice as fast as calling diff() sequentially.

  Example:

  >>> from deltas import SegmentMatcher
  >>> from deltas import text_split
  >>>
  >>> engine = SegmentMatcher(text_split)
  >>>
  >>> processor = engine.processor()
  >>> ops, a, b = processor.process("This is a version. It has some " +
  ...                               "text in it.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. It has some text in it.'
  >>> ops, a, b = processor.process("This is a version. However, it " +
  ...                               "has different.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
  >>> ops, a, b = processor.process("Switching it up here. This is a " +
  ...                               "version.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  '' 'Switching' ' it ' '' 'up' ' ' '' 'here' '.' ' ' 'This is a version.'
  - class Processor(tokenizer=None, segmenter=None, last_text=None, last_tokens=None, last_segments=None)¶
    A processor used by the SegmentMatcher difference engine to track the history of a single text.

    - process(text, token_class=<class 'deltas.tokenizers.token.Token'>)¶
      Processes a new version of a text and returns the delta.

      Parameters:
      - text : str
        The text to process

      Returns: A tuple of operations, a_tokens, b_tokens
  - SegmentMatcher.processor(*args, **kwargs)¶
    Constructs and configures a processor to process versions of a text.
Sequence Matcher¶
Performs a simple longest-common-substring diff. This module implements a
simple wrapper around difflib.SequenceMatcher.
- deltas.algorithms.sequence_matcher.diff(a, b)¶
  Performs a longest common substring diff.

  Parameters:
  - a : sequence of comparable
    Initial sequence
  - b : sequence of comparable
    Changed sequence

  Returns: An iterable of operations.
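A brief sketch, patterned on the segment_matcher.diff() example above, showing the same operation interface on a simple insertion edit; the exact operations produced are not asserted here.

  from deltas import text_split
  from deltas.algorithms import sequence_matcher

  a = text_split.tokenize("This is some text.")
  b = text_split.tokenize("This is some new text.")

  for op in sequence_matcher.diff(a, b):
      # Each operation exposes name, a1, a2, b1 and b2, just like the
      # segment_matcher operations above.
      print(op.name, repr(''.join(a[op.a1:op.a2])),
            repr(''.join(b[op.b1:op.b2])))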
- deltas.algorithms.sequence_matcher.process(texts, *args, **kwargs)¶
  Processes a single sequence of texts with a SequenceMatcher.
- class deltas.SequenceMatcher(tokenizer=None)¶
  Constructs a sequence matching diff engine that preserves version state and is able to process changes sequentially. When detecting changes across many versions of a text, this strategy will be about twice as fast as calling diff() sequentially.

  Example:

  >>> from deltas.algorithms import SequenceMatcher
  >>> engine = SequenceMatcher()
  >>>
  >>> processor = engine.processor()
  >>> ops, a, b = processor.process("This is a version. It has some " +
  ...                               "text in it.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. It has some text in it.'
  >>> ops, a, b = processor.process("This is a version. However, it " +
  ...                               "has different.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'This is a version. ' '' 'However, it' ' has ' '' 'different' '.'
  >>> ops, a, b = processor.process("Switching it up here. This is " +
  ...                               "a version.")
  >>> print(" ".join(repr(''.join(b[op.b1:op.b2])) for op in ops))
  'Switching it up here. ' 'This is a version.' ''
Diff engine¶
A deltas.DiffEngine
implements a streaming diff strategy that applies
a diff algorithm to a sequence of text revisions. By maintaining an internal
state, various efficiencies can be implemented.
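Both SegmentMatcher and SequenceMatcher above implement this interface. The sketch below shows the generic streaming loop, using only calls that appear in their examples (engine.processor() and processor.process()).

  from deltas import SegmentMatcher, text_split

  # Any DiffEngine works here; SegmentMatcher is used because its example
  # above demonstrates the same calls.
  engine = SegmentMatcher(text_split)
  processor = engine.processor()

  revisions = [
      "This is a version. It has some text in it.",
      "This is a version. However, it has different.",
  ]

  for text in revisions:
      # The processor keeps state between calls, which the library reports
      # is about twice as fast as diffing each pair of revisions from scratch.
      ops, a, b = processor.process(text)
      print([op.name for op in ops])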
- class deltas.DiffEngine¶
  Constructs a diff engine.

  - class Processor¶
    Constructs a new diff processor for processing many versions of a single text.
- classmethod DiffEngine.from_config(config, name, section_key='diff_engines')¶
  Constructs a deltas.DiffEngine from a configuration doc.
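The shape of the configuration doc is not documented on this page. The sketch below is a guess modeled on the section_key default and on common "class"-keyed configuration conventions; the nested keys are assumptions, not the library's confirmed schema.

  from deltas import DiffEngine

  # Hypothetical configuration doc; the "class" key and the nesting are
  # assumed, not confirmed by this page.
  config = {
      "diff_engines": {
          "segment_matcher": {
              "class": "deltas.SegmentMatcher",
          }
      }
  }

  engine = DiffEngine.from_config(config, "segment_matcher")
  processor = engine.processor()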