Usage
Module for merging JSON objects.
To use this module you need to first import the main class:
>>> from json_merger import Merger
Then, import the configuration options:
>>> from json_merger.config import UnifierOps, DictMergerOps
The Basic Use Case
Let’s assume we have JSON records that don’t have any list fields: they have string keys, and their values are other objects or primitive types.
In order to perform a merge we assume we have a lowest common ancestor (root), a current version (head) and another version which we want to integrate into our record (update).
>>> root = {'name': 'John'} # A common ancestor of our person record
>>> head = {'name': 'Johnny', 'age': 32} # The current version of the record.
>>> update = {'name': 'Jonathan', 'address': 'Home'} # An updated version.
In this case we want to use the merger to compute one of the possible versions.
We create a merger instance in which we provide the default operation for non-list fields and the one for list fields.
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_HEAD_FIRST)
... # Ignore UnifierOps for now.
>>> # We might get some exceptions
>>> from json_merger.errors import MergeError
>>> try:
... m.merge()
... except MergeError:
... pass # We don't care about this now.
>>> m.merged_root == {
... 'name': 'Johnny',
... 'age': 32,
... 'address': 'Home',
... }
True
The merged version kept the age field from the head object and the address field from the update object. The name field was different in the two versions, but because the strategy was FALLBACK_KEEP_HEAD the end result kept the value from head. To keep the value from update instead, use FALLBACK_KEEP_UPDATE:
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_UPDATE,
... UnifierOps.KEEP_ONLY_HEAD_ENTITIES)
>>> raised_something = False
>>> try:
... m.merge()
... except MergeError:
... raised_something = True
>>> m.merged_root == {
... 'name': 'Jonathan',
... 'age': 32,
... 'address': 'Home',
... }
True
If this type of conflict occurs, the merger will also populate a conflicts field. In this case the conflict holds the alternative name for our record.
Also, because a conflict occurred, the merge method raised a MergeError.
For all the types of conflict that the merge method can report, see the json_merger.conflict.ConflictType documentation.
>>> from json_merger.conflict import Conflict, ConflictType
>>> m.conflicts[0] == Conflict(ConflictType.SET_FIELD, ('name', ), 'Johnny')
True
>>> raised_something
True
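To make the three-way logic above concrete, here is a minimal plain-Python sketch for flat records. It is not json_merger's implementation: the function name three_way_merge, its keep parameter and the (key, value) conflict tuples are made up for this illustration, and nested objects and lists are ignored.

```python
_MISSING = object()  # marks a key that is absent from one of the versions


def three_way_merge(root, head, update, keep="head"):
    """Merge two flat dicts against their common ancestor.

    Returns (merged, conflicts), where conflicts is a list of
    (key, discarded_value) tuples. Hypothetical helper, for illustration only.
    """
    merged, conflicts = {}, []
    keys = list(head) + [k for k in update if k not in head]
    for key in keys:
        r = root.get(key, _MISSING)
        h = head.get(key, _MISSING)
        u = update.get(key, _MISSING)
        if h == u:        # both versions agree
            value = h
        elif h == r:      # only update touched this key
            value = u
        elif u == r:      # only head touched this key
            value = h
        else:             # both changed it differently: fall back and record it
            value, discarded = (h, u) if keep == "head" else (u, h)
            conflicts.append((key, discarded))
        if value is not _MISSING:
            merged[key] = value
    return merged, conflicts


root = {'name': 'John'}
head = {'name': 'Johnny', 'age': 32}
update = {'name': 'Jonathan', 'address': 'Home'}

merged_head, conflicts_head = three_way_merge(root, head, update, keep="head")
merged_update, conflicts_update = three_way_merge(root, head, update, keep="update")
```

With keep="head" the sketch keeps 'Johnny' and records 'Jonathan' as the discarded alternative; with keep="update" the roles are swapped, mirroring the two fallback strategies shown above.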
Merging Lists With Base Values
For this example we are going to assume we want to merge sets of badges that a person can receive.
>>> root = {'badges': ['bad', 'random']}
>>> head = {'badges': ['cool', 'nice', 'random']}
>>> update = {'badges': ['fun', 'nice', 'healthy']}
The simplest options are to keep either only the badges available in head or only the badges available in update. This can be done by specifying one of:
UnifierOps.KEEP_ONLY_HEAD_ENTITIES
UnifierOps.KEEP_ONLY_UPDATE_ENTITIES
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_ONLY_HEAD_ENTITIES)
>>> m.merge() # No conflict here
>>> m.merged_root['badges'] == ['cool', 'nice', 'random']
True
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_ONLY_UPDATE_ENTITIES)
>>> m.merge()
>>> m.merged_root['badges'] == ['fun', 'nice', 'healthy']
True
If we want to do a union of the elements we can use:
UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_HEAD_FIRST
UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_UPDATE_FIRST
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_HEAD_FIRST)
>>> m.merge() # No conflict here
>>> m.merged_root['badges'] == ['cool', 'fun', 'nice', 'random', 'healthy']
True
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_UPDATE_FIRST)
>>> m.merge()
>>> m.merged_root['badges'] == ['fun', 'cool', 'nice', 'healthy', 'random']
True
These options keep the order relations between the entities. For example, both 'fun' and 'cool' were placed before the 'nice' entity, but between the two of them there isn’t any restriction. In such cases, for KEEP_UPDATE_AND_HEAD_ENTITIES_HEAD_FIRST we first pick the elements that occur only in the head list, and for KEEP_UPDATE_AND_HEAD_ENTITIES_UPDATE_FIRST we first pick the ones that occur only in the update list. If no ordering consistent with both lists is possible, we first add the elements found in the prioritized list and then the remaining ones.
In that case the merge also reports a REORDER conflict.
>>> m = Merger([], [1, 2, 5, 3], [3, 1, 2, 4],
... DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_HEAD_FIRST)
>>> try:
... m.merge()
... except MergeError:
... pass
>>> m.merged_root == [1, 2, 5, 3, 4]
True
>>> m.conflicts == [Conflict(ConflictType.REORDER, (), None)]
True
>>> m = Merger([], [1, 2, 5, 3], [3, 1, 2, 4],
... DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_UPDATE_FIRST)
>>> try:
... m.merge()
... except MergeError:
... pass
>>> m.merged_root == [3, 1, 2, 4, 5]
True
>>> m.conflicts == [Conflict(ConflictType.REORDER, (), None)]
True
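The ordering rule described above can be sketched in plain Python. This is only an illustration under simplifying assumptions (lists without duplicates, entities compared with ==); ordered_union is a hypothetical helper, not part of json_merger, and the library's own unifier may work differently.

```python
def ordered_union(first, second):
    """Union of two duplicate-free lists, giving priority to `first`.

    Returns (merged, reordered). `reordered` is True when the shared
    elements appear in a different order in the two lists, so no merged
    order can respect both of them.
    """
    common_in_first = [x for x in first if x in second]
    common_in_second = [x for x in second if x in first]
    if common_in_first != common_in_second:
        # Fallback: the prioritized list as-is, then whatever is left.
        return list(first) + [x for x in second if x not in first], True

    merged, i, j = [], 0, 0
    for anchor in common_in_first:
        while first[i] != anchor:    # elements only in `first` go first
            merged.append(first[i])
            i += 1
        while second[j] != anchor:   # then elements only in `second`
            merged.append(second[j])
            j += 1
        merged.append(anchor)
        i += 1
        j += 1
    merged.extend(first[i:])
    merged.extend(second[j:])
    return merged, False


head_badges = ['cool', 'nice', 'random']
update_badges = ['fun', 'nice', 'healthy']
head_first = ordered_union(head_badges, update_badges)
update_first = ordered_union(update_badges, head_badges)
clash = ordered_union([1, 2, 5, 3], [3, 1, 2, 4])
```

Passing head as the first argument corresponds to the HEAD_FIRST behaviour and passing update first corresponds to UPDATE_FIRST; the boolean plays the role of the REORDER conflict.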
Consider the case in which root is the latest automatic update of a record (e.g. from crawling some metadata source), head contains manual edits of root, and update is a new automatic update. Here we might want to preserve only the entities in update but notify the user in case some manual addition was removed. This is what the following option does:
UnifierOps.KEEP_UPDATE_ENTITIES_CONFLICT_ON_HEAD_DELETE
>>> root = {'badges': ['bad', 'random']}
>>> head = {'badges': ['cool', 'nice', 'random']}
>>> update = {'badges': ['fun', 'nice', 'healthy']}
>>> m = Merger(root, head, update, DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_UPDATE_ENTITIES_CONFLICT_ON_HEAD_DELETE)
>>> try:
... m.merge()
... except MergeError:
... pass
>>> m.merged_root['badges'] == ['fun', 'nice', 'healthy']
True
>>> m.conflicts == [Conflict(ConflictType.ADD_BACK_TO_HEAD,
... ('badges', ), 'cool')]
True
In this case, only 'cool' was added “manually” and removed by the update.
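The rule behind this option can be sketched as a pair of list filters. The helper below is hypothetical and only mirrors the behaviour described in this section for lists of plain values; it is not how json_merger implements it.

```python
def keep_update_report_head_deletes(root, head, update):
    """Keep update's entities and report manual additions that were lost.

    A "manual addition" is an entity present in head but not in root.
    Returns (merged, lost_additions). Illustrative helper only.
    """
    added_in_head = [x for x in head if x not in root]
    lost_additions = [x for x in added_in_head if x not in update]
    return list(update), lost_additions


merged_badges, lost = keep_update_report_head_deletes(
    ['bad', 'random'],
    ['cool', 'nice', 'random'],
    ['fun', 'nice', 'healthy'],
)
```

Here 'nice' was also added in head, but since update still contains it, nothing is reported for it.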
Merging Lists Of Objects
Assume the most complex case in which we need to merge lists of objects which can also contain nested lists.
>>> root = {
... 'people': [
... {'name': 'John', 'age': 13},
... {'name': 'Peter'},
... {'name': 'Max'}
... ]}
>>> head = {
... 'people': [
... {'name': 'John', 'age': 14,
... 'group': {'id': 'grp01'},
... 'person_id': '42',
... 'badges': [{'t': 'b0', 'e': True}, {'t': 'b1'}, {'t': 'b2'}]},
... {'name': 'Peter', 'age': 15,
... 'badges': [{'t': 'b0'}, {'t': 'b1'}, {'t': 'b2'}]},
... {'name': 'Max', 'age': 16}
... ]}
>>> update = {
... 'people': [
... {'name': 'Max', 'address': 'work'},
... {'name': 'Johnnie', 'address': 'home',
... 'group': {'id': 'GRP01'},
... 'person_id': '42',
... 'age': 15,
... 'badges': [{'t': 'b1'}, {'t': 'b2'}, {'t': 'b0', 'e': False}]},
... ]}
First of all we would like to define when two person records represent the same entity. In this demo data model we can say that two records represent the same person if any of the following is true:
- They have the same name
- They have the same lowercased group id AND the same person_id
Then we define two badges as equal if they have the same t attribute.
In order to define a custom way of linking records you can add comparator classes for any of the list fields via the comparators keyword argument.
To define a simple comparison that checks field equality you can use json_merger.comparator.PrimaryKeyComparator.
In this case the rules from above look like this:
>>> from json_merger.comparator import PrimaryKeyComparator
>>> class PersonComparator(PrimaryKeyComparator):
... primary_key_fields = ['name', ['group.id', 'person_id']]
... normalization_functions = {'group.id': str.lower}
>>> class BadgesComparator(PrimaryKeyComparator):
... primary_key_fields = ['t']
Note
You need to use a comparator class and not a comparator instance when defining the equality of two objects.
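To spell out what the matching rule for people means, here is a standalone sketch of the check in plain Python. The helpers get_path and records_match are assumptions made for this illustration and do not reproduce PrimaryKeyComparator's code; in particular, treating a missing field as "no match" is a choice of this sketch.

```python
def get_path(record, dotted_path):
    """Follow a dotted path such as 'group.id'; return None if it is absent."""
    value = record
    for part in dotted_path.split('.'):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def records_match(a, b, key_groups, normalizers=None):
    """True if, for any key group, all its fields are present and equal."""
    normalizers = normalizers or {}
    for group in key_groups:
        fields = [group] if isinstance(group, str) else group
        if all(_field_equal(a, b, f, normalizers) for f in fields):
            return True
    return False


def _field_equal(a, b, field, normalizers):
    va, vb = get_path(a, field), get_path(b, field)
    if va is None or vb is None:
        return False  # a missing field never counts as a match here
    normalize = normalizers.get(field, lambda v: v)
    return normalize(va) == normalize(vb)


key_groups = ['name', ['group.id', 'person_id']]
normalizers = {'group.id': str.lower}

john = {'name': 'John', 'group': {'id': 'grp01'}, 'person_id': '42'}
johnnie = {'name': 'Johnnie', 'group': {'id': 'GRP01'}, 'person_id': '42'}
peter = {'name': 'Peter', 'age': 15}
```

John and Johnnie are linked through the second rule (same lowercased group id and same person_id) even though their names differ, while Peter matches neither.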
Next we would like to define how to do the merging:
- In case of conflict keep head values.
- For every list try to keep only the update entities.
- For the badges list keep both entities with priority to the update values.
>>> comparators = {'people': PersonComparator,
... 'people.badges': BadgesComparator}
>>> list_merge_ops = {
... 'people.badges': UnifierOps.KEEP_UPDATE_AND_HEAD_ENTITIES_UPDATE_FIRST
... }
>>> m = Merger(root, head, update,
... DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_ONLY_UPDATE_ENTITIES,
... comparators=comparators,
... list_merge_ops=list_merge_ops)
>>> try:
... m.merge()
... except MergeError:
... pass
>>> m.merged_root == {
... 'people': [
... {'name': 'Max', 'address': 'work', 'age': 16},
... {'name': 'Johnnie', # Only update edited it.
... 'address': 'home',
... 'group': {'id': 'grp01'}, # From KEEP_HEAD
... 'person_id': '42',
... 'age': 14, # From KEEP_HEAD
... 'badges': [{'t': 'b1'}, {'t': 'b2'},
... {'t': 'b0', 'e': True}], # From KEEP_HEAD
... },
... ]}
True
Merging Data Lists
If you want to merge arrays of raw data (that do not encode any entities), you can use the data_lists keyword argument. Lists named in this argument are merged by treating list indices as dictionary keys.
>>> root = {'f': {'matrix': [[0, 0], [0, 0]]}}
>>> head = {'f': {'matrix': [[1, 1], [0, 0]]}}
>>> update = {'f': {'matrix': [[0, 0], [1, 1]]}}
>>> m = Merger(root, head, update,
... DictMergerOps.FALLBACK_KEEP_HEAD,
... UnifierOps.KEEP_ONLY_UPDATE_ENTITIES,
... data_lists=['f.matrix'])
>>> m.merge()
>>> m.merged_root == {'f': {'matrix': [[1, 1], [1, 1]]}}
True
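Treating indices as keys means each position is merged on its own. The following sketch shows the idea for nested lists of equal shape; merge_data is a hypothetical helper, it assumes root, head and update have identical dimensions, and it keeps the head value when both sides changed a cell.

```python
def merge_data(root, head, update):
    """Position-wise three-way merge of equally shaped nested lists."""
    if all(isinstance(v, list) for v in (root, head, update)):
        # Recurse position by position, as if each index were a dict key.
        return [merge_data(r, h, u) for r, h, u in zip(root, head, update)]
    if head == root:
        return update   # head left this cell alone, so take update's value
    return head         # head changed it (keep head, also on conflict)


matrix = merge_data(
    [[0, 0], [0, 0]],
    [[1, 1], [0, 0]],
    [[0, 0], [1, 1]],
)
```

The first row comes from head and the second from update, because each side changed a different set of positions.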
Extending Comparators
The internal API uses classes that extend json_merger.comparator.BaseComparator in order to check the semantic equality of JSON objects. The internals call the get_matches method, which is implemented in terms of the equal method. The simplest way to extend this class is to override the equal method.
>>> from json_merger.comparator import BaseComparator
>>> class CustomComparator(BaseComparator):
... def equal(self, obj1, obj2):
... return abs(obj1 - obj2) < 0.2
>>> comp = CustomComparator([1, 2], [1, 2, 1.1])
>>> comp.get_matches('l1', 0) # elements matching l1[0] from l2
[(0, 1), (2, 1.1)]
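Conceptually, matching one element against the other list with an equality predicate amounts to a filtered enumeration. The sketch below is an assumption about the idea only, written without json_merger; matches_for and close_enough are names invented here.

```python
def close_enough(a, b):
    """Same predicate as in the comparator example: values within 0.2."""
    return abs(a - b) < 0.2


def matches_for(l1, l2, index, equal):
    """Return (position, element) pairs of l2 that `equal` links to l1[index]."""
    return [(j, other) for j, other in enumerate(l2) if equal(l1[index], other)]


linked = matches_for([1, 2], [1, 2, 1.1], 0, close_enough)
```

The result has the same shape as the get_matches output above: positions in the second list paired with the matching elements.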
If you want to implement another type of assignment you can compute all the matches and store them in the matches set by overriding the process_lists method. You need to add pairs of matching indices between l1 and l2.
>>> from json_merger.comparator import BaseComparator
>>> class CustomComparator(BaseComparator):
... def process_lists(self):
... self.matches.add((0, 0))
... self.matches.add((0, 1))
>>> comp = CustomComparator(['foo', 'bar'], ['bar', 'foo'])
>>> comp.get_matches('l1', 0) # elements matching l1[0] from l2
[(0, 'bar'), (1, 'foo')]
[contrib] Distance Function Matching
To implement fuzzy matching we also allow matching by using a distance function. This ensures a 1:1 mapping between the entities by minimizing the total distance between all linked entities. To mark two of them as equal you can provide a threshold for that distance (this is why it’s best to normalize it between 0 and 1). To speed up the matching you can also hint at possible matches by bucketing matching elements using a normalization function. In the next example we match some points in the coordinate system, each of them lying in a specific square. The distance that we are going to use is the Euclidean distance, and we normalize the points to their integer counterparts.
>>> from json_merger.contrib.inspirehep.comparators import (
... DistanceFunctionComparator)
>>> from math import sqrt
>>> class PointComparator(DistanceFunctionComparator):
... distance_function = lambda p1, p2: sqrt((p1[0] - p2[0]) ** 2 +
... (p1[1] - p2[1]) ** 2)
... normalization_functions = [lambda p: (int(p[0]), int(p[1]))]
... threshold = 0.5
>>> l1 = [(1.1, 1.1), (1.2, 1.2), (2.1, 2.1)]
>>> l2 = [(1.11, 1.11), (1.25, 1.25), (2.15, 2.15)]
>>> comp = PointComparator(l1, l2)
>>> comp.get_matches('l1', 0) # elements matching l1[0] from l2
[(0, (1.11, 1.11))]
>>> # match only the closest element, not everything under threshold.
>>> comp.get_matches('l1', 1)
[(1, (1.25, 1.25))]
>>> comp.get_matches('l1', 2)
[(2, (2.15, 2.15))]
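The "minimal total distance, 1:1 mapping" idea can be illustrated with a brute-force search over permutations. This is a sketch under stated assumptions: both lists have the same length and are small, and best_assignment is a hypothetical function; the contrib comparator may use a different and more efficient assignment algorithm.

```python
from itertools import permutations
from math import hypot


def euclidean(p, q):
    """Euclidean distance between two 2D points."""
    return hypot(p[0] - q[0], p[1] - q[1])


def best_assignment(l1, l2, distance, threshold):
    """Pair l1[i] with l2[j] so that the summed distance is minimal.

    Tries every permutation, so it is only suitable for tiny lists of
    equal length. Pairs at or above `threshold` are dropped afterwards.
    """
    def total(perm):
        return sum(distance(l1[i], l2[j]) for i, j in enumerate(perm))

    best = min(permutations(range(len(l2))), key=total)
    return [(i, j) for i, j in enumerate(best)
            if distance(l1[i], l2[j]) < threshold]


l1 = [(1.1, 1.1), (1.2, 1.2), (2.1, 2.1)]
l2 = [(1.11, 1.11), (1.25, 1.25), (2.15, 2.15)]
pairs = best_assignment(l1, l2, euclidean, 0.5)
```

Although (1.2, 1.2) is within the threshold of both (1.11, 1.11) and (1.25, 1.25), the global minimum links each point to exactly one partner, which is the behaviour shown in the doctest above.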
[contrib] Custom Person Name Distance
We also provide a person name distance based on edit distance normalized between 0 and 1. You just need to provide a function for tokenizing a full name into NameToken or NameInitial (see simple_tokenize in the contrib directory). This distance function matches initials with full regular tokens and works with any name permutation. The distance calculator also assumes the full name is inside the full_name field of a dictionary. If you have the name in a different field you can override the class and call super on objects having the name in the full_name field.
>>> from json_merger.contrib.inspirehep.author_util import (
... AuthorNameDistanceCalculator, simple_tokenize)
>>> dst = AuthorNameDistanceCalculator(tokenize_function=simple_tokenize)
>>> dst({'full_name': u'Doe, J.'}, {'full_name': u'John, Doe'}) < 0.1
True
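As background for "edit distance normalized between 0 and 1", here is a generic sketch on plain strings: the Levenshtein distance divided by the length of the longer string. It is not the AuthorNameDistanceCalculator logic, which additionally works on name tokens, initials and permutations; the function names are chosen for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute if needed
            ))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Scale the edit distance into [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0
```

A normalized value makes a single threshold meaningful for both short and long names, which is why the threshold-based comparator benefits from it.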
We also have functions for normalizing an author name with different heuristics, to speed up the distance function matching.
>>> from json_merger.contrib.inspirehep.author_util import (
... AuthorNameNormalizer)
>>> identity = AuthorNameNormalizer(simple_tokenize)
>>> identity({'full_name': 'Doe, Johnny Peter'})
('doe', 'johnny', 'peter')
>>> one_fst_name = AuthorNameNormalizer(simple_tokenize,
... first_names_number=1)
>>> one_fst_name({'full_name': 'Doe, Johnny Peter'})
('doe', 'johnny')
>>> last_name_one_initial = AuthorNameNormalizer(simple_tokenize,
... first_names_number=1,
... first_name_to_initial=True)
>>> last_name_one_initial({'full_name': 'Doe, Johnny Peter'})
('doe', 'j')
These instances can be used as class parameters for DistanceFunctionComparator.
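Why a normalizer speeds things up can be shown with a small bucketing sketch: only records that share a normalized key need to be compared with the distance function. Everything below is illustrative; bucket_candidates and last_name_and_initial are hypothetical helpers that assume names in "Last, First" form, and they do not use the contrib classes.

```python
from collections import defaultdict


def last_name_and_initial(record):
    """Hypothetical normalizer: ('doe', 'j') for 'Doe, Johnny Peter'."""
    last, _, first = record['full_name'].partition(',')
    return (last.strip().lower(), first.strip()[:1].lower())


def bucket_candidates(l1, l2, normalize):
    """Map each index of l1 to the indices of l2 with the same normalized key."""
    buckets = defaultdict(list)
    for j, item in enumerate(l2):
        buckets[normalize(item)].append(j)
    return {i: buckets.get(normalize(item), []) for i, item in enumerate(l1)}


authors_head = [{'full_name': 'Doe, Johnny Peter'}, {'full_name': 'Smith, Anna'}]
authors_update = [{'full_name': 'Smith, A.'}, {'full_name': 'Doe, J.'},
                  {'full_name': 'Doe, Jane'}]
candidates = bucket_candidates(authors_head, authors_update, last_name_and_initial)
```

Bucketing only narrows the candidates: 'Doe, Jane' still lands in the same bucket as 'Doe, Johnny Peter', so a distance function with a threshold is needed to make the final decision.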