
I would like to compare the run results of my pipeline by diffing JSON outputs that share the same schema but contain different data.

Run1 JSON

{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
{"doc_id": 1, "entity": "New York", "start": 30, "end": 38} # Missing from Run2
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}

Run2 JSON

{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7} # same as in Run1
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10} # different end span
{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15} # added in Run2, not in Run1

Based on the answer to How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?, my approach has been to build a composite key — a tuple of some of the JSON values — and then co-group the two runs on that key.

Is there a better way to diff jsons with beam?
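To make the intent concrete, here is a minimal plain-Python sketch of the same classification (unchanged / added / removed / changed) applied to the sample records above. This is not Beam code, just the key-and-group logic that the pipeline below is meant to reproduce:

```python
import json
from collections import defaultdict

run1 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 1, "entity": "New York", "start": 30, "end": 38}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}',
]
run2 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}',
    '{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15}',
]

def keyed(lines):
    # Group records under their (doc_id, entity) composite key.
    out = defaultdict(list)
    for line in lines:
        rec = json.loads(line)
        out[(rec["doc_id"], rec["entity"])].append(rec)
    return out

a, b = keyed(run1), keyed(run2)
diff = {}
for key in set(a) | set(b):
    va, vb = a.get(key, []), b.get(key, [])
    if va == vb:
        diff[key] = "unchanged"
    elif len(va) < len(vb):
        diff[key] = "added"
    elif len(va) > len(vb):
        diff[key] = "removed"
    else:
        diff[key] = "changed"

# diff[(1, "New York")] == "removed", diff[(2, "Istanbul")] == "changed"
```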

Code based on linked answer:

import json

import apache_beam as beam
from apache_beam import pvalue


def make_kv_pair(x):
    """Output the record keyed by its (doc_id, entity) tuple."""
    if x and isinstance(x, str):  # `basestring` only exists in Python 2
        x = json.loads(x)
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)
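For reference, a quick check of what `make_kv_pair` produces for one of the sample lines (the function is copied here so the snippet is self-contained):

```python
import json

def make_kv_pair(x):
    """Key each record by its (doc_id, entity) tuple."""
    if x and isinstance(x, str):
        x = json.loads(x)
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)

record = '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}'
key, value = make_kv_pair(record)
# key == (1, "Anthony")
```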


class FilterDoFn(beam.DoFn):
    def process(self, element):
        key, values = element  # tuple unpacking in the signature is Python 2 only
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        else:  # same number of records, but contents differ
            yield pvalue.TaggedOutput('changed', key)

Pipeline code:

from apache_beam.io import ReadFromText, WriteToText

table_a = (p | 'ReadJSONRun1' >> ReadFromText("run1.json")
           | 'SetKeysRun1' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadJSONRun2' >> ReadFromText("run2.json")
           | 'SetKeysRun2' >> beam.Map(make_kv_pair))

joined_tables = ({'table_a': table_a, 'table_b': table_b}
                 | beam.CoGroupByKey())

output_types = ['changed', 'added', 'removed', 'unchanged']

key_collections = (joined_tables
                   | beam.ParDo(FilterDoFn()).with_outputs(*output_types))

# Now you can handle each output
key_collections.unchanged | "WriteUnchanged" >> WriteToText("unchanged/", file_name_suffix="_unchanged.json.gz")
key_collections.changed | "WriteChanged" >> WriteToText("changed/", file_name_suffix="_changed.json.gz")
key_collections.added | "WriteAdded" >> WriteToText("added/", file_name_suffix="_added.json.gz")
key_collections.removed | "WriteRemoved" >> WriteToText("removed/", file_name_suffix="_removed.json.gz")
swartchris8
    A couple things: 1) in `FilterDoFn` are you using the `removed` tag instead of `deleted`? 2) to detect changes you should use as key `doc_id` + `entity` tuple like in the other example. If you incorporate `start` and `end` as part of the key and they are different between JSON runs those events won't be grouped together. – Guillem Xercavins May 01 '19 at 14:42
  • Thanks Guillem incorporated your suggestions in the question. I updated the composite key to be `doc_id` + `entity` and made sure `removed` tag is used consistently which fixed the issue I was having with changes and removed elements. – swartchris8 May 01 '19 at 15:05
