I would like to compare the results of two runs of my pipeline by diffing JSON records that share the same schema but contain different data.
Run1 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
{"doc_id": 1, "entity": "New York", "start": 30, "end": 38} # Missing from Run2
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
Run2 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7} # same as in Run1
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10} # different end span
{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15} # added in Run2, not in Run1
Based on the answer to "How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?", my approach has been to build a tuple from some of the JSON values and then cogroup the two runs on this large composite key.
Is there a better way to diff JSONs with Beam?
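To make the composite-key idea concrete, here is a minimal plain-Python sketch of what the cogrouped input looks like for the sample records above. It does not use Beam at all; `cogroup`, `KEY_FIELDS`, `RUN1` and `RUN2` are hypothetical names used only for this illustration, and the in-memory grouping merely stands in for what `CoGroupByKey` does across workers.

```python
import json
from collections import defaultdict

# Sample records from the two runs, one JSON document per line.
RUN1 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 1, "entity": "New York", "start": 30, "end": 38}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}',
]
RUN2 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}',
    '{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15}',
]

# Fields that make up the composite key (same choice as in my pipeline).
KEY_FIELDS = ("doc_id", "entity")


def cogroup(lines_a, lines_b):
    """Hypothetical in-memory stand-in for CoGroupByKey on two tagged inputs."""
    groups = defaultdict(lambda: {"table_a": [], "table_b": []})
    for tag, lines in (("table_a", lines_a), ("table_b", lines_b)):
        for line in lines:
            record = json.loads(line)
            key = tuple(record[field] for field in KEY_FIELDS)
            groups[key][tag].append(record)
    return dict(groups)


groups = cogroup(RUN1, RUN2)
for key, values in sorted(groups.items()):
    # Print the key and how many records each run contributed to it.
    print(key, len(values["table_a"]), len(values["table_b"]))
```

For the sample data this prints one line per key: Anthony and Istanbul have one record on each side, New York has a record only in Run1, and Karim only in Run2. Note that with `("doc_id", "entity")` as the key, the same entity appearing twice in one document would land in the same group.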
Code based on linked answer:
import json

import apache_beam as beam
from apache_beam import pvalue


def make_kv_pair(x):
    """Output the record keyed by its (doc_id, entity) tuple."""
    # Lines read from text arrive as strings; parse them into dicts first.
    if x and isinstance(x, str):
        x = json.loads(x)
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)


class FilterDoFn(beam.DoFn):
    def process(self, element):
        # element is (key, {'table_a': [...], 'table_b': [...]}) from CoGroupByKey.
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        else:
            # Same number of records on both sides, but different contents.
            yield pvalue.TaggedOutput('changed', key)
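One way to check the branching in `FilterDoFn` without running a pipeline is to pull it into a plain function. The sketch below is an assumption-laden variant, not the linked answer's code: `classify` and `canonical` are hypothetical names, and the comparison is made order-insensitive by serialising each record with sorted keys, because as far as I know Beam does not guarantee the order of values grouped under a key.

```python
import json


def canonical(values):
    """Order-insensitive form of a group: sorted JSON strings with sorted keys."""
    return sorted(json.dumps(value, sort_keys=True) for value in values)


def classify(table_a_values, table_b_values):
    """Return 'unchanged', 'added', 'removed' or 'changed' for one key.

    Uses the same branch order as FilterDoFn, but compares canonical forms.
    """
    a = canonical(table_a_values)
    b = canonical(table_b_values)
    if a == b:
        return "unchanged"
    if len(a) < len(b):
        return "added"
    if len(a) > len(b):
        return "removed"
    return "changed"


# Records taken from the sample runs in the question.
same = {"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
old = {"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
new = {"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}

print(classify([same], [same]))
print(classify([old], [new]))
print(classify([old], []))
print(classify([], [new]))
```

A DoFn could then call such a function and yield `pvalue.TaggedOutput(classify(...), key)`. The length-based branches still have the same limitation as the original: a key with one record in Run1 and two in Run2 is reported as added even if the existing record also changed.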
Pipeline code:
from apache_beam.io import ReadFromText, WriteToText

with beam.Pipeline() as p:
    table_a = (p | 'ReadJSONRun1' >> ReadFromText("run1.json")
                 | 'SetKeysRun1' >> beam.Map(make_kv_pair))
    table_b = (p | 'ReadJSONRun2' >> ReadFromText("run2.json")
                 | 'SetKeysRun2' >> beam.Map(make_kv_pair))
    joined_tables = ({'table_a': table_a, 'table_b': table_b}
                     | beam.CoGroupByKey())
    output_types = ['changed', 'added', 'removed', 'unchanged']
    key_collections = (joined_tables
                       | beam.ParDo(FilterDoFn()).with_outputs(*output_types))
    # Now you can handle each output
    key_collections.unchanged | "WriteUnchanged" >> WriteToText(
        "unchanged/", file_name_suffix="_unchanged.json.gz")
    key_collections.changed | "WriteChanged" >> WriteToText(
        "changed/", file_name_suffix="_changed.json.gz")
    key_collections.added | "WriteAdded" >> WriteToText(
        "added/", file_name_suffix="_added.json.gz")
    key_collections.removed | "WriteRemoved" >> WriteToText(
        "removed/", file_name_suffix="_removed.json.gz")
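The outputs above contain only keys, so for the changed keys they do not say which field differs. A minimal sketch of a per-field comparison follows, assuming each changed key has exactly one record per run; `field_diff` is a hypothetical helper, not part of Beam, and it treats a field missing on one side as `None`.

```python
def field_diff(old, new):
    """Map each differing field to an (old_value, new_value) pair.

    A field that is absent on one side shows up as None on that side.
    """
    return {
        field: (old.get(field), new.get(field))
        for field in sorted(set(old) | set(new))
        if old.get(field) != new.get(field)
    }


# The Istanbul record from Run1 and Run2 in the question.
run1_record = {"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
run2_record = {"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}

changes = field_diff(run1_record, run2_record)
print(changes)
```

For the Istanbul record only `end` differs (8 in Run1, 10 in Run2). In the pipeline, something like this could be yielded alongside the key in the changed output instead of the bare key.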