Merging JSON files based on multiple keys at scale

Asked May 27 '20 at 14:15

Active May 27 '20 at 18:44

I am looking to aggregate a large number of JSON files stored in S3 buckets, based on the relationship between several of their keys. Each bucket contains files with a different schema. There are too many files to loop through with a Python or batch script.

For example, if bucket 1 has files with schema1 and bucket 2 has files with schema2, I would like to aggregate files based on the following logic:

(schema1.key1 == schema2.key2 && schema2.key3 > schema1.key4)
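
To make the condition concrete, here is a minimal single-machine sketch in plain Python. It is only an illustration of the matching logic, not something that would work at the scale described: the function name `merge_records`, the output layout, and the sample records are all hypothetical, and it assumes each file has already been parsed into a dict.

```python
from collections import defaultdict


def merge_records(schema1_records, schema2_records):
    """Pair records where r1["key1"] == r2["key2"] and r2["key3"] > r1["key4"]."""
    # Index schema2 records by the equality key, so each schema1 record
    # is only compared against candidates that share that key.
    by_key2 = defaultdict(list)
    for r2 in schema2_records:
        by_key2[r2["key2"]].append(r2)

    merged = []
    for r1 in schema1_records:
        for r2 in by_key2.get(r1["key1"], []):
            # Inequality part of the condition, checked within the key group.
            if r2["key3"] > r1["key4"]:
                merged.append({"schema1": r1, "schema2": r2})
    return merged


# Hypothetical sample data, standing in for parsed JSON files.
schema1_records = [
    {"key1": "a", "key4": 10},
    {"key1": "b", "key4": 5},
]
schema2_records = [
    {"key2": "a", "key3": 15},
    {"key2": "a", "key3": 7},
    {"key2": "b", "key3": 5},
]

result = merge_records(schema1_records, schema2_records)
```

The equality part of the condition is what narrows the candidates; the inequality is then only evaluated between records that already share a key.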

I was considering Spark, but was unable to find documentation on aggregating by keys using comparisons other than equality. Is Spark the best solution, or is there a better one I should be using?

Any advice would be greatly appreciated.

edited May 27 '20 at 18:44

user4157124

asked May 27 '20 at 14:15

JohnSmith

There is this [solution](https://stackoverflow.com/a/39431957/5594180) from some time ago. Could it be what you need? – Saša Zejnilović May 27 '20 at 19:10
Thank you. That is a big help. I believe I can join on the result of a UDF, which will work for my use case. – JohnSmith May 27 '20 at 20:20

0 Answers