I am looking to aggregate a large number of JSON files stored in S3 buckets, based on the relationship between several of their keys. Each bucket contains files with a different schema. There are too many files to loop through with a Python or batch script.
For example, if bucket 1 has files with schema1 and bucket 2 has files with schema2, I would like to aggregate files based on the following condition:
(schema1.key1 == schema2.key2 && schema2.key3 > schema1.key4)
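To make the intended matching concrete, here is a minimal pure-Python sketch on a few in-memory records. It only illustrates the semantics on toy data and is not meant to work at the real scale. The sample values and the function name `match_records` are made up for illustration; the key names follow the condition above.

```python
from collections import defaultdict

def match_records(schema1_records, schema2_records):
    """Pair records where s1.key1 == s2.key2 and s2.key3 > s1.key4."""
    # Index schema1 records by key1 so the equality part is a hash lookup.
    by_key1 = defaultdict(list)
    for r1 in schema1_records:
        by_key1[r1["key1"]].append(r1)

    matches = []
    for r2 in schema2_records:
        # Only candidates with an equal key are compared on the inequality.
        for r1 in by_key1.get(r2["key2"], []):
            if r2["key3"] > r1["key4"]:
                matches.append((r1, r2))
    return matches

# Hypothetical sample records standing in for files from the two buckets.
bucket1 = [{"key1": "a", "key4": 10}, {"key1": "b", "key4": 5}]
bucket2 = [
    {"key2": "a", "key3": 15},
    {"key2": "a", "key3": 3},
    {"key2": "b", "key3": 5},
]
pairs = match_records(bucket1, bucket2)
```

In other words, the equality part selects candidate pairs and the inequality then filters them; what I am unsure about is how to express this efficiently across the full set of files.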
I was considering Spark, but was unable to find documentation on joining or aggregating by keys using comparisons other than equality. Is Spark the best fit for this, or is there a better tool I should be using?
Any advice would be greatly appreciated.