I am exploring a use case along these lines: we have a huge dataset (50B records) spread across files, each file holds around 50M records, and each record has a unique identifier. A record present in file 10 may also be present in file 100, in which case the latest state of that record is the one in file 100. The files sit in AWS S3.
Now, say around 1B of the 50B records need reprocessing. Once reprocessing completes, we need to identify every file that contains any of these 1B records and replace the content of those files for those 1B unique ids.
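To make the replacement step concrete, here is a minimal sketch, assuming the files are Parquet, the identifier column is named `unique_id`, and Spark is available (all names and paths are illustrative, not our actual setup):

```python
# Illustrative sketch: rewrite one affected file so that records whose
# unique_id was reprocessed are replaced by their latest versions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rewrite-affected-file").getOrCreate()

def rewrite_file(file_path: str, staging_path: str, reprocessed_df) -> None:
    original = spark.read.parquet(file_path)

    # Records in this file that were NOT reprocessed stay as they are.
    untouched = original.join(reprocessed_df, on="unique_id", how="left_anti")

    # Reprocessed records that actually belong to this file.
    replacements = reprocessed_df.join(
        original.select("unique_id"), on="unique_id", how="left_semi"
    )

    # Write to a staging location first; swapping the staging output over
    # the original S3 key would be a separate copy/rename step.
    untouched.unionByName(replacements).write.mode("overwrite").parquet(staging_path)
```

Since each file can be rewritten independently, a function like this could be fanned out across all affected files (one task per file), which is the kind of parallelism the deadline below requires.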
Challenges: right now, we don't have a mapping that tells us which file contains which unique ids. And the whole file-replacement job needs to complete within one day, which means we need parallel execution.
We have already started a task to maintain a file-to-unique-id mapping. While processing the 1B records, we need to load this mapping, look up the reprocessed ids in it, and identify all the distinct file dates for which content replacement is required.
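For example, assuming the mapping lives as Parquet on S3 with columns `unique_id` and `file_date` (column names and paths are assumptions for illustration), the lookup could be a distributed semi-join:

```python
# Illustrative sketch: find the distinct file dates that contain any of
# the ~1B reprocessed ids, via a distributed semi-join.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("find-affected-files").getOrCreate()

# ~1B reprocessed ids; the column name `unique_id` is an assumption.
reprocessed_ids = spark.read.parquet("s3://my-bucket/reprocessed-ids/")

# 50B-row mapping; columns `unique_id` and `file_date` are assumptions.
mapping = spark.read.parquet("s3://my-bucket/id-to-file-mapping/")

affected_file_dates = (
    mapping
    .join(reprocessed_ids.select("unique_id"), on="unique_id", how="left_semi")
    .select("file_date")
    .distinct()
)

affected_file_dates.write.mode("overwrite").parquet("s3://my-bucket/affected-file-dates/")
```

Partitioning or bucketing the mapping on `unique_id` would presumably help this join keep scaling as the mapping grows.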
The mapping itself will be huge, since it has to cover all 50B records, and it will keep growing because this is a growing system.
Any thoughts on this?