
I am trying to explore a use case like this: we have a huge amount of data (50B records) in files, each file has around 50M records, and each record has a unique identifier. It is possible that a record present in file 10 is also present in file 100, but the latest state of that record is the one in file 100. The files sit in AWS S3.

Now let's say around 1B of the 50B records need reprocessing. Once reprocessing is complete, we need to identify all the files that have ever contained any of these 1B records and replace the content of those files for these 1B unique IDs.

Challenges: right now we don't have a mapping that tells us which unique IDs each file contains, and the whole file replacement needs to complete within one day, which means we need parallel execution.

We have already initiated a task to maintain a mapping from file to unique IDs. While processing the 1B records we need to load this mapping, look each record up in it, and identify all the distinct file dates for which content replacement is required.

The mapping will be huge, because it has to hold 50B records, and it will keep growing because this is a growing system.

Any thoughts on this?

Krish
  • If a record with the same Unique ID can exist in multiple files, how do you know which entry is the one you want to use? Is there something else (eg a date field) that can identify the latest record, and is that the one you want to keep? What is the format of your files? CSV? Compressed? Flat or partitioned into directories? Would you be okay if the result was a completely new set of files, rather than editing the existing files? This might be suitable for Amazon Athena or Amazon EMR, depending upon the answers. – John Rotenstein Sep 05 '17 at 06:07
  • The bad part is that each record in a file does not hold an updated date; only the file name has that latest date. That is why we have initiated a task to create the mapping (unique ID, file date) to serve this requirement. The files are delimited text, compressed (gz). Unfortunately the option of creating new files was ruled out, so we are left with the file-replacement option. – Krish Sep 05 '17 at 07:02
  • You haven't said anything about how you're going to use the data. Many choices will depend upon what you intend to do with the data after you have identified unique records. – John Rotenstein Sep 05 '17 at 07:23
  • The other option I have in mind is, irrespective of the updated volume, to take each file in S3 and check it for the updated unique IDs; if we find a match, that file is eligible for replacement, otherwise we ignore it. Not sure which is the best practice, but I am open to any better solution. – Krish Sep 05 '17 at 07:27
  • We maintain this data in Redshift as well, and that always holds the latest data. We use Redshift for querying. – Krish Sep 05 '17 at 07:34
  • If you already have the data in Redshift, what are you trying to achieve by manipulating the files in Amazon S3? You can always export the latest data from Redshift when needed. – John Rotenstein Sep 05 '17 at 07:44
  • We always maintain the latest data in Redshift, and we don't know in how many files, or in which files, the respective records are present. If I want to replace some day's file in S3, I need to update only the records that changed in the current run; all other records need to keep their old state. So the final outcome should be: if a customer takes the latest file from S3 now, he should be able to reconcile all records against the old file (if he maintains it in his system), and for the updated records he should see the difference. All other records should be as-is. – Krish Sep 05 '17 at 08:26

1 Answer


You will likely need to write a custom script that will ETL all your files.

Tools such as Amazon EMR (Hadoop) and Amazon Athena (Presto) would be excellent for processing the data in the files. However, your requirement to identify the latest version of data based upon filename is not compatible with the way these tools would normally process data. (They look inside the files, not at the filenames.)

If the records merely had an additional timestamp field, then it would be rather simple for either EMR or Presto to read the files and output a new set of files with only one record for each unique ID (with the latest date).

Rather than creating a system to look up unique IDs in files, you should have your system output a timestamp. This way, the data is not tied to a specific file and can easily be loaded and transformed based upon the contents of the file.

I would suggest:

  • Process each existing file (yes, I know you have a lot!) and add a column that represents the filename
  • Once you have a new set of input files with the filename column (that acts to identify the latest record), use Amazon Athena to read all records and output one row per unique ID (with the latest date). This would be a normal SELECT... GROUP BY statement with a little playing around to get only the latest record (see the sketch after this list).
  • Athena would output new files to Amazon S3, which will contain the data with unique records. These would then be the source records for any future processing you perform.
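A minimal sketch of those two steps, assuming PySpark on Amazon EMR; the bucket names, the pipe delimiter, the date pattern embedded in the file names, and the position of the unique-ID column are all assumptions for illustration, not details from the question:

```python
# Minimal PySpark sketch (assumed to run on EMR). Assumes gzipped, pipe-delimited
# text files under s3://my-bucket/data/ whose names contain a date such as
# records_2017-09-01.txt.gz, and that the unique identifier is the first column.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("latest-record-per-id").getOrCreate()

raw = (spark.read
       .option("sep", "|")            # assumed delimiter
       .option("header", "false")
       .csv("s3://my-bucket/data/*.gz"))

# Step 1: add a column derived from the source file name, which carries the date.
with_file = (raw
             .withColumn("source_file", F.input_file_name())
             .withColumn("file_date",
                         F.regexp_extract("source_file", r"(\d{4}-\d{2}-\d{2})", 1)))

# Step 2: keep only the latest record for each unique ID (_c0 is the first column).
w = Window.partitionBy("_c0").orderBy(F.col("file_date").desc())
latest = (with_file
          .withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") == 1)
          .drop("rn", "source_file"))

# Write the deduplicated records back to S3 as the new source data.
latest.write.mode("overwrite").csv("s3://my-bucket/deduped/",
                                   sep="|", compression="gzip")
```

The window function is just one way of "playing around to get only the latest record"; the same result could be had in Athena with a GROUP BY on the unique ID plus a join back on max(file_date).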
John Rotenstein
  • Yes, having a column like filedate for each record would have let us choose the best processing option, but unfortunately we don't have that. That is why, as a first step, we have initiated a task to maintain the mapping of IDs to filedate. Later that could go into the files as well, if the customer agrees. So the decision to include a new field is not ours to make. – Krish Sep 05 '17 at 07:40
  • As I said, if my record is repeated 10 times across 10 different files, I need to update the record in all 10 files, not just in the latest file; that is where the complexity comes from. – Krish Sep 05 '17 at 07:42
  • Can I use the Lambda concept here, triggered whenever the updated set is available in S3? I would execute a Lambda function for each file, compare based on the PKID, and if matches are found, I know I can update the data in the source file. Somehow I need to make sure that at any point in time no more than 10 files are processed in parallel; otherwise I need to do this sequentially, one file after another. – Krish Sep 12 '17 at 06:23
  • Probably best for you to ask a new question on this topic with all your details. Yes, you can trigger a Lambda function when a file is placed into S3. However, Lambda functions run for a maximum of 5 minutes and have local storage of only 500MB. A separate Lambda function will be triggered for each file placed into S3. – John Rotenstein Sep 12 '17 at 06:27
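A rough sketch of the Lambda idea from these comments, assuming Python with boto3; the bucket layout, the pipe delimiter, and the two helper functions (loading the reprocessed IDs and flagging a file for replacement) are hypothetical placeholders, not an agreed design:

```python
# Rough sketch of an S3-triggered Lambda that checks whether a file contains any
# of the reprocessed IDs. All names, key layouts and helpers are hypothetical.
import gzip
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # One invocation per object-created event (configured via S3 event notifications).
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        lines = gzip.decompress(body).decode("utf-8").splitlines()

        # Hypothetical: the set of reprocessed unique IDs, assumed small enough to load here.
        reprocessed_ids = load_reprocessed_ids()

        # Assumed pipe-delimited records with the unique ID in the first field.
        if any(line.split("|")[0] in reprocessed_ids for line in lines):
            flag_file_for_replacement(bucket, key)

def load_reprocessed_ids():
    # Placeholder: could come from Redshift, DynamoDB or a file in S3.
    return set()

def flag_file_for_replacement(bucket, key):
    # Placeholder: e.g. push the key to an SQS queue so a worker pool capped at
    # 10 concurrent jobs performs the actual file rewrite.
    print(f"{bucket}/{key} needs replacement")
```

Keeping to at most 10 files in parallel would typically be handled downstream of the trigger (for example, a queue consumed by a capped worker pool), rather than inside the Lambda itself.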