
I have a requirement to combine all the JSON files in my bucket into one newline-delimited JSON file for a third party to consume. However, each newly created newline-delimited JSON file must include only the files received in the last 24 hours, so that the same files are not picked up over and over again. Can this be done inside the s3.getObject(getParams, function(err, data) function? Any advice on a different approach is appreciated.

Thank you

panza

1 Answer


You could try the S3 ListObjects operation (listObjectsV2 in the JavaScript SDK) and filter the results by the LastModified field. For a new object, LastModified holds the time the object was created; for an object that has been overwritten, it holds the time of the last modification.

https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#listObjectsV2-property
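The filtering described above could look roughly like the sketch below. It assumes an AWS SDK v2 client (for example `new AWS.S3()`) is passed in as `s3`. The helper names `filterRecent`, `listRecentKeys` and `toNdjson` are made up for illustration and are not part of the SDK. Only `listObjectsV2` and its `Contents`, `LastModified`, `IsTruncated` and `NextContinuationToken` fields come from the SDK itself.

```javascript
// Hypothetical helper: keep only objects modified within the last `hours` hours.
function filterRecent(objects, now = new Date(), hours = 24) {
  const cutoff = now.getTime() - hours * 60 * 60 * 1000;
  return objects.filter(
    (obj) => new Date(obj.LastModified).getTime() >= cutoff
  );
}

// Hypothetical helper: page through listObjectsV2 and collect recent keys.
// `s3` is assumed to be an AWS SDK v2 S3 client.
async function listRecentKeys(s3, bucket, prefix = '', now = new Date()) {
  const keys = [];
  let token;
  do {
    const params = { Bucket: bucket, Prefix: prefix };
    if (token) params.ContinuationToken = token;
    const page = await s3.listObjectsV2(params).promise();
    for (const obj of filterRecent(page.Contents || [], now)) {
      keys.push(obj.Key);
    }
    // Each page returns a limited number of keys, so follow the token.
    token = page.IsTruncated ? page.NextContinuationToken : undefined;
  } while (token);
  return keys;
}

// Hypothetical helper: serialize parsed JSON documents as
// newline-delimited JSON, one document per line.
function toNdjson(docs) {
  return docs.map((doc) => JSON.stringify(doc)).join('\n') + '\n';
}
```

Each key returned by `listRecentKeys` can then be fetched with `s3.getObject`, parsed, and passed to `toNdjson`. Filtering happens on the client, so every key under the prefix is still listed on each run.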

There is a more complicated approach using Amazon Athena with AWS Glue, but it requires changing your S3 object keys so that they are split into partitions, with the date as the partition key. For example:

  • s3://bucket/reports/date=2019-08-28/report1.json
  • s3://bucket/reports/date=2019-08-28/report2.json
  • s3://bucket/reports/date=2019-08-28/report3.json
  • s3://bucket/reports/date=2019-08-29/report1.json
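A minimal sketch of how keys in this layout could be built when uploading. The helper name `partitionedKey` is hypothetical, and using the UTC date for the partition is an assumption:

```javascript
// Hypothetical helper: build a date-partitioned object key such as
// reports/date=2019-08-28/report1.json. The UTC date is assumed.
function partitionedKey(prefix, date, fileName) {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  return `${prefix}/date=${day}/${fileName}`;
}
```

For example, `partitionedKey('reports', new Date('2019-08-28T10:00:00Z'), 'report1.json')` gives `reports/date=2019-08-28/report1.json`.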

This approach can be implemented in two ways, depending on your file schema. If all your JSON files share the same format/properties/schema, you can create a Glue table, set the root reports path as the source for this table, add the date partition value (2019-08-28), and then query the data from Amazon Athena with a regular SELECT * FROM reports WHERE date='2019-08-28'. If they do not, create a Glue crawler with a JSON classifier, which will populate your tables, and then use Athena in the same way to query the data into a combined JSON file.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-legislators.html

MMS