I have an AWS Data Pipeline that exports my DynamoDB table to an S3 bucket so I can use the exported file with services like QuickSight, Athena and Forecast.

However, for the file in S3 to work with these services, I need it to be formatted as a CSV, like so:

date, journal, id
1589529457410, PLoS Genetics, 10.1371/journal.pgen.0030110
1589529457410, PLoS Genetics, 10.1371/journal.pgen.1000047

But instead, my exported file looks like this:

{"date":{"s":"1589529457410"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.0030110"}}
{"date":{"s":"1589833552714"},"journal":{"s":"PLoS Genetics"},"id":{"s":"10.1371/journal.pgen.1000047"}}

How can I specify the format of the exported file in S3 so that it works with services like QuickSight, Athena and Forecast? Preferably, I'd like to do the data transformation with Data Pipeline as well.

incnnu

1 Answer

Athena can read JSON data.
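As a minimal sketch of that option (the table name, database, and bucket paths below are placeholders, not taken from your setup), you could define an Athena table directly over the exported JSON. Each exported attribute is an object like `{"s": "value"}`, so it maps to a struct with a single string field:

```python
# Hypothetical example: create an Athena table over the DynamoDB export as-is.
# Table/database names and S3 locations are placeholders.
import boto3

athena = boto3.client("athena")

# Each exported attribute is wrapped as {"s": "..."}, so each column is a
# struct with one string field named "s".
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS articles (
  `date`  struct<s:string>,
  journal struct<s:string>,
  id      struct<s:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-export-bucket/dynamodb-export/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)

# Queries then unwrap the struct fields, e.g.:
#   SELECT "date".s, journal.s, id.s FROM articles;
```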

You can also use DynamoDB Streams to stream the data to S3. Here is a link to a blog post with best practices and design patterns for streaming data from DynamoDB to S3 to be used with Athena.

You can use DynamoDB Streams to trigger an AWS Lambda function, which can transform the data and store it in Amazon S3, Amazon Redshift, etc. With AWS Lambda you could also trigger Amazon Forecast to retrain, or pass the data to Amazon Forecast for a prediction. A rough sketch of such a function is below.
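This is only a sketch, assuming string attributes named date, journal and id, and an assumed TARGET_BUCKET environment variable and key prefix; a stream-triggered Lambda writing each batch of new items to S3 as CSV could look like this:

```python
# Hypothetical Lambda handler triggered by a DynamoDB stream.
# TARGET_BUCKET and the "exports/" key prefix are assumptions.
import csv
import io
import os
import uuid

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    rows = []
    for record in event.get("Records", []):
        # Stream records use uppercase type keys, e.g. {"date": {"S": "..."}}.
        image = record.get("dynamodb", {}).get("NewImage")
        if not image:
            continue  # skip REMOVE events
        rows.append([image["date"]["S"], image["journal"]["S"], image["id"]["S"]])

    if not rows:
        return

    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["date", "journal", "id"])
    writer.writerows(rows)

    s3.put_object(
        Bucket=os.environ["TARGET_BUCKET"],
        Key=f"exports/{uuid.uuid4()}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```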

Alternatively, you could keep using AWS Data Pipeline to write the data to an S3 bucket as you currently do, then use a scheduled CloudWatch Events rule or an S3 event notification to trigger a Lambda function. That function can transform the file and store it in another S3 bucket for further processing.
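A sketch of that second variant, assuming the Lambda is subscribed to the export bucket's ObjectCreated events and that an OUTPUT_BUCKET environment variable names the destination bucket:

```python
# Hypothetical Lambda handler triggered by an S3 event notification on the
# bucket Data Pipeline writes to. It converts the exported JSON lines
# ({"date":{"s":"..."}, ...}) to CSV and writes the result to another bucket.
import csv
import io
import json
import os
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
COLUMNS = ["date", "journal", "id"]

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(COLUMNS)
        for line in body.splitlines():
            if not line.strip():
                continue
            item = json.loads(line)
            # Each attribute is wrapped as {"s": "value"} in the export file.
            writer.writerow([item[col]["s"] for col in COLUMNS])

        s3.put_object(
            Bucket=os.environ["OUTPUT_BUCKET"],
            Key=key.rsplit("/", 1)[-1] + ".csv",
            Body=buf.getvalue().encode("utf-8"),
        )
```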

Jason
  • So I guess it's not possible to do any kind of data transformation with Data Pipeline then. Did I understand this correctly: I need to set up a Kinesis Firehose stream, specify the data transformation with a Lambda function, and then save the stream to an S3 file? And what's the benefit of using DynamoDB Streams instead of just taking the DynamoDB table as a whole for doing the data transformation? – incnnu May 25 '20 at 11:00
  • @incnnu You can't transform with Data Pipeline, but you can write the files to S3 and then either have a CloudWatch Event trigger a Lambda, or have S3 trigger a Lambda, to read the files, turn them into CSV, and place the result in another bucket. I will update my answer. – Jason May 25 '20 at 14:48