
TL;DR

I have a table that receives about 2 million writes over the course of a month and zero reads during that time. On the 1st day of every month, I need to read all the rows written during the previous month and generate CSVs + statistics.

How should I work with DynamoDB in this scenario? How should I choose the READ throughput capacity?
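For a rough sense of scale, here is a back-of-the-envelope read-capacity calculation. The item size and job window are assumptions for illustration, not values from the question:

```python
# Back-of-the-envelope read sizing for the month-end export.
# Assumptions (not from the question): ~1 KB items, eventually consistent reads,
# and a 2-hour window for the job.
items = 2_000_000                      # rows written during the month
window_seconds = 2 * 3600              # how long the export is allowed to take

items_per_second = items / window_seconds          # ~278 items/s
# 1 read capacity unit = 2 eventually consistent reads/s for items up to 4 KB.
read_capacity_units = items_per_second / 2         # ~139 RCU

print(f"~{items_per_second:.0f} items/s -> ~{read_capacity_units:.0f} provisioned RCU")
```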

Long description

I have an application that logs client requests. It has about 200 clients. On the 1st day of every month, each client needs to receive a CSV with all the requests they've made. They also need to be billed, and for that we need to calculate some statistics on the requests they've made, grouped by type of request.

So at the end of the month, a client receives a report like:

Full list of requests

Billing Summary

I've already come up with two solutions, but I'm still not convinced by either of them.

1st solution: every last day of the month I increase the READ throughput capacity and then run a MapReduce job. When the job is done, I decrease the capacity back to its original value.

Cons: not fully automated; risk of the increased DynamoDB capacity not being available when the job starts.
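To make the capacity change in this solution concrete, here is a minimal sketch using boto3. The table name and capacity numbers are hypothetical, and `update_table` requires both read and write values, so the write value must match the table's real setting:

```python
import boto3

dynamodb = boto3.client("dynamodb")
TABLE = "client_requests"        # hypothetical table name

def set_capacity(read_units, write_units):
    # update_table requires both values, so pass the table's real write capacity
    # unchanged. Note that capacity decreases are rate-limited per table per day.
    dynamodb.update_table(
        TableName=TABLE,
        ProvisionedThroughput={
            "ReadCapacityUnits": read_units,
            "WriteCapacityUnits": write_units,
        },
    )
    # The table_exists waiter polls until the table is ACTIVE again.
    dynamodb.get_waiter("table_exists").wait(TableName=TABLE)

set_capacity(read_units=200, write_units=100)   # bump reads before the export job
# ... run the MapReduce / export job here ...
set_capacity(read_units=5, write_units=100)     # scale back down afterwards
```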

2nd solution: I can break the generation of CSVs + statistics into small jobs run on a daily or hourly schedule. I could store partial CSVs on S3, and on the 1st day of every month I could join those files and generate a new one. The statistics would be much easier to generate: just some calculations derived from the daily/hourly statistics.

Cons: I feel like I'm turning something simple into something complex.
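To make the 2nd solution more concrete, here is a rough sketch of the daily job, assuming boto3, a hypothetical table whose partition key is the day (e.g. "2014-12-07"), and a hypothetical S3 bucket for the partial files; none of these names come from the question:

```python
import csv
import io

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")

table = dynamodb.Table("client_requests")   # hypothetical table, partition key "day"
BUCKET = "request-report-partials"          # hypothetical bucket

def export_day(day):                        # e.g. day = "2014-12-07"
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["client_id", "request_type", "timestamp"])   # assumed attributes

    kwargs = {"KeyConditionExpression": Key("day").eq(day)}
    while True:                             # follow DynamoDB's 1 MB query pages
        page = table.query(**kwargs)
        for item in page["Items"]:
            writer.writerow([item.get("client_id"),
                             item.get("request_type"),
                             item.get("timestamp")])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

    s3.put_object(Bucket=BUCKET,
                  Key=f"partials/{day}.csv",
                  Body=buf.getvalue().encode("utf-8"))
```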

Do you have a better solution? If not, what solution would you choose? Why?

barbolo
  • Have you taken a look at [DynamoDB export using Data Pipelines](http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-importexport-ddb-part2.html)? Does that solve your use case? – mkobit Dec 08 '14 at 08:39
  • Hi, @MikeKobit. I've seen this, but it looks like a solution similar to the 1st one I mentioned. The problem I see is that I'd need to change the READ throughput capacity before running the job and after running the job. That could become a point of failure. – barbolo Dec 08 '14 at 11:06
  • Have you thought about generating the monthly report during the month, and ship+reset on the 1st? Seems like a good de-normalization fit in my mind – Chen Harel Dec 10 '14 at 18:14
  • @ChenHarel This looks similar to the 2nd solution I commented in the question. I have the feeling that this is going to be a complex implementation. Do you have this feeling too? – barbolo Dec 11 '14 at 01:03
  • I feel that your #2 solution is unnecessarily complicated and harder to maintain. I suggest another DynamoDB table for monthly reporting. Every piece of data that comes in gets logged as-is in the big table and appended to the ongoing monthly report – Chen Harel Dec 11 '14 at 10:02
  • @ChenHarel I think this would solve half of the problem: the stats generation. However, I still need to generate a CSV (FULL LIST OF REQUESTS) containing a row for each row of the DynamoDB table. That's why I thought I could generate an hourly CSV, join them every day, and join about 30 CSVs at the end of the billing cycle. That way the READ throughput capacity could be predictable and constant. What do you think about that? – barbolo Dec 13 '14 at 15:49
  • I'm not sure I follow with respect to the ongoing generation of CSVs. To conclude my take, if at the end of the day, the data becomes immutable, it makes sense to crunch it daily and save a "day-view". It doesn't matter if it's CSV, JSON or a different DynamoDB table. – Chen Harel Dec 14 '14 at 11:13
  • Would you like to write a final answer? The bounty expires in 3 days, and I think you have gone far enough. – barbolo Dec 14 '14 at 18:50
  • I guess it won't be any different than Bohemian's answer, and I'm here to help, not for the points :) Good luck! – Chen Harel Dec 20 '14 at 14:01

3 Answers


Having been in a similar place myself, here is the approach I used and now recommend to you: process the raw data:

  • as often as you reasonably can (start with daily)
  • to a format as close as possible to the desired report output
  • with as much calculation/CPU intensive work done as possible

leaving as little to do at report time as possible.

This approach is entirely scalable - the incremental frequency can be:

  • reduced to as small a window as needed
  • parallelised if required

It also makes it possible to re-run past months' reports on demand, since the report generation time should be quite small.

In my example, I shipped denormalized, pre-processed (financial calculations) data every hour to a data warehouse, then reporting just involved a very basic (and fast) SQL query.

This had the additional benefit of spreading the load on the production database server across lots of small bites, instead of bringing it to its knees once a week at invoice time (30,000 invoices produced every week).
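Adapting this idea to the S3-based partials discussed in the question, the month-end step shrinks to concatenating the pre-built daily files plus a cheap aggregation. A rough sketch with boto3; the bucket name and key layout are invented for illustration:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "request-report-partials"          # hypothetical bucket holding daily partials

def build_monthly_csv(month, out_key):      # e.g. month = "2014-12"
    """Concatenate the daily CSV partials for one month into a single report."""
    rows, header = [], None
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=f"partials/{month}-"):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            lines = body.decode("utf-8").splitlines()
            if header is None:              # keep the header from the first file only
                header = lines[0]
                rows.append(header)
            rows.extend(lines[1:])
    s3.put_object(Bucket=BUCKET, Key=out_key,
                  Body="\n".join(rows).encode("utf-8"))

build_monthly_csv("2014-12", "reports/2014-12-full-list.csv")
```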

Bohemian
  • Your answer is more conceptual than practical, but I agree with your position. I believe it's an approach that adds complexity, but solves the problem. I'll wait for some hours to see if something new/better comes up, otherwise I'll accept your answer. Thanks for helping. – barbolo Dec 16 '14 at 12:44
  • I wired the whole thing up using crontab. I fired a script that was in pseudo `query: copy myview to 'file.dat'`, then `scp file.dat svr2` then `ssh svr2 query: copy table from 'file.dat'` – Bohemian Dec 16 '14 at 12:50

I would use Kinesis to produce daily, almost real-time billing. For this purpose I would create a special DynamoDB table just for the calculated data (another option is to run it on flat files). Then I would add a process which sends events to Kinesis just after you update the regular DynamoDB table.

Thus, when you reach the end of the month, you can just execute whatever post-billing calculations you have and create your CSV files from the already-calculated table.
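A minimal sketch of the event-shipping side, assuming boto3 and an already-created Kinesis stream; the stream name and payload fields are invented for illustration:

```python
import json

import boto3

kinesis = boto3.client("kinesis")
STREAM = "billing-events"                   # hypothetical, already-created stream

def publish_request_event(client_id, request_type, timestamp):
    # Ship the same record you just wrote to the main DynamoDB table, so a
    # consumer can keep the per-client billing table up to date during the month.
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps({
            "client_id": client_id,
            "request_type": request_type,
            "timestamp": timestamp,
        }).encode("utf-8"),
        PartitionKey=str(client_id),        # keeps one client's events ordered per shard
    )
```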

I hope that helps.

Wiz Cat
  • I've never used Kinesis before, but that wouldn't be a problem. At the end of the billing cycle, I'll need to generate a CSV from the DynamoDB table (which could have some millions of rows), and that would still be a problem of irregular READ throughput capacity during the month. I don't think my problem is solved with this answer, but I thank you for answering. – barbolo Dec 13 '14 at 15:42

Take a look at Dynamic DynamoDB. It will increase/decrease the throughput when you need it without any manual intervention. The good news is you will not need to change the way the export job is done.

Vitalii Fedorenko
  • Thanks for the suggestion, really interesting, although it doesn't solve the risk of the capacity not being available when the job starts. :/ – barbolo Dec 22 '14 at 18:43
  • Assuming you have a proper retry policy in place, it shouldn't be an issue. And, from my experience, unless getting the result isn't critical, you should do exponential retries when dealing with dependencies like DynamoDB. There are many factors that are not under your control (network issues, throughput limitations, AWS AZ failures, etc.), and a simple retry policy can mitigate these kinds of issues by a great percentage. – Vitalii Fedorenko Dec 23 '14 at 01:16
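For the retry policy mentioned here, a minimal exponential-backoff sketch in Python; the attempt count and delays are arbitrary, and in practice you would catch only throttling errors:

```python
import random
import time

def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Run operation(), retrying with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:                   # in practice, catch only throttling errors
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage (hypothetical call):
# result = with_retries(lambda: table.query(KeyConditionExpression=condition))
```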