I have to set up a data pipeline for an app I am trying to build, but I am not sure how to approach it.
I have two entities in the database, A and B; each B entity belongs to an A entity.
Every minute I fetch many B entities, but one field is missing on each of them, so I need to compute that field before saving them. Given a B entity and its corresponding A entity, computing the missing field requires the last 20 B entities already saved in the database (so no longer missing the field) that belong to that A entity.
The per-minute pseudocode is:
- HTTP request to fetch the list of new B entities to save.
- For each B entity:
  - Read the A entity of the B entity (each B entity has a field with the id of the A entity it belongs to).
  - Get the last 20 saved B entities of that A entity.
  - Compute the missing field and save the B entity.
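To make the loop above concrete, here is a rough Python sketch with in-memory stand-ins for the HTTP fetch and the database. All the names are made up, and I have used the mean of the last 20 values as a placeholder for the real computation:

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the real HTTP endpoint and MySQL tables.
A_ENTITIES = {"a1": {"id": "a1"}, "a2": {"id": "a2"}}
SAVED_B = defaultdict(lambda: deque(maxlen=20))  # a_id -> last 20 saved B values

def fetch_new_b_entities():
    """Stand-in for the HTTP request that returns the new B entities each minute."""
    return [{"a_id": "a1", "value": 10.0}, {"a_id": "a2", "value": 4.0}]

def compute_missing_field(b, last_20_values):
    """Placeholder computation: mean of the last 20 saved values (0.0 if none yet)."""
    if not last_20_values:
        return 0.0
    return sum(last_20_values) / len(last_20_values)

def run_once():
    """One iteration of the per-minute job; returns the completed B entities."""
    processed = []
    for b in fetch_new_b_entities():
        a = A_ENTITIES[b["a_id"]]        # read the A entity of the B entity
        last_20 = SAVED_B[a["id"]]       # last 20 saved B values for this A entity
        b["missing"] = compute_missing_field(b, last_20)
        last_20.append(b["value"])       # "save" the completed B entity
        processed.append(b)
    return processed
```

In the real pipeline `SAVED_B` would be the database (or the cache discussed below), but the control flow per minute would be the same.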
Order of magnitude: 20k A entities, 30 million saved B entities, and 1k new B entities every minute (these 1k B entities belong to around 300 A entities).
Instead of querying the database every minute to get the last 20 saved B entities for each A entity that appears in the fetched list, I thought I could implement a cache that stores the last 20 saved B entities for each A entity.
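Without the cache, the per-A query I would run every minute is something like the following. I am showing it against an in-memory SQLite database for brevity; against MySQL it would be the same `SELECT`, and I assume an index on `(a_id, id)` would be needed to keep it cheap at 30 million rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_id TEXT, value REAL)")
# Composite index so "WHERE a_id = ? ORDER BY id DESC LIMIT 20" avoids a full scan.
conn.execute("CREATE INDEX idx_b_a_id ON b (a_id, id)")

# Insert 50 saved B rows for one A entity.
conn.executemany(
    "INSERT INTO b (a_id, value) VALUES (?, ?)",
    [("a1", float(i)) for i in range(50)],
)

# Last 20 saved B entities for a given A entity, newest first.
rows = conn.execute(
    "SELECT value FROM b WHERE a_id = ? ORDER BY id DESC LIMIT 20", ("a1",)
).fetchall()
last_20 = [r[0] for r in rows]
```

With ~300 distinct A entities per minute, that would be ~300 such queries per run, which is what the cache is meant to avoid.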
So my first idea was:
- Implement an AWS Lambda function with a cache (https://dashbird.io/blog/leveraging-lambda-cache-for-serverless-cost-efficiency/) that runs all the logic described above every minute.
- Add a CRON that calls the Lambda function every minute.
- Data is stored in a MySQL database on AWS.
Since this is the first time I have had to set up a data pipeline, I am not sure my first idea is good, and I have several questions:
- How would you implement it?
- Is caching a good idea, or is it better to just query the database?
- Is AWS a good choice?
- CRON has a minimum granularity of 1 minute, so I am already at its limit…
- The structure of the cache would be a dictionary with the A entity id as key (a 20-character string) and a list of 20 numbers as value (20k A entities, so 20k key-value pairs). Does it make sense to build such a cache in an AWS Lambda function?
- Would you advise using a data pipeline framework or another technology instead?
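Concretely, the cache I have in mind is just a dictionary of bounded deques, roughly 20k keys of 20 numbers each, so a few MB at most. This is a sketch under the assumption that the "last 20" values are plain floats:

```python
from collections import deque

# a_id (20-character string) -> last 20 saved B values for that A entity
cache = {}

def record_saved_b(a_id, value):
    """Append a newly saved B value; deque(maxlen=20) silently drops the oldest."""
    cache.setdefault(a_id, deque(maxlen=20)).append(value)

# Example: after 25 saves for one A entity, only the most recent 20 remain.
key = "a" * 20  # hypothetical 20-character A entity id
for i in range(25):
    record_saved_b(key, float(i))
```

The open question is whether this dictionary survives between invocations: as I understand it, Lambda only keeps in-memory state while the execution environment is warm, so the cache would need to be rebuilt from the database (or held in something external like Redis) after a cold start.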
Thank you in advance for your feedback :)