I have to set up a data pipeline for an app I am trying to build, but I am not sure how to approach it.
I have two entities in the database, A and B; each B entity belongs to an A entity.
Every minute I fetch many B entities, but one field is missing on each of them, so I need to compute that field before saving them. Given a B entity and its corresponding A entity, computing the missing field requires the last 20 B entities already saved in the database (so no longer missing the field) that belong to that A entity.
The per-minute pseudocode is:
- HTTP request to fetch the list of new B entities to save.
- For each B entity:
  - Read the A entity of the B entity (each B entity has a field with the id of the A entity it belongs to).
  - Get the last 20 saved B entities of that A entity.
  - Compute the missing field and save the B entity.
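To make the loop above concrete, here is a rough Python sketch with in-memory stand-ins for the HTTP fetch and the database. All the names are made up, and I have used the mean of the last 20 values as a placeholder for the real computation:

```python
from collections import defaultdict, deque

# Hypothetical in-memory stand-ins for the real HTTP endpoint and MySQL tables.
A_ENTITIES = {"a1": {"id": "a1"}, "a2": {"id": "a2"}}
SAVED_B = defaultdict(lambda: deque(maxlen=20))  # a_id -> last 20 saved B values

def fetch_new_b_entities():
    """Stand-in for the HTTP request that returns the new B entities each minute."""
    return [{"a_id": "a1", "value": 10.0}, {"a_id": "a2", "value": 4.0}]

def compute_missing_field(b, last_20_values):
    """Placeholder computation: mean of the last 20 saved values (0.0 if none yet)."""
    if not last_20_values:
        return 0.0
    return sum(last_20_values) / len(last_20_values)

def run_once():
    """One iteration of the per-minute job; returns the completed B entities."""
    processed = []
    for b in fetch_new_b_entities():
        a = A_ENTITIES[b["a_id"]]        # read the A entity of the B entity
        last_20 = SAVED_B[a["id"]]       # last 20 saved B values for this A entity
        b["missing"] = compute_missing_field(b, last_20)
        last_20.append(b["value"])       # "save" the completed B entity
        processed.append(b)
    return processed
```

In the real pipeline `SAVED_B` would be the database (or the cache discussed below), but the control flow per minute would be the same.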
Order of magnitude: 20k A entities, 30 million saved B entities, and 1k new B entities every minute (these 1k B entities belong to around 300 A entities).
Instead of querying the database every minute to get the last 20 saved B entities for each A entity that appears in the fetched list, I thought I could implement a cache that stores the last 20 saved B entities for each A entity.
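Without the cache, the per-A query I would run every minute is something like the following. I am showing it against an in-memory SQLite database for brevity; against MySQL it would be the same `SELECT`, and I assume an index on `(a_id, id)` would be needed to keep it cheap at 30 million rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, a_id TEXT, value REAL)")
# Composite index so "WHERE a_id = ? ORDER BY id DESC LIMIT 20" avoids a full scan.
conn.execute("CREATE INDEX idx_b_a_id ON b (a_id, id)")

# Insert 50 saved B rows for one A entity.
conn.executemany(
    "INSERT INTO b (a_id, value) VALUES (?, ?)",
    [("a1", float(i)) for i in range(50)],
)

# Last 20 saved B entities for a given A entity, newest first.
rows = conn.execute(
    "SELECT value FROM b WHERE a_id = ? ORDER BY id DESC LIMIT 20", ("a1",)
).fetchall()
last_20 = [r[0] for r in rows]
```

With ~300 distinct A entities per minute, that would be ~300 such queries per run, which is what the cache is meant to avoid.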
So my first idea was:
- Implement an AWS Lambda function with a cache (https://dashbird.io/blog/leveraging-lambda-cache-for-serverless-cost-efficiency/) that runs all the logic described above every minute.
- Add a CRON that calls the Lambda function every minute.
- Data is stored in a MySQL database on AWS.
Since this is the first time I have had to set up a data pipeline, I am not sure my first idea is good, and I have several questions:
- How would you implement it?
- Is caching a good idea, or is it better to just query the database?
- Is AWS a good choice?
- CRON has a minimum granularity of 1 minute, so I am already at its limit…
- The structure of the cache would be a dictionary with the A entity id as key (a 20-character string) and a list of 20 numbers as value (20k A entities, so 20k key-value pairs). Does it make sense to build such a cache in an AWS Lambda function?
- Would you advise using a data pipeline framework or another technology instead?
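Concretely, the cache I have in mind is just a dictionary of bounded deques, roughly 20k keys of 20 numbers each, so a few MB at most. This is a sketch under the assumption that the "last 20" values are plain floats:

```python
from collections import deque

# a_id (20-character string) -> last 20 saved B values for that A entity
cache = {}

def record_saved_b(a_id, value):
    """Append a newly saved B value; deque(maxlen=20) silently drops the oldest."""
    cache.setdefault(a_id, deque(maxlen=20)).append(value)

# Example: after 25 saves for one A entity, only the most recent 20 remain.
key = "a" * 20  # hypothetical 20-character A entity id
for i in range(25):
    record_saved_b(key, float(i))
```

The open question is whether this dictionary survives between invocations: as I understand it, Lambda only keeps in-memory state while the execution environment is warm, so the cache would need to be rebuilt from the database (or held in something external like Redis) after a cold start.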
Thank you in advance for your feedback :)