I am working on redesigning a data pipeline that imports customer data in CSV format from cloud buckets that the customers own (we already have the connection details) into a transactional database that we own. Constraints:
- We import ~10 million files from ~10 thousand customers every day. Each customer can start multiple imports at any time.
- File sizes vary from small (~10 records) to large (~millions of records).
The current design runs imports as they come in, and we are seeing issues with error handling, retries, and backpressure management. Ideas so far:
- Copy incoming files into a staging blob store that we control and schedule jobs to import the staged data (rough sketch below).
- Use an orchestrator like Airflow with sensors, running a DAG run per import (second sketch below).
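For the staging idea, here is a minimal sketch of what I'm picturing, with an in-memory queue purely for illustration (in practice it would be something durable like SQS or Pub/Sub): a bounded queue of import jobs gives backpressure on the producer side, a fixed worker pool is sized to what the DB can absorb, and each job retries with exponential backoff before going to a dead-letter path. `download_file` and `load_into_db` are placeholders for whatever SDK and DB client we'd actually use.

```python
# Sketch of idea 1: staged files become jobs on a bounded queue feeding a
# fixed worker pool. The bounded queue is the backpressure mechanism.
import queue
import random
import threading
import time

MAX_IN_FLIGHT = 100   # bounded queue = backpressure on the producer side
WORKER_COUNT = 16     # tune to DB write capacity, not to inbound volume
MAX_RETRIES = 5

jobs: "queue.Queue[dict]" = queue.Queue(maxsize=MAX_IN_FLIGHT)

def download_file(job: dict) -> bytes:
    """Placeholder: fetch the staged CSV for this job from our staging bucket."""
    return b""

def load_into_db(payload: bytes) -> None:
    """Placeholder: parse the CSV and write rows to the transactional DB."""
    pass

def import_with_retries(job: dict) -> None:
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            load_into_db(download_file(job))
            return
        except Exception:
            if attempt == MAX_RETRIES:
                # park the job for manual inspection / a dead-letter store
                print(f"giving up on {job['file']}")
                return
            # exponential backoff with jitter before retrying
            time.sleep((2 ** attempt) + random.random())

def worker() -> None:
    while True:
        job = jobs.get()
        try:
            import_with_retries(job)
        finally:
            jobs.task_done()

def enqueue_import(customer_id: str, file_name: str) -> None:
    # blocks when the queue is full, which throttles ingestion upstream
    jobs.put({"customer": customer_id, "file": file_name})

if __name__ == "__main__":
    for _ in range(WORKER_COUNT):
        threading.Thread(target=worker, daemon=True).start()
    enqueue_import("cust-123", "orders-2024-01-01.csv")
    jobs.join()
```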
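For the Airflow idea, a rough sketch assuming Airflow 2.x with the Amazon provider's `S3KeySensor`; the connection id, the conf keys, and the `import_csv_file` loader are made up for illustration. Each customer import would be a separately triggered run (e.g. `airflow dags trigger -c '{"bucket": "...", "key": "..."}' customer_csv_import`), with retries delegated to the scheduler and `max_active_runs` acting as a crude concurrency cap.

```python
# Sketch of idea 2: one externally-triggered DAG, each import passed in
# via dag_run.conf. Assumes Airflow 2.x + the Amazon provider.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def import_csv_file(**context):
    """Placeholder: stream the CSV from the customer bucket and load it into the DB."""
    conf = context["dag_run"].conf
    print(f"importing s3://{conf['bucket']}/{conf['key']}")


with DAG(
    dag_id="customer_csv_import",
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,          # triggered per import, not on a schedule
    catchup=False,
    max_active_runs=64,              # crude backpressure knob at the DAG level
    default_args={
        "retries": 3,                # retries handled by the scheduler
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    # wait until the customer's file is actually visible in their bucket
    wait_for_file = S3KeySensor(
        task_id="wait_for_file",
        bucket_name="{{ dag_run.conf['bucket'] }}",
        bucket_key="{{ dag_run.conf['key'] }}",
        aws_conn_id="customer_s3",   # assumed connection id
        mode="reschedule",           # free the worker slot while waiting
        poke_interval=60,
        timeout=60 * 60,
    )

    load = PythonOperator(
        task_id="load_into_db",
        python_callable=import_csv_file,
    )

    wait_for_file >> load
```

My concern with this one is whether a DAG run per file is sane at ~10 million files a day, or whether runs should be per customer batch instead.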
Curious to see how the community would implement this.