
I am working on a large-scale batch data pipeline for an ML application and have some questions:

  1. Data Ingestion Layer: In most sources I have read so far, Kafka is suggested for pulling data from the original source (a CSV in my case). When reading about Kafka itself, though, it is described mainly as a stream-processing tool. Why should I use Kafka for batch processing, then, and is this really best practice? What are the alternatives?

  2. Data Storage: I am also undecided about data storage for the batch processing pipeline. Most of the services I have looked at seem to be overkill for my actual needs, and I am wondering if there is a leaner solution. E.g. my dataset is structured, so MongoDB might be excessive.

Further comments: so far, I have settled on PySpark for data processing & aggregation, and Airflow & Docker for scheduling & orchestration.

I have to decide on certain services to use and come up with a flow chart for the whole data pipeline.

1 Answer


If you have a static CSV file, no, you shouldn't really use Kafka. Even if you are making updates to this file, I would suggest Postgres/MySQL rather than MongoDB. And only if you are making updates to that data (in a SQL database, or in Mongo if you choose) does it make sense to use Debezium to pull the specific changes as a stream of events into a Kafka topic.
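
For example, if that data did live in Postgres and did change, a Debezium source connector could be registered against Kafka Connect's REST API to stream row-level changes into a topic. A minimal sketch in Python (hostnames, credentials, and table names below are placeholders, and the config keys follow Debezium 2.x naming):

    import requests

    # Hypothetical Debezium Postgres source connector; every value here is a
    # placeholder for your own environment.
    connector = {
        "name": "training-data-cdc",
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "database.hostname": "postgres.example.internal",
            "database.port": "5432",
            "database.user": "debezium",
            "database.password": "********",
            "database.dbname": "ml_app",
            "table.include.list": "public.training_data",
            "topic.prefix": "mlapp",
        },
    }

    # Kafka Connect exposes a REST endpoint for creating connectors.
    resp = requests.post("http://kafka-connect:8083/connectors", json=connector)
    resp.raise_for_status()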

If you have a static file that never changes, then pass it with the --files flag to spark-submit, or upload it to S3/HDFS and read it in your code (as a file), or load it into a database first and read it with the Spark JDBC reader / Spark SQL MongoDB connector, and you're done.
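
A minimal PySpark sketch of that second option, assuming the CSV has been uploaded to an S3 bucket (the bucket, path, and "label" column are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quarterly-training-data").getOrCreate()

    # Read the static CSV directly; no Kafka or extra database involved.
    # (Reading s3a:// paths requires the hadoop-aws package on the classpath.)
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("s3a://example-bucket/training/2023-Q3.csv")
    )

    # Whatever aggregation the pipeline needs, e.g. row counts per label.
    df.groupBy("label").count().show()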

Kafka could be used in batch processing if you ran a producer/consumer task on some scheduled interval and only processed a few records at a time, but this is not a common use case I've seen outside of log-archival purposes (consume a large chunk of events and store them in large chunks in HDFS/S3). Even for that, there was rarely a case where the data was searched more than a few months back, so it was a waste of storage, IMO.
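
To make that concrete, the "scheduled batch consumer" pattern looks roughly like this with kafka-python (the topic, broker address, and output location are assumptions; a real job would write to S3/HDFS rather than a local file):

    import json
    from kafka import KafkaConsumer  # kafka-python

    consumer = KafkaConsumer(
        "training-events",
        bootstrap_servers="kafka:9092",
        group_id="quarterly-archiver",
        auto_offset_reset="earliest",
        enable_auto_commit=False,
        consumer_timeout_ms=10_000,  # stop iterating once the topic is drained
    )

    # Drain whatever has accumulated since the last scheduled run.
    records = [json.loads(msg.value) for msg in consumer]

    # Store the chunk; in practice this would be an S3/HDFS upload.
    with open("archive_chunk.json", "w") as f:
        json.dump(records, f)

    # Commit offsets only after the chunk is safely written.
    consumer.commit()
    consumer.close()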

You also don't strictly need Docker, or even "microservices", as Airflow can schedule plain Python or Bash processes, and Spark natively splits your code into pipelines (executor tasks).
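
For instance, a bare-bones Airflow DAG can shell out to spark-submit directly, with no containers in between. A sketch, assuming Airflow 2.4+ and made-up paths, that runs once a quarter:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="quarterly_training_pipeline",
        start_date=datetime(2023, 1, 1),
        schedule="0 0 1 */3 *",  # midnight on the 1st of every third month
        catchup=False,
    ) as dag:
        process_csv = BashOperator(
            task_id="spark_process_csv",
            bash_command=(
                "spark-submit --files /data/training.csv "
                "/opt/jobs/process_training_data.py"
            ),
        )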

OneCricketeer
  • Thanks for the reply. In my project, the data is training data for the ML application and is supposed to be updated quarterly. In a real-world example, this would probably mean a new CSV every three months or pulling directly from a data lake or bucket. – user22316802 Aug 01 '23 at 04:48
  • Kafka defaults to holding data for only 7 days, not 90. It'll still be better for you not to use it. – OneCricketeer Aug 01 '23 at 12:55