
This is more of a pick-your-brain question based on your experience, as I am not able to find resources that help me decide one way or the other.

I have a Kafka message queue through which around 7-8 million events are streamed every day. The messages eventually need to be persisted in a MySQL database.

Approach 1:

I can write microservices, containerize them, and run multiple instances of the container app in a Kafka consumer group, with a Kafka listener in each instance consuming events and writing them into MySQL.
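For reference, a minimal sketch of what one such consumer instance might look like (the events topic, the events table, and the connection details are assumptions for illustration); offsets are committed only after the rows are written, which gives at-least-once delivery:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventPersister {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "event-persister");         // every instance joins the same group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");          // commit only after the DB write succeeds

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/events_db", "app", "secret"); // hypothetical DB
             PreparedStatement insert = db.prepareStatement(
                     "INSERT INTO events (event_key, payload) VALUES (?, ?)")) { // hypothetical table

            consumer.subscribe(List.of("events")); // hypothetical topic name

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    insert.setString(1, record.key());
                    insert.setString(2, record.value()); // any transformation would happen here
                    insert.addBatch();
                }
                if (!records.isEmpty()) {
                    insert.executeBatch(); // write the whole poll as one batch
                    consumer.commitSync(); // at-least-once: offsets move only after persisting
                }
            }
        }
    }
}
```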

Approach 2:

Another approach I was thinking about was to just use a Spark job and have it process the stream of events and persist them into the MySQL DB. That way I don't have to worry about managing the container app, and I keep the operations cost down.
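A Spark Structured Streaming job for this could be roughly along these lines. This is only a sketch, assuming the same hypothetical events topic and table, and it would need the spark-sql-kafka connector and the MySQL JDBC driver on the classpath:

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToMysqlJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-to-mysql")
                .getOrCreate();

        // Read the topic as a streaming DataFrame (topic and broker are assumptions).
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING) AS event_key",
                            "CAST(value AS STRING) AS payload");

        // Each micro-batch is written to MySQL through Spark's built-in JDBC writer.
        VoidFunction2<Dataset<Row>, Long> writeBatch = (batch, batchId) ->
                batch.write()
                     .format("jdbc")
                     .option("url", "jdbc:mysql://localhost:3306/events_db") // hypothetical DB
                     .option("dbtable", "events")                            // hypothetical table
                     .option("user", "app")
                     .option("password", "secret")
                     .mode("append")
                     .save();

        StreamingQuery query = events.writeStream()
                .foreachBatch(writeBatch)
                .option("checkpointLocation", "/tmp/kafka-to-mysql-checkpoint") // for restart/recovery
                .start();

        query.awaitTermination();
    }
}
```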

Given the volume of data, I am not sure whether Spark would be overkill, and whether the cost of Spark would exceed the capital and operational expenses I would incur with a container app on, say, a managed Kubernetes environment.

Can someone guide me on how to go about this?

Ray S
  • Why not use the Kafka JDBC connector -- https://docs.confluent.io/current/connect/kafka-connect-jdbc/index.html – Nitin Nov 14 '19 at 17:11
  • I need some transformation on the messages as well before they get persisted into the MySQL DB. The JDBC connector would just dump the message as is. – Ray S Nov 14 '19 at 17:13
  • Overkill is not a relevant issue if you have the infra already. – thebluephantom Nov 14 '19 at 23:17

2 Answers


You can still use the JDBC Connector.

Approach 1: You can spawn as many microservice instances as you have partitions, and it is easy to tune partitions and instance counts based on your load. It's good for a simple application, and you can control the rate of consuming messages and intervene manually, but it gives you no stateful data storage, no real-time analytic operations, and a lot of manual coding.

Approach 2: Spark Streaming requires an extra Spark cluster to manage the load. My understanding is that Spark is good for real-time analytics and machine learning, so if you are doing complex operations you can aggregate and push the aggregated messages to the DB (aggregated messages, not the raw millions of records).

Approach 3: Kafka Streams - easy to use, with exactly-once semantics and better failover, and at the same time it gives you stateful analytic power. You can manage the load by tuning the number of instances based on partitions.

You can choose any of the above approaches based on your need, but loading into the database directly requires manual ACID management across multiple parallel instances, failover handling on the DB, transaction management, etc.

So I would suggest not loading into the DB directly; instead, publish the transformed messages to a topic and create a JDBC sink connector from that topic to your DB.

So here is my thought:

  1. Use Kafka Streams (or a microservice / consumer group) -- you can optimize instances and partitions based on your load. I would recommend Kafka Streams, as you can leverage exactly-once semantics and better fault tolerance, and you can also use stateful operations in case they are needed. (A minimal sketch of steps 1 and 2 follows this list.)

  2. Do the transformation and publish the output messages to one Kafka topic.

  3. Create a JDBC sink connector from the topic to the database: this is just a single connector that you can control with different parameters (Kafka JDBC Connector: https://docs.confluent.io/current/connect/kafka-connect-jdbc/index.html).
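As referenced in step 1, here is a minimal sketch of a Kafka Streams app covering steps 1 and 2. The topic names and the transform body are assumptions, and the exactly-once guarantee applies to the Kafka-to-Kafka part of the pipeline:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class EventEnricher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "event-enricher");    // doubles as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2); // exactly-once within Kafka

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> raw = builder.stream("events");   // hypothetical input topic
        raw.mapValues(EventEnricher::transform)                    // per-message transformation
           .to("events-transformed");                              // topic the JDBC sink connector reads from

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    // Placeholder for whatever enrichment/modification the messages need.
    private static String transform(String value) {
        return value.trim();
    }
}
```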

Nitin
  • So basically, Kafka topic to Kafka Streams (enrich/modify data) to Kafka topic to MySQL? Is this how it would work? – Ray S Nov 15 '19 at 19:36

You have mentioned two approaches and, as suggested in the comments, Kafka Connect is another approach.

  1. Using a microservice API: Even if you follow this approach, you still need some kind of service (e.g. a supervisor-based one that runs in the background all the time, load balanced based on topics/partitions) that fetches data from the Kafka topic and calls the REST API. I would avoid this approach.

  2. Spark Streaming: Maintaining a Spark cluster just to ingest data from Kafka into a DB is overkill. Also remember that Spark by default only supports appending (bulk inserts) to a table; if you need to update a row, you have to iterate and do it yourself in a JDBC-style approach. I would avoid this approach.

  3. Kafka Connect: It is part of the open-source Kafka framework, can scale based on the number of tasks you add for each deployed job (you can create one task per partition on the topic for maximum performance), and persists the offsets back to Kafka, which lets you visualize the lag in Kafka monitoring tools. I would definitely follow this approach.
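For illustration, a JDBC sink connector along these lines could be registered through Kafka Connect's REST API in distributed mode. The connector name, topic, task count, and connection details are assumptions; the connector class is the Confluent JDBC sink:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSink {
    public static void main(String[] args) throws Exception {
        // Connector definition for the Confluent JDBC sink; name, topic, task count
        // and connection details are assumptions for illustration.
        String connector = """
            {
              "name": "mysql-events-sink",
              "config": {
                "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
                "tasks.max": "6",
                "topics": "events-transformed",
                "connection.url": "jdbc:mysql://localhost:3306/events_db",
                "connection.user": "app",
                "connection.password": "secret",
                "insert.mode": "insert",
                "auto.create": "true"
              }
            }
            """;

        // Kafka Connect in distributed mode exposes a REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connector))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```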

skjagini
  • Can Kafka Connect (a CDC tool) be used to persist data from a Kafka stream to JDBC? – Ray S Nov 14 '19 at 18:42
  • @RayS Yes, you can write from Kafka Streams to the DB directly, https://stackoverflow.com/questions/46524930/how-to-process-a-kafka-kstream-and-write-to-database-directly-instead-of-sending, but you then need to handle data loss yourself; if you write to a Kafka topic after the stream operation, you get all the benefits at the expense of a little performance. – skjagini Nov 14 '19 at 19:12
  • I am not sure why I would ever write back into a Kafka topic from a Kafka stream instead of directly saving to the DB. In any case, if I have a Kafka topic to begin with and all I need is to do some transformation on the messages and shove them into MySQL, won't Kafka Connect be a better choice? Or does Kafka Connect not let you manipulate the data in the topic before saving it into the DBMS? – Ray S Nov 14 '19 at 19:46
  • My understanding is that Kafka Connect only works on a topic, either as a source or a sink. You can always write from Kafka Streams directly to the DB, but I'm not sure you can use Kafka Connect to persist it. If you can't, you lose all the benefits that Connect provides. – skjagini Nov 14 '19 at 20:49