
I am trying to build a Kafka and Spark Streaming use case, in which Spark Streaming consumes a stream from Kafka. We enrich the stream and store the enriched stream in some target system.

My question is: does it make sense to run the Spark Streaming job in yarn-cluster or yarn-client mode? (Hadoop is not involved here.)

I think the Spark Streaming job should run only in local mode, but another question is how to improve the performance of the Spark Streaming job.

Thanks,

Ruslan Ostafiichuk
Ankit Tripathi
  • What is your use case? What amount of data? There are a lot of parameters that need to be taken into consideration. If you plan to read data from Kafka and run some enrichment or SQL aggregations, each use case has different configuration parameters for improving Spark Streaming performance. – Arnon Rodman Nov 13 '19 at 07:51

2 Answers


The difference is that with yarn-client you force the Spark job to use the host where you run spark-submit as the driver, whereas with yarn-cluster the driver will not necessarily run on the same host every time you submit the job.

So the best choice is to always use yarn-cluster, to avoid overloading the same host if you are going to submit multiple jobs from that host with yarn-client.


local[*]

This runs the job in local mode. Usually we use this to perform POCs and to work on very small data. You can debug the job to understand how each line of code works. But be aware that, since the job runs on your local machine, you cannot get the most out of Spark's distributed architecture.
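As a sketch, a local-mode submission might look like this (the JAR name and main class are hypothetical placeholders, not from the question):

```shell
# Run the streaming job locally, using as many worker threads
# as there are cores on this machine ("local[*]").
# "com.example.StreamingApp" and "my-streaming-app.jar" are placeholders.
spark-submit \
  --master "local[*]" \
  --class com.example.StreamingApp \
  my-streaming-app.jar
```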

yarn-client

Your driver program runs on the YARN client machine, where you type the command to submit the Spark application, but the tasks are still executed on the executors in the cluster.
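A minimal client-mode submission could be sketched like this (again with placeholder JAR and class names):

```shell
# Client mode: the driver runs on this machine and stays attached
# to the terminal; executors run on the YARN cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --class com.example.StreamingApp \
  my-streaming-app.jar
```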

yarn-cluster

In cluster mode, the Spark driver runs inside an application master process, which is managed by YARN on the cluster, and the client can go away after initiating the application. This is the preferred way of running a Spark job if you want to benefit from the advantages provided by a cluster manager.
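For a long-running streaming job, a cluster-mode submission might look like the sketch below. The executor sizing flags are illustrative starting points only (the right values depend on your data volume and cluster), and the JAR and class names are placeholders:

```shell
# Cluster mode: the driver runs inside a YARN application master,
# so this shell can disconnect after submission.
# Executor counts and sizes here are illustrative, not recommendations.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.StreamingApp \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 2g \
  my-streaming-app.jar
```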

I hope this gives you clarity on how you may want to deploy your Spark job.

In fact, Spark provides very clean documentation explaining the various deployment strategies, with examples: https://spark.apache.org/docs/latest/running-on-yarn.html

wandermonk