
Runtime

  • YARN cluster mode

Application

  • Spark Structured Streaming
  • Read data from a Kafka topic

About the Kafka topic

  • 1 topic with 4 partitions, for now (the number of partitions can be changed)
  • At most 2,000 records are added to the topic per second
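
For context, the reader is just the standard Kafka source for Structured Streaming; a minimal Scala sketch (broker address and topic name are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-structured-streaming")  // app name is a placeholder
      .getOrCreate()

    // Standard Kafka source; bootstrap servers and topic are placeholders
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "my-topic")
      .load()

    // Kafka delivers key/value as binary columns; cast to strings for downstream use
    val records = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")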

I've read that the number of Kafka topic partitions should be matched 1:1 with the number of Spark executors.
So, from what I know so far, 4 Spark executors seems to be the answer.
But I'm worried about data throughput: can 2,000 rec/sec be handled?

Is there any guidance or recommendation for setting the proper configuration in Spark Structured Streaming?
Especially spark.executor.cores, spark.executor.instances, or other executor-related settings.
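
For example, I don't know what values to put in; a spark-submit sketch where the numbers are only my guesses:

    # instances/cores/memory values are guesses; jar name is a placeholder
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.instances=4 \
      --conf spark.executor.cores=1 \
      --conf spark.executor.memory=2g \
      my-streaming-app.jar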

nullmari

1 Answer


Setting spark.executor.cores to 5 or less is usually considered optimal for HDFS I/O throughput. You can read more about it here (or search for other articles): https://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/

Each Kafka partition is matched to a Spark core, not an executor (one Spark core can consume multiple Kafka partitions, but each Kafka partition is read by exactly one core).

Deciding on the exact numbers you need depends on many other things, such as your application flow (e.g., if you are not doing any shuffle, the total number of cores should exactly equal the number of Kafka partitions), memory capacity and requirements, and so on.
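
For example, with your 4 partitions and no shuffle, any layout that gives 4 total cores fits both points above; one illustrative layout (memory value is a placeholder):

    # 2 executors x 2 cores = 4 total cores, one per Kafka partition
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.executor.instances=2 \
      --conf spark.executor.cores=2 \
      --conf spark.executor.memory=4g \
      my-streaming-app.jar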

You can experiment with the configurations and use Spark metrics to decide whether your application is keeping up with the throughput.
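
For instance, a minimal Scala sketch using the standard StreamingQueryListener API to watch the per-micro-batch rates Spark reports (the println is illustrative; spark is assumed to be your active SparkSession):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // If processedRowsPerSecond consistently trails inputRowsPerSecond,
        // the query is not keeping up with the ~2000 rec/sec input rate
        println(s"input: ${p.inputRowsPerSecond} rec/s, processed: ${p.processedRowsPerSecond} rec/s")
      }
    })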

user_s
    "one spark core can have multiple Kafka partitions but each Kafka partition will have exactly one core" -- can you link some sources for this ? I could not find any related information in structured streaming or kafka integration guides. – ksceriath Nov 19 '20 at 09:34
  • The answer https://stackoverflow.com/a/46640771/8137610 provided the needed source. – Shuai Liu Mar 11 '22 at 02:17