
I am trying to write a Spark Structured Streaming job that reads from multiple Kafka topics (potentially 100s) and writes the results to different locations on S3 depending on the topic name. I've developed this snippet of code that currently reads from multiple topics and outputs the results to the console (based on a loop) and it works as expected. However, I would like to understand what the performance implications are. Would this be the recommended approach? Is it not recommended to have multiple readStream and writeStream operations? If so, what is the recommended approach?

my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
Brandon
  • Why do you want a different checkpointLocation for each topic? You can use one for all topics. – Srinivas Jun 15 '20 at 06:15
  • Kafka Connect is generally a better approach for Kafka -> S3. I can provide an answer based on that if it would be useful. – Robin Moffatt Jun 15 '20 at 16:30
  • @Srinivas In the event where I need to restart/reset a specific topic by clearing the checkpoint location, would it not be better to have separate checkpoint locations to avoid the possibility of coupling/causing issues with the checkpoints for other topics? – Brandon Jun 15 '20 at 17:20
  • @RobinMoffatt I have explored the option of using Kafka Connect, however, I would like to use Spark Structured Streaming to expand on the number of sinks down the line. – Brandon Jun 15 '20 at 17:22
  • (Kafka Connect can handle regex topic list, if that's your concern.) – Robin Moffatt Jun 15 '20 at 18:59
  • I also want to: 1) Leverage Spark capabilities (ML) 2) Use an existing EMR Cluster instead of spinning up a separate compute environment to run Kafka Connect – Brandon Jun 15 '20 at 21:03

2 Answers


It's certainly reasonable to run a number of concurrent streams per driver node.

Each .start() consumes a certain amount of driver resources in Spark. Your limiting factor will be the load on the driver node and its available resources. Hundreds of topics running continuously at a high rate would need to be spread across multiple driver nodes [in Databricks there is one driver per cluster]. The advantage of Spark is, as you mention, support for multiple sinks and a unified batch & streaming API for transformations.

The other issue will be dealing with the many small writes you may end up making to S3, and with file consistency. Take a look at delta.io to handle consistent & reliable writes to S3.
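If the per-query driver cost becomes the limiting factor, one alternative worth weighing is a single query subscribed to all topics, with the S3 output partitioned by the Kafka topic column. Below is a minimal sketch of that idea (broker address, topic pattern, and S3 paths are placeholders; the parquet sink could be swapped for Delta if you adopt delta.io):

from pyspark.sql import SparkSession

# Minimal sketch: one readStream over many topics, one writeStream that
# splits the output by topic. All addresses and paths are placeholders.
spark = SparkSession.builder.appName("multi-topic-to-s3").getOrCreate()
bootstrap_servers = "broker1:9092,broker2:9092"  # placeholder

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", bootstrap_servers) \
    .option("subscribePattern", "topic_.*") \
    .load() \
    .selectExpr("topic", "CAST(key AS STRING)", "CAST(value AS STRING)")

query = df \
    .writeStream \
    .format("parquet") \
    .option("path", "s3://<MY_BUCKET>/data") \
    .option("checkpointLocation", "s3://<MY_BUCKET>/checkpoints/all_topics") \
    .partitionBy("topic") \
    .outputMode("append") \
    .start()

With partitionBy("topic"), each topic's records land under their own S3 prefix, at the cost of sharing one checkpoint and one restart unit across all topics.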

Douglas M
  • So if it is reasonable to run a number of concurrent streaming jobs per driver node, it seems that it is a matter of finding the right balance based on the size of the cluster. I also know that you can subscribe to multiple topics with a single .start() call: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-for-streaming-queries. Does this reduce the load (i.e. the amount of driver resources in Spark)? – Brandon Jun 22 '20 at 15:41
  • @Brandon Yes, it should: one process handling multiple streams and their planning vs. multiple processes, each handling one. Your mileage may vary. – Douglas M Jun 24 '20 at 18:37

Advantages of the approach below:

  1. Generic.
  2. Multiple threads; each thread works independently.
  3. Easier to maintain the code & support any issues.
  4. If one topic fails, there is no impact on the other topics in production. You just have to focus on the failed one.
  5. If you want to re-pull all data for a specific topic, you just have to stop the job for that topic, update or change the config & restart the same job.

Note - The code below is not completely generic; you may need to change or tune it.

topic="" // Get value from input arguments
sink="" // Get value from input arguments

df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", topic) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", sink) \
        .start()        
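As an illustration of "Get value from input arguments", here is a minimal sketch assuming the topic and checkpoint location are passed on the command line when each per-topic job is submitted (the script name, argument order and paths are illustrative, not part of the original answer):

import sys

# Hypothetical invocation, one job per topic:
#   spark-submit stream_one_topic.py topic_1 s3://<MY_BUCKET>/topic_1
topic = sys.argv[1]  # Kafka topic (or pattern) this job reads
sink = sys.argv[2]   # checkpoint location on S3 for this topic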

Problems with the approach below:

  1. If one topic fails, it will terminate the complete program.
  2. Limited threads.
  3. Difficult to maintain the code, debug & support any issues.
  4. If you want to re-pull all data for a specific topic from Kafka, it's not possible, as any config change will apply to all topics; hence it's too costly an operation.
my_topics = ["topic_1", "topic_2"]

for i in my_topics:
    df = spark \
        .readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", bootstrap_servers) \
        .option("subscribePattern", i) \
        .load() \
        .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

    output_df = df \
        .writeStream \
        .format("console") \
        .option("truncate", False) \
        .outputMode("update") \
        .option("checkpointLocation", "s3://<MY_BUCKET>/{}".format(i)) \
        .start()
Srinivas
  • Thanks for your response! If I am trying to read from many Kafka topics, would it be better to have multiple Spark Structured Streaming jobs (1 per topic) or fewer jobs with multiple topics? What are the performance impacts on a Spark cluster when I have many jobs running concurrently on a cluster? – Brandon Jun 22 '20 at 15:36
  • @Srinivas, do I need to specify a different checkpointLocation path for each Kafka topic to avoid reading duplicate offsets when re-submitting the Spark job? Is it good practice? – deeplay Nov 04 '20 at 19:38
  • It's better to specify a different checkpoint location for each Kafka topic. – Srinivas Nov 05 '20 at 03:35