I am rather new to Spark, and I wonder what the best practice is when using Spark Streaming with Cassandra.

Usually, when performing IO, it is good practice to execute it inside a Future (in Scala). However, much of the spark-cassandra-connector seems to operate synchronously.

For example: saveToCassandra (com.datastax.spark.connector.RDDFunctions)

Is there a good reason why those functions are not async? Should I wrap them in a Future?
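
To make it concrete, this is the kind of wrapping I have in mind (just a sketch; the stream, keyspace, and table names are placeholders):

```scala
import scala.concurrent.{ExecutionContext, Future}
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

implicit val ec: ExecutionContext = ExecutionContext.global

// events is a DStream[(String, Long)] coming from somewhere upstream
def save(events: DStream[(String, Long)]): Unit =
  events.foreachRDD { rdd =>
    // saveToCassandra blocks until the whole Spark job completes,
    // so the Future only makes the driver-side call non-blocking.
    // Note: fire-and-forget like this also swallows failures.
    Future {
      rdd.saveToCassandra("my_keyspace", "my_table")
    }
  }
```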

  • Don't forget that Spark is already a distributed system, meaning your code runs in parallel on different nodes, and inside those nodes perhaps in multiple executors. I'm not sure wrapping `saveToCassandra` in a `Future` is going to give you any benefit at all; it might even degrade performance. – Yuval Itzchakov Jul 03 '16 at 11:40
  • This is the reason I'm having difficulty deciding on this matter. Spark might also be doing optimizations I am not aware of in the implementation of the different functions (for example, lazy/piggyback saving, etc.). – EranM Jul 03 '16 at 13:23
  • If you're really interested, I'd say profile and see what the results are. – Yuval Itzchakov Jul 03 '16 at 13:29
  • The main reason is that running saveToCassandra will basically submit a bunch of tasks to the DAGScheduler; these tasks will saturate the system, and you won't be able to do anything else on the Spark cluster at the same time, except for trivially small RDDs. For those you could maybe benefit from making the call asynchronously. – RussS Jul 07 '16 at 00:19

1 Answer

While there are legitimate cases where you can benefit from asynchronous execution of driver code, it is not a general rule. You have to remember that the driver itself is not where the actual work is performed, and Spark execution is subject to different types of constraints, in particular:

  • scheduling constraints related to resource allocation and DAG topology
  • batch order in streaming applications

Moreover, thinking of actions like saveToCassandra as IO operations is a significant oversimplification. Spark actions are just entry points for Spark jobs, where the IO activity is typically just the tip of the iceberg.

If you perform multiple actions per batch and have enough resources to do so without negative impact on individual jobs, or you want to perform some type of IO in the driver thread itself, then async execution can be useful. Otherwise you are probably wasting your time.
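
For the first case, a minimal sketch of what that could look like (the keyspace, table, and timeout are made up, and actually interleaving the two jobs assumes something like spark.scheduler.mode=FAIR):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import com.datastax.spark.connector._
import org.apache.spark.streaming.dstream.DStream

implicit val ec: ExecutionContext = ExecutionContext.global

def process(events: DStream[(String, Long)]): Unit =
  events.foreachRDD { rdd =>
    rdd.cache() // both jobs read the same data

    // Two independent Spark jobs submitted from separate driver threads.
    val write = Future { rdd.saveToCassandra("ks", "events") }
    val total = Future { rdd.count() }

    // Don't let the batch complete before both jobs are done.
    Await.result(write, 10.minutes)
    Await.result(total, 10.minutes)

    rdd.unpersist()
  }
```

Whether this helps at all depends on the cluster having spare capacity to run both jobs side by side; with a single action per batch the Future buys you nothing.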

zero323