
I hope my question is simple. What happens when someone enables Spark's dynamic allocation with a Cassandra database?

I have a 16-node cluster where every node has Spark and Cassandra installed, in order to also achieve data locality. I am wondering how dynamic allocation works in this case. Spark calculates the workload in order to "hire" workers, right? But how does Spark know the size of the data (in order to calculate the workload) in the Cassandra database unless it queries it first?

For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and processing time while Cassandra copies the data from node 3 to node 2?

I tried it with my application and saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors for the further processing. Overall, it took 10 minutes more than the 1 minute it normally takes without dynamic allocation.

(FYI: I am also using spark-cassandra-connector 3.1.0)
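For context, this is roughly how dynamic allocation was enabled; the contact point and executor bounds below are placeholders, not my exact configuration:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative settings only -- host and executor bounds are placeholders.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-with-cassandra")
  .config("spark.cassandra.connection.host", "10.0.0.1")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "16")
  // In Spark 3.x, shuffle tracking can stand in for the external shuffle
  // service that dynamic allocation otherwise requires.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()
```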


1 Answer


The Spark Cassandra connector estimates the size of the table using the values stored in the system.size_estimates table. For example, if the size_estimates indicates that there are 200K CQL partitions in the table and the mean partition size is 1MB, the estimated table size is:

estimated_table_size = mean_partition_size x number_of_partitions
                     = 1 MB x 200,000
                     = 200,000 MB

The connector then calculates the Spark partitions as:

spark_partitions = estimated_table_size / input.split.size_in_mb
                 = 200,000 MB / 64 MB
                 = 3,125
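
As a rough sketch of how to influence this (the keyspace and table names below are made up, and 64 MB is the connector's default split size), input.split.size_in_mb is exposed as spark.cassandra.input.split.sizeInMB, and you can check the resulting partition count:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: "my_ks" / "my_table" are placeholder names.
// Lowering the split size yields more, smaller Spark partitions;
// raising it yields fewer, larger ones.
val spark = SparkSession.builder()
  .config("spark.cassandra.input.split.sizeInMB", "64")
  .getOrCreate()

val df = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "my_ks", "table" -> "my_table"))
  .load()

// With the example figures above: 200,000 MB / 64 MB ≈ 3,125 partitions.
println(s"Spark partitions: ${df.rdd.getNumPartitions}")
```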

When there is data locality (i.e. the Spark worker/executor JVMs are co-located with the Cassandra JVMs), the connector knows which nodes own the data. You can take advantage of this functionality by calling repartitionByCassandraReplica() so that each Spark partition is processed by an executor on the same node where the data resides, avoiding a shuffle.
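A minimal sketch of that pattern with the RDD API (the keyspace, table, and key case class here are hypothetical):

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.SparkSession

// Hypothetical case class matching the table's partition key column(s).
case class CustomerKey(customer_id: Int)

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

// The partition keys we want to look up.
val keys = sc.parallelize((1 to 1000).map(id => CustomerKey(id)))

// Move each key to an executor on a node that owns a replica of that key,
// then join against the Cassandra table locally instead of shuffling table data.
val rows = keys
  .repartitionByCassandraReplica("my_ks", "orders", partitionsPerHost = 10)
  .joinWithCassandraTable("my_ks", "orders")

rows.take(5).foreach(println)
```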

For more info, see the Spark Cassandra connector documentation. Cheers!

  • Thank you for your answer! Yes I guess this could answer the question. But, still, repartitionByCassandraReplica returns empty RDD in my case ( https://stackoverflow.com/questions/69676599/spark-cassandra-connector-repartitionbycassandrareplica-returns-empty-rdd-ja ). So I guess I cannot use dynamic allocation too, as it takes way too much time without repartitionByCassandraReplica. – Des0lat0r Aug 25 '22 at 06:31