I hope my question is simple. What happens when someone enables Spark's dynamic allocation with a Cassandra database?
I have a 16-node cluster where every node runs both Spark and Cassandra, in order to achieve data locality. I am wondering how dynamic allocation works in this case. Spark will calculate the workload in order to "hire" workers, right? But how does Spark know the size of the data in Cassandra (in order to calculate the workload) unless it queries it first?
For example, what if Spark hires 2 workers and the data in Cassandra is located on a 3rd node? Wouldn't that increase network traffic and add time while Cassandra copies the data from node 3 to node 2?
I tried it with my application and saw in the Spark UI that the master hired 1 executor to query the data from Cassandra and then added another 5 or 6 executors for the further processing. Overall, it took 10 minutes longer than the usual 1 minute it takes without dynamic allocation.
(FYI: I am also using spark-cassandra-connector 3.1.0)
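
For reference, this is roughly how dynamic allocation gets switched on in a setup like mine. It is only a minimal sketch: the property names are standard Spark and spark-cassandra-connector settings, but every value, the host address, and the `my_app.py` file name are placeholders assumed for illustration, not my actual configuration.

```python
# Sketch of the settings involved when enabling dynamic allocation.
# Property names are real Spark / connector options; all VALUES are assumed placeholders.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Without an external shuffle service, Spark 3.x needs shuffle tracking instead.
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",    # placeholder
    "spark.dynamicAllocation.maxExecutors": "16",   # placeholder: one per node
    "spark.cassandra.connection.host": "127.0.0.1", # placeholder host
}


def to_submit_args(conf):
    """Turn a dict of Spark properties into a list of --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # my_app.py is a hypothetical application file.
    command = ["spark-submit"] + to_submit_args(dynamic_allocation_conf) + ["my_app.py"]
    print(" ".join(command))
```

With `minExecutors` set to 1 as in this sketch, the job would start with a single executor, which matches what I saw in the Spark UI.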