I have a small cluster with 3 Spark workers (2 executors each), and I have also installed Cassandra on the same nodes to achieve data locality. To evaluate the run times (taken from the Spark UI), I ran the same code first with one Spark-Cassandra node, then with two, and then with three, 3 times in each case. The results are below, but I do not understand why it takes more time with 3 nodes than with 2.

enter image description here

I am not sure what to check. For the times above, spark.sql.shuffle.partitions was 96, but I also tried the "3 / 3" case with 18 partitions and the times were about the same (3min 13s, 3min 5s, 3min 19s).

What could be happening and why? Please, let me know if you need more information.

Edit1

The only difference between the first 2 cases and the 3rd is the replication factor in the Cassandra database. For the first 2 cases it is 1, and for the 3rd case it is 3. Could that be the reason, for example through extra network traffic and latency?
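To make the difference between the three setups concrete, here is a rough sketch of how much of the dataset each node stores on average. It assumes tokens are spread evenly across nodes; the function name `expected_share_per_node` is hypothetical and used only for illustration. It shows storage share only and says nothing about actual read cost or latency.

```python
def expected_share_per_node(num_nodes: int, replication_factor: int) -> float:
    """Average fraction of the full dataset stored on each node.

    Assumption: data is evenly distributed across the nodes.
    """
    # There cannot be more distinct replicas of a row than there are nodes.
    effective_rf = min(replication_factor, num_nodes)
    return effective_rf / num_nodes


# (number of nodes, replication factor) for the three cases in the question
cases = {
    "1 node, RF=1": (1, 1),
    "2 nodes, RF=1": (2, 1),
    "3 nodes, RF=3": (3, 3),
}

for label, (nodes, rf) in cases.items():
    share = expected_share_per_node(nodes, rf)
    print(f"{label}: {share:.0%} of the data per node")
```

Under these assumptions, each node holds half of the data in the 2-node case, while in the 3-node case every node holds a full copy.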

Edit2

Below are some pictures from the Stages tab of the Spark UI with 3 Spark-Cassandra nodes (3rd case).

enter image description here

enter image description here

enter image description here

Des0lat0r
  • What is your job doing with the data? Is it loading all the data from Cassandra, or only specific partitions? – Saifallah KETBI Jul 25 '22 at 09:42
  • The job pulls data from Cassandra and runs a PCA with Spark ML. It loads specific partitions and performs a DirectJoin in between. I updated the question with pictures from the Spark UI Stages tab. – Des0lat0r Jul 25 '22 at 10:22
