
So, I have a 16-node cluster where every node has Spark and Cassandra installed, with a replication factor of 3 and spark.sql.shuffle.partitions set to 96. I am also using Spark-Cassandra Connector 3.0.0.

I have a Spark Dataset with 4 partition keys, and I want to do a DirectJoin with a Cassandra table.
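For context, the Dataset-side DirectJoin I am talking about looks roughly like this (a minimal sketch; the keyspace, table, and column names are made up, and CassandraSparkExtensions must be enabled for the connector's Catalyst rule to consider a DirectJoin at all):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

// Hypothetical keyspace/table/column names, for illustration only.
val spark = SparkSession.builder()
  .config("spark.sql.extensions",
          "com.datastax.spark.connector.CassandraSparkExtensions")
  .getOrCreate()

val cassandraTable = spark.read
  .cassandraFormat("my_table", "my_keyspace")
  .load()

// keysDs is the small Dataset holding the 4 partition keys.
// With the extensions enabled, the connector may rewrite this join
// into a DirectJoin when keysDs is small enough relative to the
// Cassandra table (directJoinSizeRatio), or always when
// directJoinSetting is set to "on".
val joined = keysDs.join(cassandraTable, Seq("partition_key"))
```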

  1. Should I use repartitionByCassandraReplica? Is there a recommended number of partition keys for which it would make sense to use repartitionByCassandraReplica before a DirectJoin?

  2. Is there also a recommended value for the partitionsPerHost parameter? How could I get just 4 Spark partitions in total if I have 4 partition keys, so that rows with the same partition key end up in the same Spark partition?

  3. If I do not use repartitionByCassandraReplica, I can see from the Spark UI that a DirectJoin is performed. However, if I use repartitionByCassandraReplica on the same partition keys, I do not see any DirectJoin in the DAG, just a CassandraPartitionedRDD and later a HashAggregate. It also takes ~5 times longer than without repartitionByCassandraReplica. Any idea why, and what is happening?

  4. Does converting an RDD to a Spark Dataset after repartitionByCassandraReplica change the number or location of the partitions?

  5. How can I check whether repartitionByCassandraReplica is working properly? I am using nodetool getendpoints to see where the data are stored, but is there anything else?
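For reference, the RDD-side flow these questions refer to looks roughly like this (a sketch with assumed names; note that repartitionByCassandraReplica partitions by replica host, producing on the order of numberOfReplicaHosts * partitionsPerHost partitions, so it cannot in general be forced to exactly 4):

```scala
import com.datastax.spark.connector._

// Hypothetical key class matching the table's partition key column.
case class MyKey(partition_key: Int)

val keys = spark.sparkContext
  .parallelize(Seq(MyKey(1), MyKey(2), MyKey(3), MyKey(4)))

// Group the keys onto the replica nodes that own them.
val localKeys = keys.repartitionByCassandraReplica(
  "my_keyspace", "my_table", partitionsPerHost = 1)

// Inspect the result: partition count and the preferred replica
// location of each partition (one way to sanity-check question 5
// beyond nodetool getendpoints).
println(localKeys.getNumPartitions)
localKeys.partitions.foreach { p =>
  println(localKeys.preferredLocations(p))
}

// Node-local join against the table (RDD API, not a DirectJoin).
val joined = localKeys.joinWithCassandraTable("my_keyspace", "my_table")
```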

Please let me know if you need any more info. I just tried to summarize my questions from Spark-Cassandra: repartitionByCassandraReplica or converting dataset to JavaRDD and back do not maintain number of partitions?

Des0lat0r