
I have a 16-node cluster where every node has Spark and Cassandra installed, and I am using the Spark-Cassandra Connector 3.0.0. I would like to join a Spark Dataset with a Cassandra table on the partition key (directJoinSizeRatio is also set). For example:

// df1: the keys to look up, as a single-column DataFrame named "parkey"
Dataset<Row> df1 = sp.createDataset(stringlist, Encoders.STRING()).toDF("parkey");

// df2: the Cassandra table, joined with df1 on the partition key
Dataset<Row> df2 = sp.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "keyspacename");
                put("table", "tablename");
            }
        })
        .load()
        .select(col("parkey"), col("col1"), col("col2"))
        .join(df1, "parkey");
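For completeness, here is how I set up the session. As far as I understand, the direct-join optimization in SCC 3.0.0 only applies when the connector's Catalyst extensions are registered; the snippet below is a minimal sketch, with the host address and the `directJoinSetting`/`directJoinSizeRatio` values as illustrative placeholders:

```java
import org.apache.spark.sql.SparkSession;

public class DirectJoinSetup {
    public static void main(String[] args) {
        // Sketch: register the SCC Catalyst extensions so that a join on the
        // partition key can be rewritten into a "direct join" (per-key lookups
        // against Cassandra instead of a full-table scan). Values below are
        // placeholders, not a tested configuration.
        SparkSession sp = SparkSession.builder()
                .appName("cassandra-direct-join")
                .config("spark.sql.extensions",
                        "com.datastax.spark.connector.CassandraSparkExtensions")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                // "auto" (the default) uses directJoinSizeRatio to decide
                // whether the rewrite pays off; "on" forces it, "off" disables it.
                .config("directJoinSetting", "auto")
                .config("directJoinSizeRatio", "0.9")
                .getOrCreate();

        // When the rewrite applies, explain() on the joined DataFrame above
        // should show a "Cassandra Direct Join" node in the physical plan.
    }
}
```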

However, I would like to know how I can achieve data locality in order to avoid network traffic. So:

  1. Is it possible to use repartitionByCassandraReplica of SCC with the DataFrame API in Java?

  2. Is it possible that repartitionByCassandraReplica happens automatically under the hood in the above example?
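For context, the RDD-level call I am asking about looks roughly like this in Java (a sketch only: `Key` is a hypothetical bean mapping the partition key column, the keyspace/table names are the same placeholders as above, and the partitions-per-host value is illustrative):

```java
import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

// Hypothetical bean whose field matches the table's partition key column.
public class Key implements Serializable {
    private String parkey;
    public Key() {}
    public Key(String parkey) { this.parkey = parkey; }
    public String getParkey() { return parkey; }
    public void setParkey(String parkey) { this.parkey = parkey; }
}

// keys: an RDD of partition-key values, e.g. built from the same stringlist.
JavaRDD<Key> keys = /* ... */ null;

// Sketch: move each key to a Spark partition on a node that is a Cassandra
// replica for that key, so a subsequent joinWithCassandraTable stays local.
JavaRDD<Key> localKeys = javaFunctions(keys)
        .repartitionByCassandraReplica(
                "keyspacename", "tablename",
                10,                        // partitions per host (illustrative)
                someColumns("parkey"),
                mapToRow(Key.class));
```

But I have only seen this documented for the RDD API, which is why I am asking whether there is a DataFrame equivalent.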

Des0lat0r
  • It's been quite a while, but any ideas on this? I managed to go from DataFrame to RDD, then repartitionByCassandraReplica and collect the RDD, but I get an empty dataframe! https://stackoverflow.com/questions/69676599/spark-cassandra-connector-repartitionbycassandrareplica-returns-empty-rdd-ja – Des0lat0r Jul 28 '22 at 12:14

0 Answers