I have a 16-node cluster where every node has both Spark and Cassandra installed, and I am using Spark-Cassandra Connector 3.0.0. I would like to join a Spark Dataset with a Cassandra table on the partition key (directJoinSizeRatio is also set). For example:
// "parkey" is the partition key of the Cassandra table
Dataset<Row> df1 = sp.createDataset(stringlist, Encoders.STRING()).toDF("parkey");

Dataset<Row> df2 = sp.read().format("org.apache.spark.sql.cassandra")
        .options(new HashMap<String, String>() {
            {
                put("keyspace", "keyspacename");
                put("table", "tablename");
            }
        })
        .load()
        .select(col("parkey"), col("col1"), col("col2"))
        .join(df1, "parkey");
However, I would like to know how I can achieve data locality in order to avoid network traffic. So:
Is it possible to use SCC's repartitionByCassandraReplica with the DataFrame API in Java?
Is it possible that repartitionByCassandraReplica happens automatically under the hood in the above example?
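For context, my understanding is that repartitionByCassandraReplica is exposed on the RDD API rather than on DataFrames, which is why I am unsure how to apply it here. A rough sketch of what I imagine the RDD version would look like in Java is below (the ParKey class, the keyspace/table names, and the partitions-per-host value of 10 are illustrative assumptions on my part, not verified code):

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.someColumns;

import java.io.Serializable;
import org.apache.spark.api.java.JavaRDD;

// Hypothetical bean holding only the partition key column of the table.
public class ParKey implements Serializable {
    private String parkey;
    public ParKey() {}
    public ParKey(String parkey) { this.parkey = parkey; }
    public String getParkey() { return parkey; }
    public void setParkey(String parkey) { this.parkey = parkey; }
}

// Somewhere with a JavaSparkContext available:
// JavaRDD<ParKey> keys = jsc.parallelize(...); // built from stringlist

// Supposedly this moves each key to a Spark partition that is co-located
// with a Cassandra replica owning that key's token range, so that a
// subsequent joinWithCassandraTable reads locally instead of over the network.
JavaRDD<ParKey> localKeys = javaFunctions(keys)
        .repartitionByCassandraReplica(
                "keyspacename", "tablename",
                10,                        // partitions per Cassandra host (assumed)
                someColumns("parkey"),     // partition key column(s)
                mapToRow(ParKey.class));
```

Is there an equivalent of this for the Dataset/DataFrame code above, or does the connector's direct join already make it unnecessary?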