I want to optimize an RDD join against Cassandra in Spark. I am reading data from a Parquet file and joining it with a Cassandra table using the DataStax Cassandra Connector, but the join fails with the error "Invalid row size: 6 instead of 4". Here are the details.
import com.datastax.spark.connector._ // needed for joinWithCassandraTable and SomeColumns
import com.datastax.spark.connector.cql.CassandraConnector

val ip15M = sqlContext.read.parquet("/home/hadoop/work/data").toDF()

ip15M.dtypes
res8: Array[(String, String)] = Array((key1,StringType), (key2,StringType), (key3,StringType), (column1,StringType), (fact1,StringType), (fact2,StringType))

val joinWithRDD = ip15M.rdd.joinWithCassandraTable("key", "tabl1").on(SomeColumns("key1", "key2", "key3", "column1"))
joinWithRDD.take(10).foreach(println)
I have the following Cassandra table:
CREATE TABLE key.tabl1 (
key1 text,
key2 text,
key3 text,
column1 text,
value1 text,
value2 text,
PRIMARY KEY ((key1, key2, key3), column1)
) WITH CLUSTERING ORDER BY (column1 ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99p';
I am getting the error below:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 9.0 failed 4 times, most recent failure: Lost task 0.3 in stage 9.0 (TID 332, mr25p01if-ingx03030701.mr.if.apple.com, executor 146): java.lang.IllegalArgumentException: requirement failed: Invalid row size: 6 instead of 4.
I believe the error occurs because the RDD rows have 6 columns while the join is only on the 4 primary key columns (key1, key2, key3, column1). I need the fact columns to stay in the RDD, since I have to update their values based on the join, and I am not sure how to resolve this.
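My guess is that I would have to map the Row RDD down to just the four join columns, roughly like the sketch below (the getAs accessors are my assumption about the schema), but then fact1 and fact2 are no longer on the left side of the join:

// map each Row to a 4-tuple matching the join columns, so the row size is 4
val keyedRDD = ip15M.rdd.map(r => (
  r.getAs[String]("key1"),
  r.getAs[String]("key2"),
  r.getAs[String]("key3"),
  r.getAs[String]("column1")
))
val keyedJoin = keyedRDD.joinWithCassandraTable("key", "tabl1")
  .on(SomeColumns("key1", "key2", "key3", "column1"))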
I tried running the join both with and without the .on clause, but I get the same error either way. From what I can tell, .on specifies the Cassandra-side columns, not the RDD-side ones.
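For completeness, the variant without .on was just the default call, which, as far as I understand, joins on the partition key (key1, key2, key3):

// same kind of failure: without .on, the join defaults to the partition key columns
val joinDefault = ip15M.rdd.joinWithCassandraTable("key", "tabl1")
joinDefault.take(10).foreach(println)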
Let me know if any other details are needed.
Update: if I create an RDD with sc.parallelize, the join works. It seems that when I read the data from a file and convert the DataFrame to an RDD, the schema is lost.
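For reference, the working parallelize version looked roughly like this (the sample values are made up):

// a hand-built RDD of 4-tuples joins without the row-size error
val manualRDD = sc.parallelize(Seq(
  ("k1", "k2", "k3", "c1"),
  ("k1", "k2", "k3", "c2")
))
val manualJoin = manualRDD.joinWithCassandraTable("key", "tabl1")
  .on(SomeColumns("key1", "key2", "key3", "column1"))
manualJoin.take(10).foreach(println)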
Any help is appreciated.