I develop Spark code using the Scala APIs in IntelliJ. When I run the snippet below I get the error shown, although the same code runs fine in a Databricks notebook.
I am using Databricks Connect to connect from my local IntelliJ installation to the Databricks Spark cluster. I am connected to the cluster and was also able to submit a job from IntelliJ to the cluster. As a matter of fact, everything else works except the piece below.
Databricks Connect is 6.1 and the Databricks Runtime is 6.2. I imported the jar files from the cluster (using databricks-connect get-jar-dir) and set up the SBT project with those jars in the project library.
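(If it helps, the SBT wiring looks roughly like the sketch below; the jar directory path is only a placeholder for whatever databricks-connect get-jar-dir returns on your machine.)

// build.sbt (sketch): add the Databricks Connect jars as unmanaged dependencies.
// Replace the path with the output of `databricks-connect get-jar-dir`.
val dbConnectJarDir = file("/path/to/databricks-connect/jars") // placeholder path

unmanagedJars in Compile ++= (dbConnectJarDir ** "*.jar").classpath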
source code:
import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder.getOrCreate()
val sparkContext = sparkSession.sparkContext
import sparkSession.implicits._

// read the file as an RDD[String] and print the first two lines
val v_textFile_read = sparkContext.textFile(v_filename_path)
v_textFile_read.take(2).foreach(println)
Error:
cannot assign instance of scala.Some to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of
type scala.collection.Seq in instance of org.apache.spark.rdd.HadoopRDD
The reason I use an RDD reader for the text file is that I want to pass its output to the createDataFrame API, which takes an RDD and a schema as its input parameters.
step-1: val v_RDD_textFile_read = sparkContext.textFile(v_filename_path).map(x => MMRSplitRowIntoStrings(x))
step-2: val v_DF_textFile_read = sparkSession.sqlContext.createDataFrame(v_RDD_textFile_read, v_schema)
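(For context, this is the usual createDataFrame(RDD[Row], schema) pattern. A generic, self-contained sketch of it would look roughly like the below; the column names, delimiter, and split logic are placeholders, not my actual MMRSplitRowIntoStrings and v_schema.)

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()
val v_filename_path = "/path/to/input.txt" // placeholder path

// hypothetical schema: two string columns
val v_schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true)
))

// hypothetical row splitter: split each line on a comma into a Row of strings
def MMRSplitRowIntoStrings(line: String): Row = Row.fromSeq(line.split(",", -1).toSeq)

val v_RDD_textFile_read = spark.sparkContext.textFile(v_filename_path).map(MMRSplitRowIntoStrings)
val v_DF_textFile_read = spark.sqlContext.createDataFrame(v_RDD_textFile_read, v_schema)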