I have been trying to figure out where exactly the problem is, but I have been unable to do so. I also tried to follow the approach from UDF to generate JSON string behaving inconsistently, but I still can't understand what is going wrong.
Below is my code snippet:
import org.apache.spark.sql.functions.{col, to_json}
import sparkSession.implicits._ // needed for the 'value column syntax

val writingDataset = sparkSession
  .readStream
  .format("kafka")
  .option(kafkaBootstrapServers, urls)
  .option("subscribe", inputTopics)
  .option("startingOffsets", "earliest")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  // Using these two lines instead of the select below doesn't work either:
  // .withColumn("value", parser.parseUDF('value).as("value"))
  // .withColumn("value", to_json('value).as("value"))
  .select(col("key"), to_json(parser.parseUDF('value)).as("value"))
  .writeStream
  .format("console")
  .start()

writingDataset.awaitTermination()
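In case it matters, the configuration values referenced above are plain strings along these lines (anonymized placeholders, not my real values):

// Anonymized placeholders; kafkaBootstrapServers holds the Kafka option key itself.
val kafkaBootstrapServers = "kafka.bootstrap.servers"
val urls = "broker1:9092,broker2:9092" // placeholder broker list
val inputTopics = "some-input-topic"   // placeholder topic name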
Below is the code for the UDF:
import org.apache.spark.sql.functions.udf

val parse = (value: String) => {
  Some(CompanyDetail("something", "something"))
}

val parseUDF = udf(parse)
val keyUDF = udf(keyParse) // keyParse is defined elsewhere (omitted here)
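For reference, CompanyDetail is just a simple class with two String fields; the field names below are placeholders, assuming the usual case class shape:

// Placeholder definition; the actual field names in my project differ,
// but it is a plain case class of String fields like this.
case class CompanyDetail(companyName: String, companyId: String)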
I'm not sure what is happening here, but I keep getting the error below:
org.apache.spark.SparkException: Writing job aborted.
at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$Lambda$7251/0000000000000000.apply(Unknown Source)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
at org.apache.spark.sql.execution.SparkPlan$$Lambda$7280/0000000000000000.apply(Unknown Source)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3383)
at org.apache.spark.sql.Dataset.$anonfun$collect$1(Dataset.scala:2782)
at org.apache.spark.sql.Dataset$$Lambda$7166/000000006C38DB10.apply(Unknown Source)
at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3364)
at org.apache.spark.sql.Dataset$$Lambda$7169/000000006C38F210.apply(Unknown Source)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$$$Lambda$7140/000000006C25F080.apply(Unknown Source)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2782)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$15(MicroBatchExecution.scala:540)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$7136/000000006C25E1B0.apply(Unknown Source)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$$$Lambda$7140/000000006C25F080.apply(Unknown Source)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$14(MicroBatchExecution.scala:536)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$7135/000000006C25DA80.apply(Unknown Source)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:349)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:535)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:198)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$6895/000000006C02DE80.apply$mcV$sp(Unknown Source)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:351)
at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:349)
at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:166)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$Lambda$6893/000000006C02CF10.apply$mcZ$sp(Unknown Source)
at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.runtime.LazyRef