
I'm trying to read a multiline JSON message on Spark 2.0.0, but I'm getting _corrupt_record. The code works fine for single-line JSON, and reading the multiline JSON with wholeTextFiles in the REPL also works.

stream.map(record => (record.key(), record.value())).foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    logger.info("----Start of the PersistIPDataRecords Batch processing------")
    //taking only value part of each RDD
    val newRDD = rdd.map(x => x._2.toString())

    logger.info("--------------------Before Loop-----------------")
    newRDD.foreach(println)
    import spark.implicits._
    // printSchema() returns Unit, so assign the DataFrame first and then print
    val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(newRDD)
    df.printSchema()
    logger.info("----Converting RDD to Dataframe-----")
  } else logger.info("---------No data received in RDD-----------")
})
ssc.start()
ssc.awaitTermination()
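
One workaround I've been considering (not tested end to end, and the name flattenedRDD is just for this sketch): since json(...) on an RDD[String] parses each element as one document, collapsing the embedded newlines inside each record should make every element look like a single-line JSON document before it reaches the reader:

// Sketch only: collapse line breaks inside each Kafka record value so that
// spark.read.json sees one single-line JSON document per RDD element.
val flattenedRDD = newRDD.map(_.replaceAll("[\\r\\n]+", " "))
val df = spark.read.option("mode", "PERMISSIVE").json(flattenedRDD)
df.printSchema()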

When I try reading it as a file in the REPL, it works fine:

scala> val df=spark.read.json(spark.sparkContext.wholeTextFiles("/user/maria_dev/jsondata/employees_multiLine.json").values)

JSON file:

{"empno":"7369", "ename":"SMITH", "designation":"CLERK", "manager":"7902", "hire_date":"12/17/1980", "sal":"800", "deptno":"20"}