I'm trying to read a multiline JSON message on Spark 2.0.0, but I'm getting _corrupt_record. The code works fine for single-line JSON, and the multiline JSON also parses correctly when I read it with wholeTextFiles in the REPL. This is the streaming code that fails:
stream.map(record => (record.key(), record.value())).foreachRDD(rdd => {
  if (!rdd.isEmpty()) {
    logger.info("----Start of the PersistIPDataRecords Batch processing------")
    // Take only the value part of each (key, value) record
    val newRDD = rdd.map(x => x._2.toString())
    logger.info("--------------------Before Loop-----------------")
    newRDD.foreach(println)
    import spark.implicits._
    // Keep the DataFrame in df; printSchema() returns Unit, so call it separately
    val df = spark.read.option("multiLine", true).option("mode", "PERMISSIVE").json(newRDD)
    df.printSchema()
    logger.info("----Converting RDD to Dataframe-----")
  } else logger.info("---------No data received in RDD-----------")
})
ssc.start()
ssc.awaitTermination()
When I read the same data as a file in the REPL, it works fine:
scala> val df=spark.read.json(spark.sparkContext.wholeTextFiles("/user/maria_dev/jsondata/employees_multiLine.json").values)
JSON file:
{"empno":"7369", "ename":"SMITH", "designation":"CLERK", "manager":"7902", "hire_date":"12/17/1980", "sal":"800", "deptno":"20"}
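To separate the parsing step from the streaming source, the REPL check above can be sketched without a file. This is only a minimal sketch: the local SparkSession, the app name, and the shortened three-field record are assumptions made for illustration, not my actual data. It builds an RDD[String] whose single element contains a whole JSON object spread over several lines, which is the same shape that wholeTextFiles(...).values produces.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder()
  .appName("multiline-json-check") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical record: one JSON object spread over several lines.
val multiLineRecord =
  """{
    |  "empno": "7369",
    |  "ename": "SMITH",
    |  "deptno": "20"
    |}""".stripMargin

// One RDD element holds one complete JSON document.
val rdd = spark.sparkContext.parallelize(Seq(multiLineRecord))

// json(RDD[String]) parses each element as a single record.
val df = spark.read.json(rdd)
df.printSchema()
df.show()
```

If this parses cleanly while the streaming job still yields _corrupt_record, the difference is probably in what each Kafka record value actually contains (for example a fragment of a document rather than a complete one), which the newRDD.foreach(println) output should reveal.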