RDD toDF() : Erroneous Behavior

Question

I built a Spark Streaming app that fetches content from a Kafka queue and is meant to write the data into a MySQL table after some pre-processing and structuring.

I call the 'foreachRDD' method on the stream returned by KafkaUtils.createStream. The issue I'm facing is data loss between calling saveAsTextFile on the RDD and calling the DataFrame's write method with format("csv"). I can't pinpoint why this is happening.

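import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// spark, topics, numThreads, zkQuorum, group, dumpPath, Record and autoTag are defined elsewhere in the app.
val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.checkpoint("checkpoint")

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)

stream.foreachRDD { rdd =>
  // Dump the raw batch as text so it can be compared with the DataFrame output.
  rdd.saveAsTextFile("/Users/jarvis/rdds/" + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_rdd")

  import spark.implicits._

  // Split each message on tabs and build one Record per message.
  val messagesDF = rdd
    .map(_.split("\t"))
    .map { w =>
      Record(w(0), autoTag(w(1), w(4)), w(2), w(3), w(4), w(5).substring(w(5).lastIndexOf("http://")), w(6).split("\n")(0))
    }
    .toDF("recordTS", "tag", "channel_url", "title", "description", "link", "pub_TS")

  messagesDF.write.format("csv").save(dumpPath + new SimpleDateFormat("hh-mm-ss-dd-MM-yyyy").format(new Date) + "_DF")
}

ssc.start()
ssc.awaitTermination()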

There's data loss, i.e. many rows don't make it from the RDD to the DataFrame. There's also duplication: many rows that do reach the DataFrame are repeated many times.

What you can do is convert the `rdd` to a `df` first, then write the same DF to both a `csv` and a `text` file. To save the DF to a text file, try `df.write.text("file path")` - Shankar, Oct 27 '16 at 12:19
Also, you can `cache` the DF before writing it to the CSV and text files. - Shankar, Oct 27 '16 at 12:21

score 0 · Answer 1 · answered Oct 29 '16 at 08:57

Found the error. I had misunderstood the format of the ingested data.

The data I expected was "\t\t\t...", so each row was supposed to be split at "\n".

However, the actual data was "\t\t\t...\n\t\t\t...\n", i.e. a single message carried several newline-terminated records.

So the rdd.map(...) operation needed another map to split at every "\n".
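As a rough illustration of that extra splitting step, here is a minimal sketch that uses plain Scala collections instead of an RDD so it runs without Spark. It uses flatMap rather than a second map, so that each message can yield several records; that choice is an assumption about one reasonable way to do it, not the exact code used in the fix. The object name, the helper splitMessage and the sample payloads are hypothetical.

```scala
object SplitRecords {
  // Split one message into its records at "\n", then split each record into fields at "\t".
  // The -1 limit keeps trailing empty fields; empty lines are dropped.
  def splitMessage(message: String): Seq[Array[String]] =
    message.split("\n").toSeq.filter(_.nonEmpty).map(_.split("\t", -1))

  def main(args: Array[String]): Unit = {
    // Hypothetical payloads: the first message carries two records, the second carries one.
    val messages = Seq("a\tb\tc\nd\te\tf\n", "g\th\ti\n")

    // flatMap flattens the per-message record lists into a single list of records.
    val records = messages.flatMap(splitMessage)

    records.foreach(r => println(r.mkString("|")))
  }
}
```

RDDs expose flatMap as well, so the same shape should carry over to the streaming job: something like rdd.flatMap(_.split("\n")) placed before the existing map(_.split("\t")) would produce one row per record instead of one row per message.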
