UPDATE: Please hold off on this question for now. I found this might be a problem with Spark 1.5 itself, as I am not using an official release of Spark. I'll keep updating this question. Thank you!
I recently noticed a strange bug when using Spark-CSV to import a CSV file into a DataFrame in Spark.
Here is my sample code:
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// AutoLogger, SingletonSparkContext and SingletonSQLContext are helper objects of my own.
object sparktry
{
  def main(args: Array[String])
  {
    AutoLogger.setLevel("INFO")
    val sc = SingletonSparkContext.getInstance()
    val sql_context = SingletonSQLContext.getInstance(sc)

    // Options passed to the spark-csv reader
    val options = new collection.mutable.HashMap[String, String]()
    options += "header" -> "true"
    options += "charset" -> "UTF-8"

    // Explicit schema for the five columns of CarSales.csv
    val customSchema = StructType(Array(
      StructField("Year", StringType),
      StructField("Brand", StringType),
      StructField("Category", StringType),
      StructField("Model", StringType),
      StructField("Sales", DoubleType)))

    val dataFrame = sql_context.read.format("com.databricks.spark.csv")
      .options(options)
      .schema(customSchema)
      .load("hdfs://myHDFSserver:9000/BigData/CarSales.csv")

    // Log the first 10 rows of the DataFrame
    dataFrame.head(10).foreach(x => AutoLogger.info(x.toString))
  }
}
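For reference, the job depends on the spark-csv package. A minimal sketch of the sbt dependencies follows; the exact version numbers here are my assumption, based on the Spark 1.5 note in the update above:

// build.sbt (sketch; version numbers are assumptions, not necessarily my exact setup)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided",
  "com.databricks" %% "spark-csv" % "1.3.0"
)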
CarSales.csv is a very small file. I noticed that when spark.master is not local, setting spark.executor.memory above 16GB results in a corrupted DataFrame. The output of the program is shown below (copied from the log; in this case spark.executor.memory is set to 32GB):
16/03/07 12:39:50.190 INFO DAGScheduler: Job 1 finished: head at sparktry.scala:35, took 8.009183 s
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,142490.0]
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,112464.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,90960.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,100910.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,94371.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,54142.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,14773.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,12276.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,9254.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,12253.0]
The first 10 lines of the file, however, are:
1/1/2007,BMW,Compact,BMW 3-Series,142490.00
1/1/2008,BMW,Compact,BMW 3-Series,112464.00
1/1/2009,BMW,Compact,BMW 3-Series,90960.00
1/1/2010,BMW,Compact,BMW 3-Series,100910.00
1/1/2011,BMW,Compact,BMW 3-Series,94371.00
1/1/2007,BMW,Compact,BMW 5-Series,54142.00
1/1/2007,BMW,Fullsize,BMW 7-Series,14773.00
1/1/2008,BMW,Fullsize,BMW 7-Series,12276.00
1/1/2009,BMW,Fullsize,BMW 7-Series,9254.00
1/1/2010,BMW,Fullsize,BMW 7-Series,12253.00
I noticed that with spark.executor.memory set to 16GB on my machine, the first 10 lines come out correctly, but setting it above 16GB results in the corruption.
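For anyone trying to reproduce this, the sketch below shows roughly how the executor memory is varied between runs, assuming the setting is applied through SparkConf rather than through my SingletonSparkContext helper (the master URL is a placeholder for my cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "16g" gives correct rows on my machine, while anything
// above that (e.g. "32g") gives the corrupted output shown above.
val conf = new SparkConf()
  .setAppName("sparktry")
  .setMaster("spark://myMaster:7077") // placeholder; the bug does not appear with "local"
  .set("spark.executor.memory", "32g")
val sc = new SparkContext(conf)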
What's more, on one of my servers, which has 256GB of memory, setting it to 16GB also produces this bug, while setting it to 48GB makes it work fine. In addition, I tried printing dataFrame.rdd: the contents of the RDD are correct, while the DataFrame itself is not.
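For completeness, this is roughly the check I used to compare the two, as a sketch built on the dataFrame from the code above:

// Going through the RDD prints the expected values from the file ...
dataFrame.rdd.take(10).foreach(row => AutoLogger.info(row.toString))
// ... while head() on the DataFrame prints the corrupted rows shown above.
dataFrame.head(10).foreach(row => AutoLogger.info(row.toString))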
Does anyone have any idea about this problem?
Thank you!