
UPDATE: Please hold off on this question for now. I have found that this may be a problem in Spark 1.5 itself, as I am not using the official build of Spark. I'll keep this question updated. Thank you!

I noticed a strange bug recently when using Spark-CSV to import a CSV file into a DataFrame in Spark.

Here is my sample code:

  // Imports for the schema types used below (StructType, StructField, etc.)
  import org.apache.spark.sql.types._

  object sparktry
  {
    def main(args: Array[String])
    {
      AutoLogger.setLevel("INFO")

      val sc = SingletonSparkContext.getInstance()
      val sql_context = SingletonSQLContext.getInstance(sc)

      val options = new collection.mutable.HashMap[String, String]()
      options += "header" -> "true"
      options += "charset" -> "UTF-8"

      val customSchema = StructType(Array(
        StructField("Year", StringType),
        StructField("Brand", StringType),
        StructField("Category", StringType),
        StructField("Model", StringType),
        StructField("Sales", DoubleType)))

      val dataFrame = sql_context.read.format("com.databricks.spark.csv")
      .options(options)
      .schema(customSchema)
      .load("hdfs://myHDFSserver:9000/BigData/CarSales.csv")

      dataFrame.head(10).foreach(x => AutoLogger.info(x.toString))
    }
  }
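
For reference, here is a hedged sketch of how spark.executor.memory is presumably being set when the context is built (the internals of SingletonSparkContext are not shown in the question; the master URL and memory value below are placeholders, not the actual configuration):

  ```scala
  // Hypothetical sketch only: the master URL "spark://myCluster:7077" and the
  // memory value are placeholders standing in for the real configuration.
  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("spark://myCluster:7077")  // a non-local master, where the bug appears
    .set("spark.executor.memory", "32g")  // values above 16g triggered the corruption
  val sc = new SparkContext(conf)
  ```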

CarSales.csv is a very small file. I noticed that when spark.master is not local, setting spark.executor.memory above 16GB corrupts the DataFrame. The output of the program is shown below (I copied the text from the log; in this case spark.executor.memory was set to 32GB):

16/03/07 12:39:50.190 INFO DAGScheduler: Job 1 finished: head at sparktry.scala:35, took 8.009183 s
16/03/07 12:39:50.225 INFO AutoLogger$: [       ,  ,      ,ries       ,142490.0]
16/03/07 12:39:50.225 INFO AutoLogger$: [       ,  ,      ,ries       ,112464.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,90960.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,100910.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,94371.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,      ,ries       ,54142.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,       ,ries       ,14773.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [       ,  ,       ,ries       ,12276.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [       ,  ,       ,ries       ,9254.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [       ,  ,       ,ries       ,12253.0]

The first 10 lines of the file, however, are:

1/1/2007,BMW,Compact,BMW 3-Series,142490.00
1/1/2008,BMW,Compact,BMW 3-Series,112464.00
1/1/2009,BMW,Compact,BMW 3-Series,90960.00
1/1/2010,BMW,Compact,BMW 3-Series,100910.00
1/1/2011,BMW,Compact,BMW 3-Series,94371.00
1/1/2007,BMW,Compact,BMW 5-Series,54142.00
1/1/2007,BMW,Fullsize,BMW 7-Series,14773.00
1/1/2008,BMW,Fullsize,BMW 7-Series,12276.00
1/1/2009,BMW,Fullsize,BMW 7-Series,9254.00
1/1/2010,BMW,Fullsize,BMW 7-Series,12253.00

I noticed that on my machine, simply setting spark.executor.memory to 16GB makes the first 10 lines come out correctly, but setting it above 16GB causes the corruption.

What's more, on one of my servers, which has 256GB of memory, setting it to 16GB also triggers the bug, while setting it to 48GB makes it work fine. In addition, I tried printing dataFrame.rdd, and it shows that the content of the RDD is correct, while the DataFrame itself is not.
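
A hedged sketch of the check described above, comparing rows from the underlying RDD with rows returned through the DataFrame API (this assumes the `dataFrame` and `AutoLogger` from the question's code and a running Spark context; it is illustrative, not a tested reproduction):

  ```scala
  // Compare the first 10 rows as seen by the RDD versus the DataFrame.
  // In the scenario above, fromRdd is correct while fromDf is corrupted.
  val fromRdd = dataFrame.rdd.take(10).map(_.toString)
  val fromDf  = dataFrame.head(10).map(_.toString)
  fromRdd.zip(fromDf).foreach { case (r, d) =>
    if (r != d) AutoLogger.info(s"mismatch: rdd=$r df=$d")
  }
  ```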

Does anyone have any idea about this problem?

Thank you!

DarkZero

2 Answers


It turns out to be a bug in Kryo serialization in Spark 1.5.1 & 1.5.2.

https://github.com/databricks/spark-csv/issues/285#issuecomment-193633716

This is fixed in 1.6.0. It has nothing to do with spark-csv.
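
Until an upgrade to 1.6.0 is possible, one possible workaround (assuming the job explicitly enables the Kryo serializer, which is not the Spark default) might be to fall back to Java serialization, e.g. in spark-defaults.conf:

  ```
  # Untested suggestion: revert to the default serializer to avoid the Kryo bug
  spark.serializer  org.apache.spark.serializer.JavaSerializer
  ```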

DarkZero

I ran your code and was able to fetch the CSV data from HDFS with the default Spark configuration.

I updated your code with the following lines:

val conf = new org.apache.spark.SparkConf().setMaster("local[2]").setAppName("HDFSReadDemo")
val sc = new org.apache.spark.SparkContext(conf)
val sql_context = new org.apache.spark.sql.SQLContext(sc)

And used println() in place of the logger:

dataFrame.head(10).foreach(x => println(x))

So nothing seems to be wrong with the Spark memory configuration (i.e. spark.executor.memory).

Mahendra
  • Thank you for your answer. In fact, this bug won't appear when Spark runs in local mode. I'm sorry for not mentioning that... – DarkZero Mar 08 '16 at 01:41