UPDATE: Please hold off on this question for now. I found this might be a problem with Spark 1.5 itself, as I am not using an official release of Spark. I'll keep updating this question. Thank you!
I recently noticed a strange bug when using Spark-CSV to import a CSV file into a DataFrame in Spark.
Here is my sample code:
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// AutoLogger, SingletonSparkContext and SingletonSQLContext are helper objects of my own.
object sparktry
{
  def main(args: Array[String])
  {
    AutoLogger.setLevel("INFO")
    val sc = SingletonSparkContext.getInstance()
    val sql_context = SingletonSQLContext.getInstance(sc)

    // Options passed to the spark-csv reader
    val options = new collection.mutable.HashMap[String, String]()
    options += "header" -> "true"
    options += "charset" -> "UTF-8"

    // Explicit schema for the five columns of CarSales.csv
    val customSchema = StructType(Array(
      StructField("Year", StringType),
      StructField("Brand", StringType),
      StructField("Category", StringType),
      StructField("Model", StringType),
      StructField("Sales", DoubleType)))

    val dataFrame = sql_context.read.format("com.databricks.spark.csv")
      .options(options)
      .schema(customSchema)
      .load("hdfs://myHDFSserver:9000/BigData/CarSales.csv")

    // Log the first 10 rows of the DataFrame
    dataFrame.head(10).foreach(x => AutoLogger.info(x.toString))
  }
}
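For reference, the job depends on the spark-csv package. A minimal sketch of the sbt dependencies follows; the exact version numbers here are my assumption, based on the Spark 1.5 note in the update above:

// build.sbt (sketch; version numbers are assumptions, not necessarily my exact setup)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.5.2" % "provided",
  "com.databricks" %% "spark-csv" % "1.3.0"
)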
CarSales.csv is a very small file. I noticed that when spark.master is not local, setting spark.executor.memory above 16GB results in a corrupted DataFrame. The output of the program is shown below (copied from the log; in this case spark.executor.memory is set to 32GB):
16/03/07 12:39:50.190 INFO DAGScheduler: Job 1 finished: head at sparktry.scala:35, took 8.009183 s
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,142490.0]
16/03/07 12:39:50.225 INFO AutoLogger$: [ , , ,ries ,112464.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,90960.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,100910.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,94371.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,54142.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,14773.0]
16/03/07 12:39:50.226 INFO AutoLogger$: [ , , ,ries ,12276.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,9254.0]
16/03/07 12:39:50.227 INFO AutoLogger$: [ , , ,ries ,12253.0]
The first 10 lines of the file, however, are:
1/1/2007,BMW,Compact,BMW 3-Series,142490.00
1/1/2008,BMW,Compact,BMW 3-Series,112464.00
1/1/2009,BMW,Compact,BMW 3-Series,90960.00
1/1/2010,BMW,Compact,BMW 3-Series,100910.00
1/1/2011,BMW,Compact,BMW 3-Series,94371.00
1/1/2007,BMW,Compact,BMW 5-Series,54142.00
1/1/2007,BMW,Fullsize,BMW 7-Series,14773.00
1/1/2008,BMW,Fullsize,BMW 7-Series,12276.00
1/1/2009,BMW,Fullsize,BMW 7-Series,9254.00
1/1/2010,BMW,Fullsize,BMW 7-Series,12253.00
I noticed that with spark.executor.memory set to 16GB on my machine, the first 10 lines come out correctly, but setting it above 16GB results in the corruption.
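For anyone trying to reproduce this, the sketch below shows roughly how the executor memory is varied between runs, assuming the setting is applied through SparkConf rather than through my SingletonSparkContext helper (the master URL is a placeholder for my cluster):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: "16g" gives correct rows on my machine, while anything
// above that (e.g. "32g") gives the corrupted output shown above.
val conf = new SparkConf()
  .setAppName("sparktry")
  .setMaster("spark://myMaster:7077") // placeholder; the bug does not appear with "local"
  .set("spark.executor.memory", "32g")
val sc = new SparkContext(conf)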
What's more, on one of my servers, which has 256GB of memory, setting it to 16GB also produces this bug, while setting it to 48GB makes it work fine. In addition, I tried printing dataFrame.rdd: the contents of the RDD are correct, while the DataFrame itself is not.
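For completeness, this is roughly the check I used to compare the two, as a sketch built on the dataFrame from the code above:

// Going through the RDD prints the expected values from the file ...
dataFrame.rdd.take(10).foreach(row => AutoLogger.info(row.toString))
// ... while head() on the DataFrame prints the corrupted rows shown above.
dataFrame.head(10).foreach(row => AutoLogger.info(row.toString))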
Does anyone have any idea about this problem?
Thank you!