
I am running a simple query in Spark 3.2:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show() 

The above query returns a result, but when I mix the casing of the duplicate column it throws an exception:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_diff_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_diff_case.head, op_cols_diff_case.tail: _*)
df2.select("id").show() 

In my test, spark.sql.caseSensitive was left at its default (false).
I would expect both queries either to return a result or both to fail.
Why does it fail for one and not for the other?

  • Note: spark.sql.caseSensitive controls whether column names are treated as case-sensitive or case-insensitive, but it does not affect the behavior of DataFrame API methods that take column names as arguments, such as df.select("id"). These methods treat column names as case-sensitive regardless of the spark.sql.caseSensitive setting, and will raise an exception if there are two columns with the same name in different cases. – ASR Feb 23 '23 at 17:49

1 Answer


Whether this is an issue or a non-issue depends on what seems logical to you. There is a long thread on this pull request, where some believe the behavior is correct while others think it's wrong.

But the changes in that pull request do make the behavior consistent.
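In the meantime, one way to sidestep the exception is to deduplicate the column list case-insensitively before calling select, so "id" and "ID" collapse to a single entry and the ambiguity never arises. A minimal sketch (the dedupIgnoreCase helper is my own, not part of Spark; op_cols_diff_case and df1 are the names from the question):

```scala
// Drop duplicate column names, comparing case-insensitively to match
// the default spark.sql.caseSensitive=false resolution.
def dedupIgnoreCase(cols: List[String]): List[String] =
  cols.foldLeft(List.empty[String]) { (acc, c) =>
    if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
  }

val op_cols_diff_case = List("id", "col2", "col3", "col4", "col5", "ID")
val deduped = dedupIgnoreCase(op_cols_diff_case)
// deduped == List("id", "col2", "col3", "col4", "col5")

// Then select as before; df2.select("id") no longer hits the ambiguity:
// val df2 = df1.select(deduped.head, deduped.tail: _*)
// df2.select("id").show()
```

This keeps the first occurrence of each name, which matches what the same-case query effectively gives you.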

user16217248