
I am running a simple query in Spark 3.2:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show() 

The above query returns a result, but when I mix the casing of the duplicate column it throws an exception:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_diff_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_diff_case.head, op_cols_diff_case.tail: _*)
df2.select("id").show() 

In my test, spark.sql.caseSensitive was left at its default (false).
I would expect both queries either to return a result or both to fail.
Why does it fail for one and not for the other?

  • Note: spark.sql.caseSensitive controls whether column names are treated as case-sensitive or case-insensitive, but it does not affect the behavior of DataFrame API methods that take column names as arguments, such as df.select("id"). These methods treat column names as case-sensitive regardless of the spark.sql.caseSensitive setting, and will raise an exception if there are two columns with the same name in different cases. – ASR Feb 23 '23 at 17:49

1 Answer


Whether this is an issue or a non-issue depends on what seems logical to you. There is a long thread on this pull request, where some believe the behavior is correct while others think it's wrong.

But the changes in that pull request do make the behavior consistent.
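In the meantime, one way to sidestep the exception is to deduplicate the column list case-insensitively before calling select, so "id" and "ID" collapse to a single entry and the ambiguity never arises. A minimal sketch (the dedupIgnoreCase helper is my own, not part of Spark; op_cols_diff_case and df1 are the names from the question):

```scala
// Drop duplicate column names, comparing case-insensitively to match
// the default spark.sql.caseSensitive=false resolution.
def dedupIgnoreCase(cols: List[String]): List[String] =
  cols.foldLeft(List.empty[String]) { (acc, c) =>
    if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
  }

val op_cols_diff_case = List("id", "col2", "col3", "col4", "col5", "ID")
val deduped = dedupIgnoreCase(op_cols_diff_case)
// deduped == List("id", "col2", "col3", "col4", "col5")

// Then select as before; df2.select("id") no longer hits the ambiguity:
// val df2 = df1.select(deduped.head, deduped.tail: _*)
// df2.select("id").show()
```

This keeps the first occurrence of each name, which matches what the same-case query effectively gives you.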

user16217248