I have found similar post here, but some extra issue appear when I apply this to String variable. Let me explain what I am trying to do. I have a single column DataFrame df1 which contains some place information:
+-------+
|place |
+-------+
|Place A|
|Place B|
|Place C|
+-------+
And another DataFrame df2 as following:
+--+-------+
|id|place |
+--+-------+
|1| Place A|
|2| Place C|
|3| Place C|
|4| Place B|
I need to loop over df2 to check which place does each id match, and do something on the matched ids. The code snippet is as following:
val places = df1.distinct.map(_.toString).collect
for (place <- places){
val students = df2.where(s"place = '$place'").select("id","place")
// do something on students (add some unique columns depending the place)
students.show(2)
}
The error I got is a SQL ParseException:
extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^
My understanding now is that this Parse Exception comes from the places Array after I do the collect operation. It inherently contains "[]":
places = Array([Place A], [Place B], [Place C])
My questions are two folds:
I only know how to collect df1 into Array and loop over it to achieve what I want since the operations to each place is different. If we stay with this approach, what is the best way to remove "[]" or changed it to "()" or do something else to resolve the Parse Exception?
Is there any better way to achieve this without collecting (materialize) df1 and keep everything in DataFrame?