0

I have found similar post here, but some extra issue appear when I apply this to String variable. Let me explain what I am trying to do. I have a single column DataFrame df1 which contains some place information:

+-------+
|place  |
+-------+
|Place A|
|Place B|
|Place C|
+-------+ 

And another DataFrame df2 as following:

+--+-------+
|id|place  |
+--+-------+
|1| Place A|
|2| Place C|
|3| Place C|
|4| Place B|

I need to loop over df2 to check which place does each id match, and do something on the matched ids. The code snippet is as following:

  val places = df1.distinct.map(_.toString).collect
  for (place <- places){
    val students = df2.where(s"place = '$place'").select("id","place")
    // do something on students (add some unique columns depending the place)
    students.show(2)
} 

The error I got is a SQL ParseException:

extraneous input '[' expecting {'(', ....}
== SQL ==
academic_college = [Place A]
-------------------^^^

My understanding now is that this Parse Exception comes from the places Array after I do the collect operation. It inherently contains "[]":

places = Array([Place A], [Place B], [Place C])

My questions are two folds:

  1. I only know how to collect df1 into Array and loop over it to achieve what I want since the operations to each place is different. If we stay with this approach, what is the best way to remove "[]" or changed it to "()" or do something else to resolve the Parse Exception?

  2. Is there any better way to achieve this without collecting (materialize) df1 and keep everything in DataFrame?

Guanghua Shu
  • 95
  • 4
  • 14

2 Answers2

1

You can get Array[String] from df1 as

val places = df1.distinct().collect().map(_.getString(0))

Now you can select each from the array as

places.foreach(place => {
  val student = df2.where($"place" === place).select("id","place")
  student.show()
})

But make sure this won't efect in your original dataframe.

If df1 is small and can fit in your memory you can collect it in a driver, otherwise, it is not suggested.

If you provide some input and expected output, you could get more help easily.

koiralo
  • 22,594
  • 6
  • 51
  • 72
  • Thanks Shankar, this almost solves the problem. I got a different Parse Exception with this: `extraneous input 'Designated' expecting (line 1, pos 30) == SQL == place = No Place Designated ------------------------^^^` Any idea why? (No Place Designated is another option in df1 which I did not list out before, and this happens to be the first on after collect) – Guanghua Shu Apr 06 '18 at 05:48
  • for (place <- places) { val students = df2.where(s"place = $place") .select("id", "place") students.show(2) } This is what I have tried with the above Parse Exception. – Guanghua Shu Apr 06 '18 at 17:43
  • 1
    you can use foreach as i did above and use === for comparing – koiralo Apr 06 '18 at 17:44
  • Using "===" for comparison does seem to work. I do not know enough to know the exact reason. Is it because SQL expression always looking for a ? Do you mind explain a bit more why one works and the other does not? Thanks. – Guanghua Shu Apr 06 '18 at 17:56
  • `===` is equality check between two columns which is equivalent to `equalTo` function, `=` is an assignment operator which is not used for checking equal. – koiralo Apr 06 '18 at 18:02
  • Sure, I understand this. But inside .where("SQL expression"), it is okay to use "=" to check equality of string, right? That works when I do not have variable with pure string, such as .where("place = 'Place A'"). – Guanghua Shu Apr 06 '18 at 18:07
  • Yeah that won't work because it can't recognize as a sql command. – koiralo Apr 06 '18 at 18:20
  • You can register a table before for the dataframe and perform a query as in sql with `sql()` method if you want in sql way – koiralo Apr 06 '18 at 18:21
1

I need to loop over df2 to check which place does each id match, and do something on the matched ids.

collect() and iterating over collected data is expensive as all processing occurs in driver node.

I would suggest you to use join

lets say you have

df1
+-------+
|place  |
+-------+
|Place A|
|Place B|
+-------+

and

df2
+---+-------+
|id |place  |
+---+-------+
|1  |Place A|
|2  |Place C|
|3  |Place C|
|4  |Place B|
+---+-------+

You can get the matching place with id using join as

df2.join(df1, Seq("place"))

which should give you

+-------+---+
|place  |id |
+-------+---+
|Place A|1  |
|Place B|4  |
+-------+---+

And now you can perform your do something on the matched ids on this dataframe.

I hope the answer is helpful

Ramesh Maharjan
  • 41,071
  • 6
  • 69
  • 97
  • Thanks Ramesh, I agree that collect and for is not the best way. Originally, I thought groupBy("place") could be a solution, but I have fairly complicated operation after groupBy. For example, I will use the "id" in each group as the sourceIds for the ParallelPersonalizePageRank and add a column about the PPR rank to each group. I was not familiar with the RelationalGroup after groupBy operation. Do you think this can be done with groupBy followed by the operations that I described? – Guanghua Shu Apr 06 '18 at 05:56
  • 1
    Yes you can definitely do that. you can performa aggregation on the ids and pass to pagerank. But for that I will have to know the details of the ParallelPersonalizePageRank and that would be another question I guess. – Ramesh Maharjan Apr 06 '18 at 06:57
  • Sorry Ramesh, I had to accept Shankar's answer for this question since it is more directly related for the collect and for method I am trying this time. I will try to do it with groupBy later. I am sure I will face questions for that. I post a more direct question for that method later. Thank you for you confirmation that groupBy is also a viable direction to try. – Guanghua Shu Apr 09 '18 at 01:05
  • @GuanghuaShu thats alright. I don't want my answer to be accepted if there's better and helpful answer to meet your needs. :) – Ramesh Maharjan Apr 09 '18 at 01:52