
I have multiple database rows per PersonId with columns that may or may not have values - I'm using colors here because the data is text, not numeric, so it doesn't lend itself to built-in aggregation functions. A simplified example is:

PersonId    ColA    ColB    ColC
100         red
100                 green
100                         gold
100         green
110                 yellow
110         white
110   
120         
etc...

I want to be able to decide in a function which column data to use per unique PersonId. A three-way join of the table against itself would be a good solution if the data didn't have multiple values (colors) per column. E.g. that join merges three of the rows into one, but still produces multiple rows:

PersonId    ColA    ColB    ColC
100         red     green   gold
100         green                                   
110         white   yellow
110   
120

So the solution I'm looking for is something that will allow me to address all the values (colors) for a person in one place (function), so the decision can be made across all their data. The real data of course has more columns, but the primary ones for this decision are these three. The data is being read in Scala Spark as a DataFrame, and I'd prefer the DataFrame API to SQL. I don't know whether any of the exotic window or groupBy functions will help, or if it comes down to plain old iterate-and-accumulate. The technique used in "How to aggregate values into collection after groupBy?" might be applicable, but it's a bit of a leap.
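One way to get all of a person's values in one place, as asked, is to collect each row's columns into an array of structs per PersonId and hand that array to a single UDF. This is a minimal sketch, not from the question itself; `decide` is a hypothetical placeholder (here it just takes the first non-null value per column) to be replaced with the real cross-column logic:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

// Gather every (ColA, ColB, ColC) triple for a person into one array,
// so a single function can see all of that person's rows at once.
val grouped = df
  .groupBy($"PersonId")
  .agg(collect_list(struct($"ColA", $"ColB", $"ColC")).as("rows"))

// Hypothetical decision UDF: picks the first non-null value seen per
// column. Swap the body for the real decision across all the rows.
val decide = udf { (rows: Seq[Row]) =>
  def firstDefined(i: Int): String =
    rows.flatMap(r => Option(r.getString(i))).headOption.orNull
  (firstDefined(0), firstDefined(1), firstDefined(2))
}

grouped.select($"PersonId", decide($"rows").as("picked")).show()
```

Because the UDF receives the whole array for a person, the choice for ColA can depend on what appears in ColB and ColC, which a per-column aggregate cannot do.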

wholeroll
  • I don't have time for a full answer, but have you looked at lag/lead or other window operations – Justin Pihony Jan 20 '17 at 16:32
  • why in this example the second line for 100 is green and the first is red green gold instead of the first being red and the second green green gold? can you explain more on how you want to combine them? – Assaf Mendelson Jan 20 '17 at 16:54
  • That was just showing what the 3-way join might produce. I initially thought the 3-way (self) join would just return 1 row - which it does unless each column has different values in other rows. The real issue is simply coping with the data spread across multiple rows and columns. – wholeroll Jan 20 '17 at 17:22
  • I have mainly worked with RDDs in Java Spark, and in that environment, you can first do a 'flatMap' to map this RDD into a paired RDD where you have PersonId as the key and one other column that represents the colors associated with the key. Then you may 'aggregateByKey' and will get what you need. If you think this approach may help, I can elaborate. – TrnKh Jan 20 '17 at 18:32
  • Please do, this sounds very promising - although I need to have three pieces of information the PersonId, color and which column the color was in. Which color is chosen in which column is significant. – wholeroll Jan 22 '17 at 11:07
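The RDD approach suggested in the comments could be sketched as follows (a rough Scala rendering of the Java idea, not the commenter's actual code). Pairing the column name with the color preserves which column each value came from, as the follow-up comment requires:

```scala
// Explode each row into (PersonId, (columnName, color)) pairs,
// skipping nulls; the column name records where the color came from.
val paired = df.rdd.flatMap { row =>
  val pid = row.getInt(0)
  Seq("ColA", "ColB", "ColC").zipWithIndex.collect {
    case (name, i) if !row.isNullAt(i + 1) =>
      (pid, (name, row.getString(i + 1)))
  }
}

// aggregateByKey accumulates all (column, color) pairs per PersonId,
// giving one record per person on which to run the decision logic.
val perPerson = paired.aggregateByKey(List.empty[(String, String)])(
  (acc, v) => v :: acc,   // merge a value into a partition's list
  (a, b) => a ::: b       // merge lists across partitions
)
```

This assumes PersonId is the first column and is an Int; adjust the accessors to the real schema.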

1 Answer


Consider using a custom UDF for this.

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and the $"..." syntax (already in scope in spark-shell)

val df = Seq(
  (100, "red", null, null),
  (100, null, "white", null),
  (100, null, null, "green"),
  (200, null, "red", null)
).toDF("PID", "A", "B", "C")

df.show()
+---+----+-----+-----+
|PID|   A|    B|    C|
+---+----+-----+-----+
|100| red| null| null|
|100|null|white| null|
|100|null| null|green|
|200|null|  red| null|
+---+----+-----+-----+

// Return the first non-empty color collected for a column, or null if none.
// collect_set already drops nulls, so the filter only removes empty strings.
val customUDF = udf((colors: Seq[String]) =>
  colors.filter(_.nonEmpty).headOption.orNull
)

df.groupBy($"PID")
  .agg(
    customUDF(collect_set($"A")).as("colA"),
    customUDF(collect_set($"B")).as("colB"),
    customUDF(collect_set($"C")).as("colC")
  )
  .show()

+---+----+-----+-----+
|PID|colA| colB| colC|
+---+----+-----+-----+
|100| red|white|green|
|200|null|  red| null|
+---+----+-----+-----+
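If "first non-empty value per column" is really all the decision needs, a built-in aggregate can replace the UDF entirely. A sketch under that assumption (note that when a column has several candidate values per person, which one `first` returns depends on row order, so the result may be nondeterministic without an explicit ordering):

```scala
// first(..., ignoreNulls = true) skips nulls, matching the UDF's behavior
// for single-valued columns.
df.groupBy($"PID")
  .agg(
    first($"A", ignoreNulls = true).as("colA"),
    first($"B", ignoreNulls = true).as("colB"),
    first($"C", ignoreNulls = true).as("colC")
  )
  .show()
```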


Sarath Chandra Vema