
Right now I'm facing a problem that I can't solve; let me explain.

I need to pivot a Spark DataFrame, but in some cases there are no records for a given type, so the pivot does not produce the column that I need. Here is an example:

Let's say that there are 4 types of events: A, B, C, D

I receive event log files like this:

|Id|year|type|
|--|----|----|
|a |2015|   A|
|a |2016|   A|
|a |2015|   A|
|a |2017|   A|
|b |2015|   A|
|b |2015|   B|
|b |2016|   D|
|b |2015|   B|
|b |2017|   A|  
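
Roughly, the pivot I do looks like this (a simplified sketch, not my exact code; `df` is the DataFrame loaded from the log file):

import org.apache.spark.sql.functions._

df.groupBy("Id", "year")
    .pivot("type")
    .agg(count("type"))
    .show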

When I do the pivot I get:

|id|year|   A|   B|   D|
|--|----|----|----|----|
|a |2015|   2|null|null|
|a |2016|   1|null|null|
|a |2017|   1|null|null|
|b |2015|   1|   2|null|
|b |2016|   1|null|   1|
|b |2017|   1|null|null|

What I really need is:

|id|year|   A|   B|   C|   D|
|--|----|----|----|----|----|
|a |2015|   2|   0|   0|   0|
|a |2016|   1|   0|   0|   0|
|a |2017|   1|   0|   0|   0|
|b |2015|   1|   2|   0|   0|
|b |2016|   1|   0|   0|   1|
|b |2017|   1|   0|   0|   0|

And imagine that each event log file I receive is missing different event types, but I always need to have the columns for all of the event types (A, B, C and D).

I hope that I explained myself well.

1 Answer

import org.apache.spark.sql.functions._
import spark.implicits._ // for toDF; assumes a SparkSession in scope as `spark`

// One row per expected event type, so every type yields a pivot column
val lscol = Seq("A", "B", "C", "D").toDF("Type")

//df is the initial dataframe prior to pivot
df.join(lscol, df.col("Type") === lscol.col("Type"), "right")
    .drop(df("Type"))                                 // keep lscol's Type column
    .groupBy("ID", "year")
    .pivot("Type")
    .agg(count("Type"))
    .filter(!col("ID").isNull && !col("year").isNull) // drop the null group produced by types absent from df
    .na
    .fill(0)                                          // remaining nulls mean zero events
    .show
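
As a side note, `pivot` also accepts an explicit list of values, which forces all four columns without the extra join; a minimal sketch of that variant, under the same assumptions about `df`:

// pivot(column, values) yields exactly these columns, present in the data or not
df.groupBy("ID", "year")
    .pivot("Type", Seq("A", "B", "C", "D"))
    .agg(count("Type"))
    .na.fill(0)
    .show

Passing the values up front also spares Spark the extra job it otherwise runs to discover the distinct pivot values.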
  • Thanks a lot for your answer! Sorry for the delay, I was out of the country with no internet. I'll keep you updated! – izelcp Sep 11 '18 at 19:24