
I have a spark dataframe like below

id|name|age|sub
1 |ravi|21 |[M,J,J,K]

I don't want to explode the column "sub", since that would create an extra set of rows. Instead, I want to generate the unique values from the "sub" column and assign them to a new column, sub_unique.

My output should be like

id|name|age|sub_unique
1 |ravi|21 |[M,J,K]
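
For reference, sample data of this shape can be built roughly like this (the appName and variable names are arbitrary, values taken from the table above):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sub-unique-example").getOrCreate()
import spark.implicits._

// one row matching the layout above
val df = Seq((1, "ravi", 21, Seq("M", "J", "J", "K")))
  .toDF("id", "name", "age", "sub")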

1 Answer


You can use a UDF:

import org.apache.spark.sql.functions.udf

val distinct = udf((x: Seq[String]) => if (x != null) x.distinct else Seq[String]())

df.withColumn("sub_unique", distinct($"sub"))
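
For example, applied to the sample DataFrame from the question this would produce something like the following (column widths in the printed output are illustrative):

// keep the array in a single row, with duplicates removed
df.withColumn("sub_unique", distinct($"sub")).show()
// +---+----+---+------------+----------+
// | id|name|age|         sub|sub_unique|
// +---+----+---+------------+----------+
// |  1|ravi| 21|[M, J, J, K]| [M, J, K]|
// +---+----+---+------------+----------+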