
I have a dataframe which contains 4 columns.

Dataframe sample

id1 id2 id3 id4
---------------
a1  a2  a3  a4
b1  b2  b3  b4
b1  b2  b3  b4
c1  c2  c3  c4
    b2      
c1
        a3
            a4
c1
        d4

There are two types of rows: either all the columns have data, or only one column does.

I want to perform a distinct operation across all the columns such that, when comparing values between rows, only the values present in a row are compared and the null values are ignored.

The output dataframe should be

id1 id2 id3 id4
a1  a2  a3  a4
b1  b2  b3  b4
c1  c2  c3  c4
        d4

I have looked at multiple examples of UDAFs in Spark, but I'm not able to modify them accordingly.

Kaushal

2 Answers


You can use `filter` on all the columns as below:

df.filter($"id1" =!= "" && $"id2" =!= "" && $"id3" =!= "" && $"id4" =!= "")

and you should get your final dataframe.

The above code is for a static four-column dataframe. If you have more than four columns, this method becomes tedious, as you would have to write too many checks.

The solution to that would be to use a udf function as below:

import scala.collection.mutable
import org.apache.spark.sql.functions._

// keep only rows where no column is null or empty
def checkIfNull = udf((co: mutable.WrappedArray[String]) => !(co.contains(null) || co.contains("")))
df.filter(checkIfNull(array(df.columns.map(col): _*))).show(false)
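Since the comments mention that the empty columns contain `null` only (no `""` strings), a simpler sketch using Spark's built-in `DataFrameNaFunctions` should behave the same as the udf filter under that assumption:

```scala
// Sketch, assuming empties are true nulls (not "" strings):
// na.drop() removes every row that has a null in any column.
val filtered = df.na.drop()
filtered.show(false)
```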

I hope the answer is helpful.

Ramesh Maharjan
  • Thanks for the answer, but first I need to check whether the values of one row are present in other rows and only then remove it; otherwise, keep that row. See the modified question. – Kaushal Oct 19 '17 at 04:40

It is possible to take advantage of the fact that dropDuplicates is order-dependent to solve this; see the answer here. However, it is not very efficient; there should be a more efficient solution.

First remove all exact duplicates with distinct(), then iteratively order by each column and drop its duplicates. Each column is ordered in descending order so that nulls are put last.

Example with four static columns:

val df2 = df.distinct()
  .orderBy($"id1".desc).dropDuplicates("id1")
  .orderBy($"id2".desc).dropDuplicates("id2")
  .orderBy($"id3".desc).dropDuplicates("id3")
  .orderBy($"id4".desc).dropDuplicates("id4")
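For an arbitrary number of columns, the same chain can be sketched generically with a foldLeft over df.columns (this generalization is an assumption on top of the answer above, and needs `col` from `org.apache.spark.sql.functions`):

```scala
import org.apache.spark.sql.functions.col

// Sketch: apply the orderBy/dropDuplicates pass once per column,
// threading the intermediate dataframe through foldLeft.
val df2 = df.columns.foldLeft(df.distinct()) { (acc, c) =>
  acc.orderBy(col(c).desc).dropDuplicates(c)
}
```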
Shaido
  • Thanks for the answer, but first I need to check whether the values of one row are present in other rows and only then remove it; otherwise, keep that row. See the modified question. – Kaushal Oct 19 '17 at 04:40
  • And the empty columns contain `null` only, no `""` strings. – Kaushal Oct 19 '17 at 04:42
  • @Kaushal Updated with an answer that will work, however, probably not the best solution possible. – Shaido Oct 19 '17 at 05:34
  • Yes, correct, this is the way to achieve this, but I'm wondering whether we can achieve it with a single shuffle. I mean, I could write a custom distinct UDAF for the same. – Kaushal Oct 19 '17 at 07:42