
I'm using PySpark. I have a table like this:

id |  ClientNum  | Value |      Date     | Age   |   Country  |   Job
 1 |      19     |   A   |   1483695000  |  21   |    null    |   null
 2 |      19     |   A   |   1483696500  |  21   |    France  |   null
 3 |      19     |   A   |   1483697800  |  21   |    France  |  Engineer
 4 |      19     |   B   |   1483699000  |  21   |    null    |   null
 5 |      19     |   B   |   1483699500  |  21   |    France  |   null
 6 |      19     |   B   |   1483699800  |  21   |    France  |  Engineer
 7 |      24     |   C   |   1483699200  |  null |    null    |   null
 8 |      24     |   D   |   1483699560  |  28   |    Spain   |   null
 9 |      24     |   D   |   1483699840  |  28   |    Spain   |  Student

Based on the column Value, I want to keep, for each ClientNum, the distinct Values for which the most information (Age, Country, Job) is specified.

The result should look something like this:

   ClientNum  | Value |      Date     | Age   |   Country  |   Job
       19     |   A   |   1483697800  |  21   |    France  |  Engineer
       19     |   B   |   1483699800  |  21   |    France  |  Engineer
       24     |   C   |   1483699200  | null  |    null    |   null
       24     |   D   |   1483699840  |  28   |    Spain   |  Student

Thanks!

  • [Try this answer](http://stackoverflow.com/questions/38649793/how-to-get-distinct-rows-in-dataframe-using-pyspark) and [Also see this too](http://stackoverflow.com/questions/39287729/filter-rows-by-distinct-values-in-one-column-in-pyspark) – ARr0w Apr 10 '17 at 10:11
  • I can't do it with df.distinct() or df.drop_duplicates(); all rows are distinct in my example. I want to keep only the distinct values. – Omar14 Apr 10 '17 at 10:17
  • That is what those answers are about: getting the distinct values that you want to keep. – ARr0w Apr 10 '17 at 10:19
  • That solution works for 2 columns to get distinct rows. But when we have multiple columns, distinct keeps all the rows because each row is different. I want distinct only on the column Value while keeping the other columns. – Omar14 Apr 10 '17 at 10:24

3 Answers


Here's an approach using a udf to calculate the number of non-null values per row, and subsequently filter your data using Window functions:

Let's first define the udf that takes an array of columns as its argument and gives us the number of non-null values as a result.

from pyspark.sql.functions import array, udf

def nullcounter(arr):
    res = [x for x in arr if x is not None]
    return len(res)

nullcounter_udf = udf(nullcounter)
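Note that udf() called without an explicit return type gives you a string column; if you prefer the counter to stay numeric (for example for the ordering below), you can optionally declare the return type:

from pyspark.sql.types import IntegerType

nullcounter_udf = udf(nullcounter, IntegerType())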

Let's add this column to your data:

df = df.withColumn("counter", nullcounter_udf(array(df.columns)))

Now we can partition your data by ClientNum and Value, and keep the rows with the highest counter value:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['ClientNum'], df['Value']).orderBy(df['counter'].desc())

df.select('*', rank().over(window).alias('rank')) \
  .filter(col('rank') == 1) \
  .sort('Value') \
  .show() 
+---+---------+-----+----------+----+-------+--------+-------+----+
| id|ClientNum|Value|      Date| Age|Country|     Job|counter|rank|
+---+---------+-----+----------+----+-------+--------+-------+----+
|  3|       19|    A|1483697800|  21| France|Engineer|      8|   1|
|  6|       19|    B|1483699800|  21| France|Engineer|      8|   1|
|  7|       24|    C|1483699200|null|   null|    null|      5|   1|
|  9|       24|    D|1483699840|  28|  Spain| Student|      8|   1|
+---+---------+-----+----------+----+-------+--------+-------+----+

Data

df = sc.parallelize([(1, 19, "A", 1483695000, 21, None, None),
                     (2, 19, "A", 1483696500, 21, "France", None),
                     (3, 19, "A", 1483697800, 21, "France", "Engineer"),
                     (4, 19, "B", 1483699000, 21, None, None),
                     (5, 19, "B", 1483699500, 21, "France", None),
                     (6, 19, "B", 1483699800, 21, "France", "Engineer"),
                     (7, 24, "C", 1483699200, None, None, None),
                     (8, 24, "D", 1483699560, 28, "Spain", None),
                     (9, 24, "D", 1483699840, 28, "Spain", "Student")]
                    ).toDF(["id", "ClientNum", "Value", "Date", "Age", "Country", "Job"])
mtoto
  • Thanks, but with array I think all the columns must have the same type; I get: due to data type mismatch: input to function array should all be the same type. – Omar14 Apr 10 '17 at 12:17
  • It will coerce your values to string, but that's unimportant for your use case, because we only use it as an intermediate step to count the non-null values (see the casting sketch below these comments). What Spark version are you on? – mtoto Apr 10 '17 at 12:19
  • I have the same issue described by @Omar14 (pyspark 2.2.0) – Floran Gmehlin Nov 03 '17 at 14:21
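
If you run into the array type-mismatch error mentioned in the comments above, one workaround (a sketch, not part of the original answer) is to cast every column to string before building the array; the cast is only an intermediate step for counting non-null values, so the loss of type information does not matter here:

from pyspark.sql.functions import array, col

# cast all columns to string so array() sees a single, consistent type
df = df.withColumn("counter",
                   nullcounter_udf(array([col(c).cast("string") for c in df.columns])))

Casting a null still yields a null, so the counts are unchanged.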

Try this:

Your_data_frame.registerTempTable("allData") // register your DataFrame as a temp table

// find the max Date for each ClientNum and Value, then join back to the original table
sqlContext.sql("""
  select a.ClientNum, a.Value, a.Date, a.Age, a.Country, a.Job
  from allData a
  join (select ClientNum, Value, max(Date) as max_date
        from allData
        group by ClientNum, Value) b
  on a.ClientNum = b.ClientNum and a.Value = b.Value and a.Date = b.max_date
""").show
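
For reference, here is a rough PySpark equivalent of the same idea (a sketch assuming the question's DataFrame is called df). Note that it keeps the row with the latest Date per ClientNum and Value, which matches the expected output only because the most complete row also happens to be the most recent one:

from pyspark.sql.functions import max as max_

# latest Date per (ClientNum, Value)
latest = df.groupBy("ClientNum", "Value").agg(max_("Date").alias("Date"))

# keep only the rows whose Date equals the per-group maximum
df.join(latest, on=["ClientNum", "Value", "Date"]).show()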
Sanchit Grover

If, like me, you had trouble with the other answers, here is my solution in Python using a UDF (Spark 2.2.0):

Let's create a dummy dataset:

llist = [(1, 'alice', 'some_field', 'some_field', 'some_field', None), (30, 'bob', 'some_field', None, None, 10), (3, 'charles', 'some_field', None, 'some_other_field', 1111)]
df = sqlContext.createDataFrame(llist, ['id', 'name','field1','field2', 'field3', 'field4'])

df.show()

+---+-------+----------+----------+----------------+------+
| id|   name|    field1|    field2|          field3|field4|
+---+-------+----------+----------+----------------+------+
|  1|  alice|some_field|some_field|      some_field|  null|
| 30|    bob|some_field|      null|            null|    10|
|  3|charles|some_field|      null|some_other_field|  1111|
+---+-------+----------+----------+----------------+------+

Let's define our UDF for counting None values:

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import struct, udf

count_empty_columns = udf(
                        lambda row: len([x for x in row if x is None]), 
                        IntegerType()
                      )

We can add a new column null_count based on that UDF:

df = df.withColumn('null_count',
        count_empty_columns(struct([df[x] for x in df.columns])))

df.show()

+---+-------+----------+----------+----------------+------+----------+
| id|   name|    field1|    field2|          field3|field4|null_count|
+---+-------+----------+----------+----------------+------+----------+
|  1|  alice|some_field|some_field|      some_field|  null|         1|
| 30|    bob|some_field|      null|            null|    10|         2|
|  3|charles|some_field|      null|some_other_field|  1111|         1|
+---+-------+----------+----------+----------------+------+----------+

And finally, filter:

df = df.filter(df['null_count'] <= 1)
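
If you want to apply this to the original question (keep, for each ClientNum and Value, the row with the fewest nulls) rather than use a fixed threshold, you can rank by null_count inside each group. A sketch, assuming a DataFrame with the question's ClientNum, Value and Date columns plus the null_count column built as above:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

# within each (ClientNum, Value) group, put the most complete (then most recent) row first
w = Window.partitionBy("ClientNum", "Value") \
          .orderBy(col("null_count").asc(), col("Date").desc())

df.withColumn("rn", row_number().over(w)) \
  .filter(col("rn") == 1) \
  .drop("rn", "null_count") \
  .show()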
Floran Gmehlin