
I've got this dataframe:

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3),
    ('c', 'j', 4.3, 9),
], ['a', 'b', 'c', 'd'])
df1.show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|3.0|  4|
|  c|  d|7.3|  8|
|  c|  d|7.3|  2|
|  c|  d|7.3|  8|
|  e|  f|6.0|  3|
|  e|  f|6.0|  8|
|  e|  f|6.0|  3|
|  c|  j|4.2|  3|
|  c|  j|4.3|  9|
+---+---+---+---+

I did this to get the max of c for each (a, b) pair:

from pyspark.sql import functions as F

df2 = df1.groupBy('a', 'b').agg(F.max('c').alias('c_max')).select(
        F.col('a'),
        F.col('b'),
        F.col('c_max').alias('c')
    )
df2.show()
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  e|  f|6.0|
|  c|  d|7.3|
|  c|  j|4.3|
+---+---+---+

But now I need to get the corresponding values of d, which should be:

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|7.3|  8|
|  e|  f|6.0|  3|
|  c|  j|4.3|  9|
+---+---+---+---+

I tried to do an inner join between df1 and df2, but that didn't work (I get duplicated join columns and several rows per group):

condition = [df1.a == df2.a, df1.b == df2.b, df1.c == df2.c]
df3 = df1.join(df2, condition, "inner")
df3.show()
+---+---+---+---+---+---+---+
|  a|  b|  c|  d|  a|  b|  c|
+---+---+---+---+---+---+---+
|  c|  d|7.3|  8|  c|  d|7.3|
|  c|  d|7.3|  8|  c|  d|7.3|
|  c|  d|7.3|  2|  c|  d|7.3|
|  e|  f|6.0|  3|  e|  f|6.0|
|  e|  f|6.0|  8|  e|  f|6.0|
|  e|  f|6.0|  3|  e|  f|6.0|
|  c|  j|4.3|  9|  c|  j|4.3|
+---+---+---+---+---+---+---+

I'm a beginner in PySpark, so I'd appreciate a little help figuring this out.

sunny
  • For the d column, what is the logic by which "9" is chosen over "3" for row 3 (c & j)? More generally, what is the tie-breaking logic? – Emma Apr 19 '22 at 15:06

2 Answers


You can "zip" d and count of d and aggregate as usual to keep the frequency

df3 = (df1
    .groupBy('a', 'b', 'd')
    .agg(F.count('*').alias('d_count'))                      # frequency of each d within (a, b)
    .groupBy('a', 'b')
    .agg(F.max(F.array('d_count', 'd')).alias('d_freq'))     # max of the [count, d] pairs
    .select('a', 'b', F.col('d_freq')[1].alias('d'))         # keep only the d element
)

+---+---+---+
|  a|  b|  d|
+---+---+---+
|  c|  d|  8|
|  c|  j|  9|
|  e|  f|  3|
+---+---+---+
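
This works because Spark orders arrays element by element, so the max of [d_count, d] is the pair with the highest count, with the larger d winning ties. A variant of the same idea, shown here only as a sketch and not part of the original answer, uses a struct so the fields keep their names:

from pyspark.sql import functions as F

# Sketch: same most-frequent-d aggregation, but with a struct.
# Spark compares structs field by field, so the max picks the highest
# d_count first and the larger d on ties.
df3 = (df1
    .groupBy('a', 'b', 'd')
    .agg(F.count('*').alias('d_count'))
    .groupBy('a', 'b')
    .agg(F.max(F.struct('d_count', 'd')).alias('d_freq'))
    .select('a', 'b', F.col('d_freq.d').alias('d'))
)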

Now joining your df2 with this new df3 will give your desired output.

df2.join(df3, on=['a', 'b']).show()
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  c|  d|7.3|  8|
|  c|  j|4.3|  9|
|  e|  f|6.0|  3|
+---+---+---+---+
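
Note that joining with on=['a', 'b'] (a list of column names) merges the join keys, so the result does not contain the duplicated a and b columns that the expression-based join condition in the question produced.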
pltc

You can first count the frequency and assign an order value by sorting the counts in descending order. Then get the first value where the order is 1.

This does not deal with tie breaking; if there is a tie for the top frequency, it will pick one of the tied values arbitrarily (non-deterministic).

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = (df1
 # frequency of each d within (a, b)
 .withColumn('d_count', F.count('*').over(Window.partitionBy(['a', 'b', 'd'])))
 # rank d values by frequency, most frequent first
 .withColumn('d_order', F.row_number().over(Window.partitionBy(['a', 'b']).orderBy(F.desc('d_count'))))
 .groupby(['a', 'b'])
 .agg(
   F.max('c').alias('c'),
   F.first(F.when(F.col('d_order') == 1, F.col('d')), ignorenulls=True).alias('d'))
)
# df1.show()

+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
|  e|  f|6.0|  3|
|  c|  d|7.3|  8|
|  c|  j|4.3|  9|
+---+---+---+---+
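
If you want the tie break to be deterministic (for example, preferring the larger d, which matches the expected output for the (c, j) group), one option is to give the ranking window a second sort key. This is only a sketch, not part of the original answer; it starts again from the original df1, and result is just a placeholder name:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Sketch: the ranking window also orders by d descending, so a tie on
# frequency deterministically resolves to the larger d.
w_count = Window.partitionBy(['a', 'b', 'd'])
w_rank = Window.partitionBy(['a', 'b']).orderBy(F.desc('d_count'), F.desc('d'))

result = (df1
  .withColumn('d_count', F.count('*').over(w_count))
  .withColumn('d_order', F.row_number().over(w_rank))
  .groupby(['a', 'b'])
  .agg(
    F.max('c').alias('c'),
    F.first(F.when(F.col('d_order') == 1, F.col('d')), ignorenulls=True).alias('d'))
)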
Emma
  • Thank you a lot @Emma. Sorry, I'm new to Stack Overflow, so I only just saw your answer. I tested it and it does work! I think yours is the best option, since it creates only one dataframe, and it's a great way for me to understand window functions and learn how to use them. – sunny Apr 21 '22 at 09:32