I have the following dataframe:
>>> my_df.show(3)
+-------+-------+------+-----+-------+
|user_id|address|  type|count|country|
+-------+-------+------+-----+-------+
| ABC123|yyy,USA|animal|    2|    USA|
| ABC123|xxx,USA|animal|    3|    USA|
| qwerty|55A,AUS| human|    3|    AUS|
| ABC123|zzz,RSA|animal|    4|    RSA|
+-------+-------+------+-----+-------+
How do I roll up this dataframe to get the following result?
>>> new_df.show(3)
+-------+-------+------+-----+-------+
|user_id|address|  type|count|country|
+-------+-------+------+-----+-------+
| qwerty|55A,AUS| human|    3|    AUS|
| ABC123|xxx,USA|animal|    5|    USA|
+-------+-------+------+-----+-------+
For a given user_id:
- Get the country with the highest sum of counts
- For the country found in step 1, get the address with the highest count
I'm guessing I'll have to split my_df into 2 different dataframes and get the country and address separately. But I don't exactly know the syntax for that. Your help is appreciated. Thanks.