I have created a data frame which contains Year, Month, and the occurrence of incidents (count).
I want to find the month of each year had the most incident using spark SQL.
I have created a data frame which contains Year, Month, and the occurrence of incidents (count).
I want to find the month of each year had the most incident using spark SQL.
You can use window functions, if you want to use SQL:
select t.*
from (select t.*,
row_number() over (partition by year order by count desc) as seqnum
from t
) t
where seqnum = 1;
This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank()
instead of row_number()
.
You can use window functions:
select *
from (select t.*, rank() over(partition by year order by cnt desc) rn from mytable t) t
where rn = 1
For each year, this gives you the row that has the greatest cnt
. If there are ties, the query returns them.
Note that count
is a language keyword in SQL, hence not a good choice for a column name. I renamed it to cnt
in the query.