
I have created a data frame which contains Year, Month, and the occurrence of incidents (count).


I want to find, for each year, the month with the most incidents using Spark SQL.

kimhkh

2 Answers


You can use window functions, if you want to use SQL:

select t.*
from (select t.*,
             row_number() over (partition by year order by count desc) as seqnum
      from t
     ) t
where seqnum = 1;

This returns one row per year, even if there are ties for the maximum count. If you want all such rows in the event of ties, then use rank() instead of row_number().
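To see the difference between `row_number()` and `rank()` on tied counts, here is a small runnable sketch. It uses Python's built-in `sqlite3` (which also supports window functions) instead of Spark, and the column is renamed to `cnt` to sidestep the `count` keyword; the table and data are made up for illustration:

```python
import sqlite3

# In-memory SQLite table standing in for the Spark data frame (hypothetical data).
con = sqlite3.connect(":memory:")
con.execute("create table t (year int, month int, cnt int)")
con.executemany("insert into t values (?, ?, ?)",
                [(2019, 1, 10), (2019, 2, 10), (2019, 3, 5),   # tie in 2019
                 (2020, 6, 7), (2020, 7, 3)])

q = """select year, month, cnt
       from (select t.*,
                    {fn}() over (partition by year order by cnt desc) as seqnum
             from t) t
       where seqnum = 1"""

# row_number() breaks the 2019 tie arbitrarily: exactly one row per year.
one_per_year = con.execute(q.format(fn="row_number")).fetchall()
print(len(one_per_year))   # 2 rows

# rank() keeps both tied months for 2019.
all_ties = con.execute(q.format(fn="rank")).fetchall()
print(sorted(all_ties))    # 3 rows, both 2019 months included
```

The same SQL runs unchanged in Spark SQL (which supports both functions), since only the surrounding harness is SQLite-specific.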

Gordon Linoff
    I know, it's hard work to answer questions quicker than anyone else, but sometimes you should read your answer before you post it and go to the next question. – dnoeth Oct 03 '20 at 17:30

You can use window functions:

select t.*
from (select t.*,
             rank() over (partition by year order by cnt desc) as rn
      from mytable t
     ) t
where rn = 1

For each year, this gives you the row that has the greatest cnt. If there are ties, the query returns them.

Note that count is a language keyword in SQL, hence not a good choice for a column name. I renamed it to cnt in the query.

GMB
  • I am pretty new to Spark SQL, and I don't know much about window functions. Is there any way I can do it without the window function? – kimhkh Oct 03 '20 at 17:33
  • @kimhkh: there are other options, but this looks like the simplest approach here. I would recommend running the subquery first, so you can see the result and understand how it works. For example: `select t.*, rank() over(partition by year order by cnt desc) rn from mytable t order by year, rn`. – GMB Oct 03 '20 at 17:34
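For completeness, the window-function-free alternative alluded to in the comment is usually a join against a per-year aggregate: compute each year's maximum count, then join back to recover the matching month(s). A sketch, again using Python's `sqlite3` rather than Spark, with an illustrative table named `mytable`:

```python
import sqlite3

# Hypothetical data; in Spark this SQL would run against a registered temp view.
con = sqlite3.connect(":memory:")
con.execute("create table mytable (year int, month int, cnt int)")
con.executemany("insert into mytable values (?, ?, ?)",
                [(2019, 1, 10), (2019, 3, 5), (2020, 6, 7), (2020, 7, 3)])

# Aggregate the max count per year, then join back to find its month(s).
q = """select t.year, t.month, t.cnt
       from mytable t
       join (select year, max(cnt) as max_cnt
             from mytable
             group by year) m
         on t.year = m.year and t.cnt = m.max_cnt"""
rows = sorted(con.execute(q).fetchall())
print(rows)   # [(2019, 1, 10), (2020, 6, 7)]
```

Like `rank()`, this returns every tied month within a year, but it scans the table twice instead of once, which is why the window-function version is generally preferred.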