1

I have a table with a lot of cumulative columns, these columns reset to 0 at the end of each month. If I sum this data, I'll end up double counting. Instead, With Hive, I'm trying to select the max date of each month.

I've tried this:

SELECT
    yyyy_mm_dd,
    id,
    name,
    cumulative_metric1,
    cumulative_metric2
FROM
    mytable

WHERE
    yyyy_mm_dd = last_day(yyyy_mm_dd)

mytable has daily data from the start of the year. In the output of the above, I only see the last date for January but not February. How can I select the last day of each month?

stackq
  • 491
  • 2
  • 15

2 Answers2

1

February is not over yet. Perhaps a window function does what you want:

SELECT yyyy_mm_dd, id, name, cumulative_metric1, cumulative_metric2
FROM (SELECT t.*,
             MAX(yyyy_mm_dd) OVER (PARTITION BY last_day(yyyy_mm_dd)) as last_yyyy_mm_dd
      FROM mytable t
     ) t
WHERE yyyy_mm_dd = last_yyyy_mm_dd;

This calculates the last day in the data.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Should the inner table be aliased to `t` also? Also, this is a partitioned table so the inner table would also need `yyyy_mm_dd` in a `where` condition as that's the partitioned column. – stackq Feb 10 '20 at 13:28
0

use correlated subquery and date to month function in hive

SELECT
    yyyy_mm_dd,
    id,
    name,
    cumulative_metric1,
    cumulative_metric2
FROM
    mytable t1

WHERE
    yyyy_mm_dd = select max(yyyy_mm_dd) from mytable t2 where
     month(t1.yyyy_mm_dd)= month(t2.yyyy_mm_dd)
Zaynul Abadin Tuhin
  • 31,407
  • 5
  • 33
  • 63