HiveQL - first_value of multiple columns over window

Question

I am looking to retrieve the first row and last row over a window in HiveQL.
I know there are a couple ways to do this:

Use FIRST_VALUE and LAST_VALUE on the columns I am interested in.

SELECT customer,
FIRST_VALUE(product) over (W),
FIRST_VALUE(time) over (W),
LAST_VALUE(product) over (W),
LAST_VALUE(time) over (W)
FROM table
WINDOW W AS (PARTITION BY customer ORDER BY COST)

Calculate ROW_NUMBER() of each row and use a where clause for row_number=1.

WITH table_wRN AS  
(
    SELECT *,
    row_number() over (partition by customer order by cost ASC) rn_B,
    row_number() over (partition by customer order by cost DESC) rn_E
    FROM table 
),
table_first_last AS
(
SELECT *
FROM table_wRN 
WHERE (rn_E=1 OR rn_B=1)
)

SELECT table_first.customer,
table_first.product, table_first.time,
table_last.product, table_last.time
FROM table_first_last as table_first WHERE table_first_last.rn_B=1
JOIN table_first_last as table_last WHERE table_first_last.rn_E=1
ON table_first.customer = table_last.customer

My questions:

Does anyone know which of these two is more efficient?
- Intuitively, I think the first one should be faster because there is no need for a sub-query or a CTE.
- Experimentally, I feel the second is faster but this could be because I am running first_value on a number of columns.
Is there a way to apply first_value and retrieve multiple columns in one shot.
- I am looking to reduce the number of times the windowing is done / evaluated (something like cache the window)
- Example of pseudo-code:
  - FIRST_VALUE(product,time) OVER (W) AS product_first, time_first

Thank you!

What you mean by "experimentally"? Did you actually test the two approaches? Please post some sample data (ddl/dml statements) and output so we know what exactly you're trying to achieve — Radagast, Mar 10 '20 at 18:59
Good question - all I meant by "experimentally" is a HiveQL run on a Hadoop database, and it just locks up on the last reduce phase. After over 12 hours, I got nothing. — penguin, Mar 10 '20 at 19:46

score 1 · Accepted Answer · answered Mar 10 '20 at 19:02

I am almost certain that the first would be more efficient. I mean two window functions versus two window functions, filtering and two joins?

Once you multiply the number of columns, then there might be an issue of which is faster. That said, look at the execution plan. I would expect that all window functions using the same window frame specification would use the same "windows" processing, with just tweaks for each value.

Hive does not have very good support for complex data types such as strings and arrays. In databases that do, it is easy enough to provide a complex type.

HiveQL - first_value of multiple columns over window

1 Answers1