I am looking to retrieve the first row and last row over a window in HiveQL.
I know there are a couple ways to do this:
- Use FIRST_VALUE and LAST_VALUE on the columns I am interested in.
SELECT customer, FIRST_VALUE(product) over (W), FIRST_VALUE(time) over (W), LAST_VALUE(product) over (W), LAST_VALUE(time) over (W) FROM table WINDOW W AS (PARTITION BY customer ORDER BY COST)
- Calculate ROW_NUMBER() of each row and use a where clause for row_number=1.
WITH table_wRN AS
(
SELECT *,
row_number() over (partition by customer order by cost ASC) rn_B,
row_number() over (partition by customer order by cost DESC) rn_E
FROM table
),
table_first_last AS
(
SELECT *
FROM table_wRN
WHERE (rn_E=1 OR rn_B=1)
)
SELECT table_first.customer,
table_first.product, table_first.time,
table_last.product, table_last.time
FROM table_first_last as table_first WHERE table_first_last.rn_B=1
JOIN table_first_last as table_last WHERE table_first_last.rn_E=1
ON table_first.customer = table_last.customer
My questions:
- Does anyone know which of these two is more efficient?
- Intuitively, I think the first one should be faster because there is no need for a sub-query or a CTE.
- Experimentally, I feel the second is faster but this could be because I am running first_value on a number of columns.
- Is there a way to apply first_value and retrieve multiple columns in one shot.
- I am looking to reduce the number of times the windowing is done / evaluated (something like cache the window)
- Example of pseudo-code:
FIRST_VALUE(product,time) OVER (W) AS product_first, time_first
Thank you!