This can be accomplished using a series of self-joins to find other rooms in the same category before combining the results into 2 maps.
Code
CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;
SELECT
customer_rooms.customer,
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
customer_rooms.customer = customer_category_rooms.customer AND
customer_rooms.category = customer_category_rooms.category AND
customer_rooms.room = customer_category_rooms.room AND
customer_rooms.date = customer_category_rooms.date
)
OR (
customer_category_rooms.customer IS NULL AND
customer_category_rooms.category IS NULL AND
customer_category_rooms.room IS NULL AND
customer_category_rooms.date IS NULL
)
GROUP BY
customer_rooms.customer
;
Result Set
1 {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2 {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3 {"ee":"d6"} {"ee":"d6"}
Explanation
FROM `table` AS customer_rooms
First, results are drawn from the initial table
. We name this relation customer_rooms
. As you already noted in the question, this much is sufficient to build map_customer_room_date
.
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
The first self-join identifies all rooms that have the same categories as the rooms explicitly mentioned in the customer_rooms
rows. We name this relation category_rooms
.
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
The second self-join takes the rooms that we identified in category_rooms
and attempts to find if this room is already held by the customer identified in customer_rooms
. We name this relation customer_category_rooms
. This is a LEFT OUTER JOIN
, because we want to preserve all rows from the prior joins. The outcome will either be 1) the values from customer_rooms
and customer_category_rooms
are identical, because the customer already holds this room, or 2) the values from customer_category_rooms
will be all NULL
, because the customer doesn't hold this room, but it's a room in one of the same categories. This distinction will become important so that we can preserve the date
of the customer if they already hold the room.
Next, we need to filter.
WHERE (
customer_rooms.customer = customer_category_rooms.customer AND
customer_rooms.category = customer_category_rooms.category AND
customer_rooms.room = customer_category_rooms.room AND
customer_rooms.date = customer_category_rooms.date
)
This includes the rooms that are explicitly held by the customer in the original table
.
OR (
customer_category_rooms.customer IS NULL AND
customer_category_rooms.category IS NULL AND
customer_category_rooms.room IS NULL AND
customer_category_rooms.date IS NULL
)
This includes the rooms that are not held by the customer but belong in the same categories as rooms held by the customer.
collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
map_customer_room_date
can be built by collecting the original data from the table, which we had aliased as customer_rooms
.
collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
Building map_category_room_date
is more complex. If the customer holds the room explicitly, then we want to preserve that date
. However, if the customer does not hold the room explicitly, then we want to be able to use the room
and date
from another row that has an overlapping category. To accomplish that, we use the Hive COALESCE function to choose the first value that is not NULL
. If the customer already holds the room (as present in non-NULL
values in customer_category_rooms
), then we'll use that. If not, then we'll use values from category_rooms
instead.
Note that there can still be some ambiguity if the same category/room combination can map to multiple date
values. If that is significant, then you might need to put more work into this to choose the right date
based on some business rule (e.g. use the soonest date
) or map to multiple date
values instead of a single value. This should give you a good starting point though if there are additional requirements like that.