3

I have a table that looks like this:

|customer|category|room|date|
-----------------------------
|1       |   A    | aa | d1 |
|1       |   A    | bb | d2 |
|1       |   B    | cc | d3 |
|1       |   C    | aa | d1 |
|1       |   C    | bb | d2 |
|2       |   A    | aa | d3 |
|2       |   A    | bb | d4 |
|2       |   C    | bb | d4 |
|2       |   C    | ee | d5 |
|3       |   D    | ee | d6 |

I want to create two maps out of the table:

1st. map_customer_room_date: will group by customer and collect all different rooms (key) and with date (value).

I'm using the collect() UDF Brickhouse function.

this can be archived with something similar as:

select customer, collect(room,date) as map_customer_room_date
from table
group by customer

2nd. map_category_room_date A bit more complicated, consists also of the same map type collect(room, date) and it will contain as keys, all the rooms across ALL categories where for customer X is categories. This means that for customer1 it will take room ee even though it belongs to customer2. This is because customer1 has category C and this category is also present in customer 2.

The final table is grouped by customer and will look like:

|customer| map_customer_room_date  |     map_category_room_date    |
-------------------------------------------------------------------|
|   1    |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d6}|
|   2    |{aa: d3, bb: d4, ee: d6} |{aa: d3, bb: d4, ee: d6}       |
|   3    |{ee: d6}                 |{ee: d6}                       |  

I am having issues building the second map and presenting the final table as described. Any idea how this can be accomplished?

Chris Nauroth
  • 9,614
  • 1
  • 35
  • 39
Lou_Ds
  • 531
  • 2
  • 11
  • 23

1 Answers1

1

This can be accomplished using a series of self-joins to find other rooms in the same category before combining the results into 2 maps.

Code

CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;


SELECT
    customer_rooms.customer,
    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
GROUP BY
    customer_rooms.customer
;

Result Set

1   {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2   {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3   {"ee":"d6"} {"ee":"d6"}

Explanation

FROM `table` AS customer_rooms

First, results are drawn from the initial table. We name this relation customer_rooms. As you already noted in the question, this much is sufficient to build map_customer_room_date.

JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category

The first self-join identifies all rooms that have the same categories as the rooms explicitly mentioned in the customer_rooms rows. We name this relation category_rooms.

LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room

The second self-join takes the rooms that we identified in category_rooms and attempts to find if this room is already held by the customer identified in customer_rooms. We name this relation customer_category_rooms. This is a LEFT OUTER JOIN, because we want to preserve all rows from the prior joins. The outcome will either be 1) the values from customer_rooms and customer_category_rooms are identical, because the customer already holds this room, or 2) the values from customer_category_rooms will be all NULL, because the customer doesn't hold this room, but it's a room in one of the same categories. This distinction will become important so that we can preserve the date of the customer if they already hold the room.

Next, we need to filter.

WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)

This includes the rooms that are explicitly held by the customer in the original table.

OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)

This includes the rooms that are not held by the customer but belong in the same categories as rooms held by the customer.

    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,

map_customer_room_date can be built by collecting the original data from the table, which we had aliased as customer_rooms.

    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date

Building map_category_room_date is more complex. If the customer holds the room explicitly, then we want to preserve that date. However, if the customer does not hold the room explicitly, then we want to be able to use the room and date from another row that has an overlapping category. To accomplish that, we use the Hive COALESCE function to choose the first value that is not NULL. If the customer already holds the room (as present in non-NULL values in customer_category_rooms), then we'll use that. If not, then we'll use values from category_rooms instead.

Note that there can still be some ambiguity if the same category/room combination can map to multiple date values. If that is significant, then you might need to put more work into this to choose the right date based on some business rule (e.g. use the soonest date) or map to multiple date values instead of a single value. This should give you a good starting point though if there are additional requirements like that.

Chris Nauroth
  • 9,614
  • 1
  • 35
  • 39