
I have a Hive table with the columns user_id and item_id (the id of an item purchased by the user). I want to get a list of all the users who purchased item 1 but not items 2 and 3.

To do this I wrote the simple query:

SELECT user_id, collect_set(item_id) itemslist FROM mytable
WHERE item_id in (1, 2)
GROUP BY user_id
HAVING -- what should I put here???

As you can see, I don't know how to check whether the array itemslist contains 1 and not 2.

How do you do this? If there is a more efficient way, could you please show me both (or more) methods?

dlamblin
lucacerone

1 Answer


Hive has some collection functions (see the collection functions section here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) that you can use here.

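You can use the array_contains(Array<T>, value) function to check that item 1 is present, and the size(Array<T>) function to make sure the array has length 1. Because the WHERE clause keeps only items 1 and 2 and collect_set removes duplicates, an array of size 1 that contains 1 means the user bought item 1 but not item 2. If both conditions are satisfied, you will get the desired output.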
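As a rough sketch of how those two checks might fit into the query from the question (assuming item_id is an integer column, and repeating the aggregate in HAVING rather than relying on the itemslist alias, since alias support there can vary between Hive versions). The second query is one possible array-free alternative using conditional sums; it is not from the answer above and is only an illustration.

```sql
-- Variant 1: array-based check, following the answer's suggestion.
-- Keeps users whose set of purchased items (restricted to 1 and 2) is exactly {1}.
SELECT user_id, collect_set(item_id) AS itemslist
FROM mytable
WHERE item_id IN (1, 2)
GROUP BY user_id
HAVING array_contains(collect_set(item_id), 1)
   AND size(collect_set(item_id)) = 1;

-- Variant 2 (assumed alternative): no array is built at all.
-- Counts the rows for each item per user and requires
-- at least one purchase of item 1 and none of item 2.
SELECT user_id
FROM mytable
WHERE item_id IN (1, 2)
GROUP BY user_id
HAVING sum(CASE WHEN item_id = 1 THEN 1 ELSE 0 END) > 0
   AND sum(CASE WHEN item_id = 2 THEN 1 ELSE 0 END) = 0;
```

To also exclude item 3, as the question's first paragraph mentions, the same pattern should extend by adding 3 to the IN list (and, in the second variant, one more sum condition equal to 0).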

Hubbitus
Amar
  • What if I wanted to display the item found in the select statement? How would I go about that? – Augmented Jacob Dec 27 '16 at 23:03
  • array_contains does not work with a regex pattern, for instance if I want to check whether any item in the array matches foo.*. Do you have any suggestions? – Reihan_amn Feb 15 '18 at 00:11