This is First table in Hive- It contains information about the item we are purchasing.
CREATE EXTERNAL TABLE IF NOT EXISTS Table1 (This is the MAIN table through which comparisons need to be made)
(
ITEM_ID BIGINT,
CREATED_TIME STRING,
BUYER_ID BIGINT
)
And this is the data in the above first table
**ITEM_ID** **CREATED_TIME** **BUYER_ID**
220003038067 2012-06-21 1015826235
300003861266 2012-06-21 1015826235
140002997245 2012-06-14 1015826235
200002448035 2012-06-08 1015826235
260003553381 2012-06-07 1015826235
This is Second table in Hive- It also contains information about the items we are purchasing.
CREATE EXTERNAL TABLE IF NOT EXISTS Table2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)
And this is the data in the above table-
**USER_ID** **PURCHASED_ITEM**
1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"}, {"product_id":300003861266,"timestamps":"1340271857000"}, {"product_id":140002997245,"timestamps":"1339694926000"}, {"product_id":200002448035,"timestamps":"1339172659000"}, {"product_id":260003553381,"timestamps":"1339072514000"}]
I have reduced the data to only one BUYER_ID(USER_ID) to make the problem simple to understand.
Problem Statement-
I need to compare the Table2
with Table1
, which means I need to see if USER_ID
from Table2
and BUYER_ID
from Table1
(as they both are same thing) gets matched, then PURCHASED_ITEM
in Table2 which is an Array of PRODUCT_ID(same as ITEM_ID) and TIMESTAMPS(same as CREATED_TIME) should be same as ITEM_ID
and CREATED_TIME
in Table1
for that particular USER_ID(BUYER_ID) and also sometimes it is possible that they (means PURCHASED_ITEM
and ITEM_ID
, CREATED_TIME
) are not same or some PRODUCT_ID and TIMESTAMPS is missing from Table2
after comparing from Table1
.
By this I mean the count of PRODUCT_ID
and TIMESTAMPS
in Table2
should be same as count of ITEM_ID
and CREATED_TIME
in Table1 for that particular BUYER_ID(USER_ID) and the content should be same. If they are not same or entry is missing from Table2
, then I need to print the result, this particular ITEM_ID
and CREATED_TIME
is missing from Table2
or the PRODUCT_ID
and TIMESTAMPS
are not same after comparing from Table1
.
So for example in Table1 currently for this BUYER_ID 1015826235
I have 5 ITEM_ID
and 5 CREATED_TIME
, so in Table2 I should have 5 PRODUCT_ID
and 5 TIMESTAMPS
exactly same as Table1 for same USER_ID(BUYER_ID)
in one row. If it is not same or entry is missing then I need to print the result showing this is missing or this data is wrong.
So just to make it more clear-
PURCHASED_ITEM
is an array of Struct in Table2
and it contains two things PRODUCT_ID
and TIMESTAMPS
.
If USER_ID
and BUYER_ID
gets matched then PRODUCT_ID
in Table2
should be matched with ITEM_ID
in Table1
and TIMESTAMPS
in Table2
should be matched with CREATED_TIME
in Table1
.
UPDATED
HiveQL SQL Query Question:-
Q 1) Find all USER_ID from Table2 whose PRODUCT_ID or TIMESTAMP are not same with
ITEM_ID or CREATED_TIME after comparing with Table1 on BUYER_ID.
Query that I wrote for first question. Is the query right?
A 1) select Table2.user_id from Table2 where Table1.user_id = Table2.buyer_id
and (Table1.item_id <> Table2.product_id or UNIX_TIMESTAMP(Table1.created_time) <>
Table2.timestamps)
Q 2) Find the `BUYER_ID(USER_ID)` and as well as those `ITEM_ID` and `CREATED_TIME`
which are missing from `Table2` after comparing from `Table1` on `BUYER_ID`.
A 2) Not sure.