10

This is First table in Hive- It contains information about the item we are purchasing.

CREATE EXTERNAL TABLE IF NOT EXISTS Table1   (This is the MAIN table through which comparisons need to be made)
(
ITEM_ID BIGINT,
CREATED_TIME STRING,
BUYER_ID BIGINT
)

And this is the data in the above first table

**ITEM_ID**   **CREATED_TIME**    **BUYER_ID**
220003038067     2012-06-21        1015826235
300003861266     2012-06-21        1015826235
140002997245     2012-06-14        1015826235
200002448035     2012-06-08        1015826235
260003553381     2012-06-07        1015826235

This is Second table in Hive- It also contains information about the items we are purchasing.

CREATE EXTERNAL TABLE IF NOT EXISTS Table2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
)

And this is the data in the above table-

**USER_ID**    **PURCHASED_ITEM**
1015826235     [{"product_id":220003038067,"timestamps":"1340321132000"},    {"product_id":300003861266,"timestamps":"1340271857000"},    {"product_id":140002997245,"timestamps":"1339694926000"},    {"product_id":200002448035,"timestamps":"1339172659000"},    {"product_id":260003553381,"timestamps":"1339072514000"}]

I have reduced the data to only one BUYER_ID(USER_ID) to make the problem simple to understand.

Problem Statement-

I need to compare the Table2 with Table1, which means I need to see if USER_ID from Table2 and BUYER_ID from Table1 (as they both are same thing) gets matched, then PURCHASED_ITEM in Table2 which is an Array of PRODUCT_ID(same as ITEM_ID) and TIMESTAMPS(same as CREATED_TIME) should be same as ITEM_ID and CREATED_TIME in Table1 for that particular USER_ID(BUYER_ID) and also sometimes it is possible that they (means PURCHASED_ITEM and ITEM_ID, CREATED_TIME) are not same or some PRODUCT_ID and TIMESTAMPS is missing from Table2 after comparing from Table1.

By this I mean the count of PRODUCT_ID and TIMESTAMPS in Table2 should be same as count of ITEM_ID and CREATED_TIME in Table1 for that particular BUYER_ID(USER_ID) and the content should be same. If they are not same or entry is missing from Table2, then I need to print the result, this particular ITEM_ID and CREATED_TIME is missing from Table2 or the PRODUCT_ID and TIMESTAMPS are not same after comparing from Table1.

So for example in Table1 currently for this BUYER_ID 1015826235 I have 5 ITEM_ID and 5 CREATED_TIME, so in Table2 I should have 5 PRODUCT_ID and 5 TIMESTAMPS exactly same as Table1 for same USER_ID(BUYER_ID) in one row. If it is not same or entry is missing then I need to print the result showing this is missing or this data is wrong.

So just to make it more clear-

PURCHASED_ITEM is an array of Struct in Table2 and it contains two things PRODUCT_ID and TIMESTAMPS.

If USER_ID and BUYER_ID gets matched then PRODUCT_ID in Table2 should be matched with ITEM_ID in Table1 and TIMESTAMPS in Table2 should be matched with CREATED_TIME in Table1.

UPDATED

HiveQL SQL Query Question:-

Q 1) Find all USER_ID from Table2 whose PRODUCT_ID or TIMESTAMP are not same with
ITEM_ID or CREATED_TIME after comparing with Table1 on BUYER_ID.

Query that I wrote for first question. Is the query right?

A 1) select Table2.user_id from Table2 where Table1.user_id = Table2.buyer_id
and (Table1.item_id <> Table2.product_id or UNIX_TIMESTAMP(Table1.created_time) <>
Table2.timestamps)


Q 2) Find the `BUYER_ID(USER_ID)` and as well as those `ITEM_ID` and `CREATED_TIME`
which are missing from `Table2` after comparing from `Table1` on `BUYER_ID`.

A 2) Not sure.
AKIWEB
  • 19,008
  • 67
  • 180
  • 294

2 Answers2

1

I'm not familiar with Hive, but I'd suggest that you create a temporary table with the same schema as Table1, and fill it with Table2 data (with timestamp conversion). This could eventually be a view, if supported.

Comparing the content of two tables is then possible with queries such as:

SELECT * FROM Table1 WHERE (ITEM_ID,CREATED_TIME,BUYER_ID) NOT IN (SELECT * FROM Table2bis)

SELECT * FROM Table2bis WHERE (ITEM_ID,CREATED_TIME,BUYER_ID) NOT IN (SELECT * FROM Table1)
tera
  • 46
  • 3
1

I suggest you not to use "string" data type for your CREATED_TIME and timestamp because it makes comparisons Harder. Instead of that use Date or TimeStamp.

And for your question: I think the big problem here is using strings alone!

I'm oracle user but there should be something like this in Hive:

To_date({string},{Format})

as you used

UNIX_TIMESTAMP({string})

Another thing: when you have strucs, you shouls address fields like this: Table2.PURCHASED_ITEM[{address}].product_id and not Table2.product_id which is unknown.

and one more suggestion:

Trunc({Date},{Format ex: 'SS' for sseconds})

when your CREATED_TIME and your time_stamp are not exactly in the same time ticks(may be 0.001 seconds difference because of difference insert time if you insert Now or Sysdate for each of them) you better truncate the date to seconds or Milli-seconds or whatever you think is better.

One more thing: Use NVL() or Convert null values here as well, becuase if you have such problems it is also possible to have null values in your table which causes problems in your queries, NVL() function will convert null to something you like.

Hope this helps.

hmmftg
  • 1,274
  • 1
  • 18
  • 31