Hive external table - Select unmatched record

Question

I have created two Hive external tables (SQL query will work) which points to location where I need to compare two output.

I need to compare both tables and select unmatched records.

TableA

id   sdate   edate  tag

S1 20180610 20180611 0

S2 20180610 20180612 0

S3 20180612 20180613 0

S5 20180612 20180613 1

TableB

id  sdate    edate  tag

S1 20180610 20180611 0

S2 20180611 20180612 0

S3 20180612 20180613 1

S4 20180612 20180613 1

Required output

S3 20180612 20180613 0

S5 20180612 20180613 1

S4 20180612 20180613 1

Tried to write query by joining two tables but did'nt work for me.

Appreciate help on this

Thanks :)

you should specify what fields you are using for matching ... is not clear why S3 is part of the result, are you using all the fields? — hlagos, Jun 15 '18 at 17:19
Sorry for not posting my question correctly. I need to compare all fields. Thats my challenge. S3 should select because tag value is different from table A. — Pranav Chembath, Jun 15 '18 at 18:03
Please provide what have you tried and what exactly didn't work. — leftjoin, Jun 17 '18 at 06:52

score 0 · Answer 1 · answered Jun 15 '18 at 19:06

0

This query will help you to identify the records in an efficient way

create table unmatched as 
select 
a.*
from tableA as a left join (select *, true as flag from tableB) as b on
a.id=b.id a.sdate=b.sdate a.edate=b.edate a.tag=b.tag
where b.flag is null --this will get records in tableA but not in table B
union all 
select 
b.*
from tableB as b left join (select *, true as flag from tableA) as a on
a.id=b.id a.sdate=b.sdate a.edate=b.edate a.tag=b.tag
where a.flag is null --this will get records in tableB but not in table A
;

You can perform this using a full join but it will be much more inefficient

answered Jun 15 '18 at 19:06

hlagos

7,690
3
23
41

FULL join is the same as 'A left join B UNION ALL B left join A' . Why do you think FULL join is inefficient – leftjoin Jun 17 '18 at 06:55
good question! I think I should have added more context. Using left join will allow you create easily some optimizations depending on the distribution of the data. For example if your keys are skewed distributed you could run a (select distinct keys) on the subquery, you will not be able to perform this in a single full join. – hlagos Jun 18 '18 at 03:15

score 0 · Answer 2 · answered Jun 17 '18 at 16:18

We can easily do this using following query.

Note that I am not sure why did you eliminate s2 from the output as it is clearly different in two tables.

Also, if you want to find different records in both tables then S3 will appear twice as flag value is different in both cases.

You can modify the following query and get your results as per your need. As we are joining these tables only once this has much better performance than joining it twice.

select distinct
case when a.id is not null then a.id else b.id end as id,
case when a.sdate is not null then a.sdate else b.sdate end as sdate,
case when a.edate is not null then a.edate else b.edate end as edate,
case when a.tag is not null then a.tag else b.tag end as tag,
case when a.id is not null then 'table1' else 'table2'  end as tb_id
from table1 a
full join table2 b
on a.id=b.id 
and a.sdate=b.sdate 
and a.edate=b.edate 
and a.tag=b.tag
where (a.id is null
and a.sdate is null
and a.edate is null
and a.tag is null) 
or (b.id is null
and b.sdate is null
and b.edate is null
and b.tag is null)

Man · Answer 3 · 2018-06-18T02:39:38.780

-1

select * from (select * from tableA
union DISTINCT  
select * from tableB) as finalTable
where id not in (select * from tableA t1 join tableB t2
              on t1.is=t2.id and t1.sdate=t2.sdate and t1.edate=t2.edate and t1.tag=t2.tag);

first union DISTINCT row and make finalTable . it has all unique row.

then make inner join between two table.

final subtract them ,now you got your answer.

exmaple :

if you subtract first to second then you got [1,4] which is you want

edited Jun 18 '18 at 02:39

answered Jun 17 '18 at 17:27

Man

742
1
6
23

not in require a full cross product and is highly expensive in mapreduce if you cannot perform map joins .. – hlagos Jun 18 '18 at 03:15
i am only using sql not hive so sql query like that – Man Jun 18 '18 at 03:33

Hive external table - Select unmatched record

TableA

TableB

Required output

3 Answers3