comparing two sources columns values in bigquery

Question

I have a data set as following. To keep it simple I am showing only one column i.e qty but there are multiples for example (dates, prices etc). if any of the value within a single row is a mismatch then that row should show up as output along with the values from both sources, for example:

source 1

id  qty
1   100
2   null
3   0
4   50

source 2

id_ qty_
1   100
2   80
3   100

expected output:

id  qty       id_   qty_   
2   null      2     80
3   0         3     100
4   50        null  null

What i am trying to achieve is to fetch all the rows from source 2 that doesn't match with the source 1. Source 1 is taken as main source/table.

In the end the output should show only row 2,3,4 with both values so that i can build a boolean saying:

if (qty != qty_ , false, true) as is_qty_matching

An example with multiple values:

source 1

id  qty  price 
1   100   10
2   null  0
3   0     0
4   50    0
5   100   30

source 2

id_ qty_ price 
1   100  10
2   80   0
3   100  0
5   100  33

expected output:

    id  qty   price     id_     qty_   price_
    2   null    0         2     80     0
    3   0       0         3     100    0
    4   50      0         null  null   0
    5   100    30         5     100    33

In this case now id 5 has price mismatch

so my two new columns (is_price_matching) would show false for id 5 while true for other id', while the column (is_qty_matching) will show true for id 5 and false for others.

Similarly i will have multiple other columns and therefore multiple other boolean fields for each of the mismatch entry for example is_date_matching, is_time_matching etc.

what's important is that:

The values can be null in either of source for any of the column, this might mess up the boolean maybe
only the rows that has even one single mismatch value should show up in the final output, if there is no mismatch in any of the column values it should not show up

I have tried the following query:

with main as (
select 1 as id , 100 as qty , 50 as price
union all
select 2 as id , 0 as qty , 100 as price
union all
select 3 as id , 0 as qty ,80 as price
union all 
select 4 as id , 50 as qty , 90 as price
union all
select 5 as id , 20 as qty , 100 as price
union all
select 6 as id , 20 as qty , 100 as price
),
main2 as (
select 1 as id_, 100 as qty_ , 50 as price_
union all
select 2 as id_, 80 as qty_ , 100 as price_
union all
select 3 as id_, 100 as qty_ , 80 as price_
union all
select 5 as id_, 20 as qty_ , 100 as price_
union all
select 6 as id_, 20 as qty_ , 40 as price_
)
select 
main.*,
main2.*
from main 
left join main2 
on (main.id = main2.id_) 
WHERE coalesce(qty,0) != coalesce(qty_,0)
or coalesce(main.price,0) != coalesce(main2.price_,0)

it seems to work but i was wondering if there is a better solution ? plus if i add the following line of code

if(qty != qty_ , true , false) as is_qty_mismatch

it will return incorrect output where the values are null

score 1 · Accepted Answer · answered Oct 27 '22 at 04:03

1

Consider below approach

select *
from source1 s1
full outer join source2 s2
on id = id_
where format('%t', s1) != format('%t', s2)

if applied to one of sample data in your question - output is

answered Oct 27 '22 at 04:03

Mikhail Berlyant

165,386
8
154
230

thanks alot I have two questions, would this work if I am also comparing dates / timestamps from two sources ? and the second question is how does the line `format('%t', s1)` compares the columns and gives the correct results, can you please explain the logic :) – trillion Nov 08 '22 at 10:23
1

`format('%t', s1)` is used to serialize the whole row. so `where format('%t', s1) != format('%t', s2)` allows to not to join same row on itself – Mikhail Berlyant Nov 08 '22 at 19:32
somehow the code runs into unqualified reference `'s1' found in select with more than one referenced table/view'` and `'s2' found in select with more than one referenced table/view` . Do you know how can i make the code look better ? we are using sqlfluff which we can't remove for coding perspective. Is there any alternative way to modify the above code ? – trillion Nov 17 '22 at 15:55
`,concat_s1 as ( select * , format('%t',s1) as l1 from s1 ) ,concat_s2 as ( select * , format('%t',s2) as l2 from s2 ) select source1.* ,source2.* from concat_s1 as source1 full outer join concat_s2 as source2 on id= id_ where source1.l1 != source2.l2` - this should produce the same results as above, but i can see some rows are missing – trillion Nov 17 '22 at 17:02
I have added the question seprately here, maybe you can help :) https://stackoverflow.com/questions/74479628/bigquery-code-unexpected-results-with-code-formatting – trillion Nov 17 '22 at 17:25

score 0 · Answer 2 · answered Oct 26 '22 at 10:42

0

There are 2 things that your query doesn't account for.

The second table having ID that's not present in 1st table
You can't compare null to other values (it gives you null).

To fix the best problem changing join from left to full should work (row 31), for the second wrapping your logic with ifnull function. Also if the output for if functions are boolean values, than you can skip the function and keep just logical statement (rows 28 and 29)

with main as (
  select 1 as id , 100 as qty , 50 as price
  union all
  select 2 as id , 0 as qty , 100 as price
  union all
  select 3 as id , 0 as qty ,80 as price
  union all 
  select 4 as id , 50 as qty , 90 as price
  union all
  select 5 as id , 20 as qty , 100 as price
  union all
  select 6 as id , 20 as qty , 100 as price
),
main2 as (
  select 1 as id_, 100 as qty_ , 50 as price_
  union all
  select 2 as id_, 80 as qty_ , 100 as price_
  union all
  select 3 as id_, 100 as qty_ , 80 as price_
  union all
  select 5 as id_, 20 as qty_ , 100 as price_
  union all
  select 6 as id_, 20 as qty_ , 40 as price_
)
select 
  main.*,
  main2.*, 
  ifnull(qty != qty_ , true) as is_qty_mismatch, 
  ifnull(price != price_ , true) as is_qty_mismatch, 
from main 
full join main2 
  on (main.id = main2.id_) 
where coalesce(qty,0) != coalesce(qty_,0)
  or coalesce(main.price,0) != coalesce(main2.price_,0)

answered Oct 26 '22 at 10:42

AlienDeg

1,288
1
13
23

full join would not create any duplicates ? and is there a way to get rid of this coalesce logic as well ? – trillion Oct 26 '22 at 10:54
plus using `and` intstead of `where` would also not work fine ? – trillion Oct 26 '22 at 10:57
Definitely won't create duplicates. As I said, with left join you by default dropping all the rows from table2 not present in table 1. `and` won't work with full join. If quantity is null, does it count as mismatch? If table 1 is the main source, why it has nulls? – AlienDeg Oct 26 '22 at 11:10
With full join, if a row exists in table 2 and not in table 1, then it will show up as well ? table 1 is more reliable hence taken as the main table and yes if it's null or values are different should be counted as a mismatch – trillion Oct 26 '22 at 11:16
In addition to the above comment, i am not sure what would be the difference between the left join and full join here in this case ? , as my original code returns the same rows as the one with the left join – trillion Oct 26 '22 at 11:58
try to add `union all select 7 as id_, 20 as qty_ , 50 as price_` to the second table and it will show you the difference between left and full join. – AlienDeg Oct 26 '22 at 13:37
yes, so i see with the full join I would also get the rows that are not present in the table 2, right ? what If i am only interested in the rows from table 1 meaning even if i add `union all select 7 as id_, 20 as qty_ , 50 as price_` this to the second table, this row should not show up – trillion Nov 02 '22 at 08:23

Jaytiger · Answer 3 · 2022-10-26T13:16:46.380

0

Below might be an another option,

SELECT * FROM (
  SELECT * FROM (SELECT *, '1' src FROM main EXCEPT DISTINCT SELECT *, '1' src FROM main2)
   UNION ALL
  SELECT * FROM (SELECT *, '2' src FROM main2 EXCEPT DISTINCT SELECT *, '2' src FROM main) 
) PIVOT (ANY_VALUE(STRUCT(qty, price)) main FOR src IN ('1', '2'));

EXCEPT DISTINCT

The EXCEPT operator returns rows from the left input query that are not present in the right input query.

Key is how to remove duplicated rows in both tables, you can consider below approach as well using an analytic function.

SELECT * FROM (
  SELECT * FROM (
    SELECT *, '1' src FROM main UNION ALL SELECT *, '2' FROM main2
  ) QUALIFY COUNT(1) OVER (PARTITION BY id, qty, price) = 1
) PIVOT (ANY_VALUE(STRUCT(qty, price)) main FOR src IN ('1', '2'));
-- output will be same as above

edited Oct 26 '22 at 13:16

answered Oct 26 '22 at 12:26

Jaytiger

11,626
2
5
15

could you explain each step as well ? – trillion Oct 26 '22 at 12:33
first **except** in subquery will return rows which only exist in left table (**main**) and second **except** will return ones which exist in right table (**main1**). And if you combine both results in one table, you'll get rows exists in only one table. – Jaytiger Oct 26 '22 at 12:47
and **pivot** query converts rows with same id into columns with **src** postfix in same row. – Jaytiger Oct 26 '22 at 12:51
in this solution I will get the rows from the source 2 as well ? – trillion Nov 02 '22 at 08:24
could you also explain what does the ANY_VALUE and STRUCT is basically doing there ? – trillion Nov 02 '22 at 08:25

comparing two sources columns values in bigquery

3 Answers3

Linked