0

I have a perfect code that compares the data from one table with another (see below) which works totally fine and runs fine as well in BigQuery:

with source1 as (
select 
b.id, 
b.qty, 
a.price 
from <table> as a
,unnest <details> as b
 where b.status != 'canceled'
),

source2 as (
select id_, qty_, price_  from <table2>
where city != 'delhi'
) 

select *
from source1 s1
full outer join source2 s2
on id = id_
where format('%t', s1) != format('%t', s2)

However, the code above runs into an error in sqlfluff i.e a certain SQL formatting rules checker that I can't bypass or turn off, see the error from sqlfluff below:

ERROR FROM SQLFLUFF:

*'s1' found in select with more than one referenced table/view' and 's2' found in select with more than one referenced table/view

Does anybody know how I can fix it ?

trillion
  • 1,207
  • 1
  • 5
  • 15
  • you should provide sample data and show what is missing - this is the only way I see to be able to help you – Mikhail Berlyant Nov 18 '22 at 04:28
  • I have added more description and the sample data @MikhailBerlyant – trillion Nov 18 '22 at 21:55
  • it is kind of lets say an extension to this question https://stackoverflow.com/questions/74205934/comparing-two-sources-columns-values-in-bigquery that you answered previously – trillion Nov 18 '22 at 21:57
  • looks like the issue is because of use of that 3rd party tool that I am not familiar with, so not sure how I can help further :o( – Mikhail Berlyant Nov 18 '22 at 22:37
  • @MikhailBerlyant Can you help me with the second issue ? that is tackling the where clauses ? – trillion Nov 20 '22 at 15:02
  • @MikhailBerlyant I have shared the sample data and the expected results, the issue is with the pending column and the city column in only one of the tables, which makes the serialization not work properly if I add those column to select statements, do you know how can I achieve the expected results as shown in the picture ? – trillion Nov 20 '22 at 15:04
  • or should i add it as a separate question on StackOverflow ? – trillion Nov 20 '22 at 15:27
  • i think the more nuances you are adding to the same question - the less chances you have to get answer as it becomes too messy. you should be able to split your problems/issues to the separate smaller and more manageable questions and post them as a separate question(s) not as an extension/continuation/follow-up/etc. :o) – Mikhail Berlyant Nov 20 '22 at 17:42
  • @MikhailBerlyant Yes, you are right. Apologies for that. I have added a new question :) here is the link: https://stackoverflow.com/questions/74511941/comparison-of-columns-between-two-tables-with-different-filters – trillion Nov 20 '22 at 21:26

2 Answers2

0

Windows function may perform better.

Instead of joining, the tables are unioned. Then a window function will search for combinations. As you tagged this question [Big-Query] it is tested for BigQuery:

with s1 as (
select id, qty, city from <table> where x != 'pending'
),

s2 as (
select id_, qty_, city_  from <table2>
),

concat_ as (
  select 1 as dummy, * , format('%t',s1) as dummy_all
  from s1
union all select 2 as dummy, * , format('%t',s2) as dummy_all
  from s2
)
,combine as (
  select *,
    sum(if(dummy=1,1,0)) over win1 = sum(if(dummy=2,1,0)) over win1 as dummy_flag
  from concat_
  window win1 as (partition by id,dummy_all)
  )

  Select * from combine
  where dummy_flag is false
Samuel
  • 2,923
  • 1
  • 4
  • 19
0

I fixed the code by adding extra CTE and adjusting the where caluse this would not flag the sqlfluff rules:

To fix that, I tried the second code (see below):

  • added additional ctes
  • adjusted the where clause so that I can get all the rows where the id exists in one source but not in another and vice versa

the code seems to work but it would be great if someone could suggest how I can reduce the CTEs, considering the sqlfluff will not fail:

with s1 as (
select 
b.id, 
b.qty, 
a.price 
from <table> as a
,unnest <details> as b
 where b.status != 'canceled'
),

s2 as (
select id_, qty_, price_  from <table2>
where city != 'delhi'
),

,concat_s1 as (
  select
    *
    , format('%t',s1) as l1
  from s1
)

,concat_s2 as (

  select
    *
    , format('%t',s2) as l2
  from s2
)

, combined as (
  select
    source1.*
    ,source2.*
  from concat_s1 as source1
  full outer join concat_s2 as source2
    on source1.id_ = source2.id_
  where source1.l1 != source2.l2
    or source1.id is null or source2.id_ is null
trillion
  • 1,207
  • 1
  • 5
  • 15