0

I am working with a datawarehouse table that has can be split into claimed rows, and computed rows. I suspect that the computed rows are perfect duplicates of the claimed row (with the exception of the claimed/computed column).

I tried to test this using the except clause:

But all of the records were returned. I don't believe that this is possible, and I suspect it's due to null values.

Is there a way to compare the records which will compare nulls to nulls?

SELECT a, b, c FROM table WHERE clm_cmp_cd = 'clm'
    EXCEPT
SELECT a, b, c FROM table WHERE clm_cmp_cd = 'cmp'

But all of the records were returned. I don't believe that this is possible, and I suspect it's due to null values.

Is there a way to compare the records which will compare nulls to nulls?

edit: the solution should work with an arbitrary number of fields, with varying types. In this case, I have ~100 fields, 2/3 of which may have null values. This is a data warehouse, and some degree of denormalization must be expected.

edit: I tested the query while limiting myself to non-null columns, and I got the result I expected (nothing). But, I would still like to compare fields which potentially contain null values.

Joshua Farina
  • 175
  • 1
  • 10
  • could try also with `INTERSECT` instead of `EXCEPT`. Look also at https://stackoverflow.com/questions/40680708/db2-select-except-with-where-clause – Traian GEICU Jun 29 '19 at 18:42
  • Possible duplicate of [DB2 SELECT EXCEPT with WHERE clause](https://stackoverflow.com/questions/40680708/db2-select-except-with-where-clause) – Traian GEICU Jun 29 '19 at 18:44
  • Intersect gets a count of matches, which suffers from the same problem. I'm certain that there are duplicates, but they aren't easily identified (I suspect this is due to null values) – Joshua Farina Jun 29 '19 at 19:40

2 Answers2

0

Your supposition would appear to be false. You might try this:

select a, b, c,
       sum(case when clm_cmp_cd = 'clm' then 1 else 0 end) as num_clm,
       sum(case when clm_cmp_cd = 'cmp' then 1 else 0 end) as num_cmp
from t
group by a, b, c;

This will show you the values of the three columns and the number of matches of each type.

Your problem is probably that values that look alike are not exactly the same. This could be due to slight differences in floating point number or due to unmatched characters in the string, such as leading spaces.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • I'm working with an arbitrary number of fields (in this case, 106). They can be of any type. This solution would show that there is a matched pair, but it doesn't tell me whether the field values have changed. – Joshua Farina Jun 29 '19 at 19:49
  • Also, my hypothesis is that the rows are perfect duplicates (including floats, whitespace, etc...) – Joshua Farina Jun 29 '19 at 21:01
  • @JoshuaFarina What does it mean - `but it doesn't tell me whether the field values have changed`? Can you provide some sample data (not for all 106 columns, of course :)), and the result expected? – Mark Barinstein Jun 30 '19 at 10:07
  • Thank you. I think I needed to think about what you're code was doing for a bit. For some reason, it wasn't intuitive to me. It's actually been very helpful to group by all ~100 fields, and helps isolate where the discrepancies are (a timestamp in milliseconds). – Joshua Farina Jun 30 '19 at 15:09
0

Let's look how Db2 works with NULL values in GROUP BY and INTERSECT:

with t(a, b, clm_cmp_cd) as (values
  (   1, 1, 'clm')
, (   1, 1, 'cmp')
, (null, 1, 'clm')
, (null, 1, 'cmp')
, (   2, 1, 'cmp')
)
select a, b
from t
where clm_cmp_cd='clm'
  intersect
select a, b
from t
where clm_cmp_cd='cmp';

with t(a, b, clm_cmp_cd) as (values
  (   1, 1, 'clm')
, (   1, 1, 'cmp')
, (null, 1, 'clm')
, (null, 1, 'cmp')
, (   2, 1, 'cmp')
)
select a, b
from t
where clm_cmp_cd in ('clm', 'cmp')
group by a, b
having count(1)>1;

Both queries return the same result:

     A  B
    -- --
     1  1
<null>  1

NULL values are treated as the same by these operators.
If you have too many columns in your table to specify them manually in your query, you may produce the column list with the following query:

select listagg(colname, ', ')
from syscat.columns
where tabschema='MYSCHEMA' and tabname='TABLE' and colname<>'CLM_CMP_CD';
Mark Barinstein
  • 11,456
  • 2
  • 8
  • 16