-1

I have 2 tables. t1.uuid has 630,000 distinct values. t2.uuid has 300,000 distinct values.

When I run

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 
WHERE
    t2.uuid IS NULL

There are no results.

Edit1: For clarification: t1 looks like:

uuid
ufo123
abc456
def789

t2 looks like

uuid
ufo123
def789

Every record in t2 has a match in t1. Not every record in t1 has a match in t2 (as indicated by the size of the tables.

I expect the join to result in:

t1.uuid t2.uuid
ufo123 ufo123
abc456 NULL
def789 def789

And the result of my query to by:

t1.uuid t2.uuid
abc456 NULL

But instead I get no results. I'm not sure if there is a communication issue between Redshift and DBEAVER to produce this behavior.

Edit2: I ran the following which resulted in only 300,000 records (same as only t2):

SELECT 
    COUNT(DISTINCT t1.uuid)
FROM 
    t1 --630,000 uuids
    FULL JOIN t2 -- 300,000 uuids
        ON t1.uuid = t2.uuid 

This is completely wrong.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
Syd
  • 65
  • 8
  • Perhaps there are no matches? I'd also try to remove the where clause and see what's happening. Can you share sample dataset? – Pirate X Jul 12 '22 at 18:03
  • @PirateX rather, there are *only* matches. For every `t1.uuid` a `t2.uuid` is available – HoneyBadger Jul 12 '22 at 18:19
  • You should explain why you are expecting different behavior. Please show some sample data and expected results – HoneyBadger Jul 12 '22 at 18:23
  • The only reason this should happen is that t2 contains every single uuid in t1. Sorry to ask, but this is baffling me - are you sure the tables are the right way round? – Pelayo Martinez Jul 12 '22 at 18:33
  • There are matches. When I do an Inner Join, every record from t2 (300,000 records) are returned. When I remove the WHERE clause, it is the same result as the Inner Join. This is very strange behavior. – Syd Jul 12 '22 at 19:00
  • Can you share the DDL for these tables? Specifically are uuid's defined as PKs or in any other way special? – Bill Weiner Jul 12 '22 at 21:25

2 Answers2

1

The problem is that you cannot check NULL with equal sign in Redshift. "NULL = NULL" is false.

You need to expand your JOIN ON clause (I left it general case which isn't needed with you WHERE clause):

SELECT 
    t1.uuid
    , t2.uuid
FROM 
    t1 --630,000 uuids
    LEFT OUTER JOIN t2 -- 300,000 uuids
        ON (t1.uuid = t2.uuid) OR (t1.uuid IS NULL AND t2.uuid IS NULL)
WHERE
    t2.uuid IS NULL

The problem is that when both t1 and t2 have NULL uuid these will all join to each other and could explode your results. So you may want to check this in a different way if there are a lot of NULLs.

As for an explanation let me address the "why" of this. This has to do with the Redshift inferring a WHERE clause from your SQL. It goes like this - if t2.uuid = X and t1.uuid = t2.uuid then Redshift should be able to apply the WHERE clause t1.uuid = X. But this produces the null set give the reasons above. Redshift does this to reduce the scanned data as much as possible.

Update:

Note that you are recreating the EXCEPT clause with this LEFT OUTER JOIN. You will likely get better results with EXCEPT.

Bill Weiner
  • 8,835
  • 2
  • 7
  • 18
  • 1
    `LEFT OUTER JOIN t2 ON t1.uuid= t2.uuid` would produce the same result as your query. – Pirate X Jul 12 '22 at 18:31
  • Yes possibly, I didn't test. The point is that you need to defeat Redshift inferred WHERE clause. I think this will do it but again didn't test. So while logically equivalent in some idealized database, not going to do the same thing in Redshift. – Bill Weiner Jul 12 '22 at 19:29
  • This did not work. Also, a full join does not return all records (see edit2 in my post). This seems to be a redshift/dbeaver specific problem – Syd Jul 12 '22 at 20:01
  • You can check stl_statementtext to see if dbeaver is somehow rewriting the query and if not then this is only a Redshift concern. Given edit2 results looking at table DDL and table data seems to be needed. Are there any PKs defined? What does "select count(distinct uuid) from t1;" yield? – Bill Weiner Jul 13 '22 at 15:36
0

Your logic looks fine. Testing on a sample data. 3 records in table a and 1 record in table b. Output is 2 records that don't match.

WITH t1 AS (SELECT 1 AS col UNION SELECT 2 UNION SELECT 3)
   , t2 AS (SELECT 1 AS col)
SELECT *
FROM t1
     LEFT JOIN t2 ON t1.col = t2.col
where t2.col is null

Output

+------+-------+
|t1.col|t2.col |
+------+-------+
|2     |NULL   |
|3     |NULL   |
+------+-------+

I'd recommend checking the data that it's matching on. As Pelayo Martinez mentioned, it's highly likely your LEFT table is T2 instead of T1. Only then it'll be a no row result.

Pirate X
  • 3,023
  • 5
  • 33
  • 60
  • This did not work. Also, a full join does not return all records (see edit2 in my post). This seems to be a redshift/dbeaver specific problem not an SQL syntax issue. – Syd Jul 12 '22 at 20:02
  • I'm running DBWeaver too and I get correct results. I'm on driver `JDBC version 2.1.0.3`. Perhaps trying updating and then rerunning – Pirate X Jul 12 '22 at 23:43