0

This is using Postgres 12

I have a table that stages an import from a spreadsheet. It has a structure like

code   | yes  | no
XX1000 | yes  | null
ZX1001 | null | no

I am trying to get all the results in a query so I can do other stuff with it.

When I run

select substring(code, 3), 'Y' from table1 where yes = 'yes' and no is null

I get the correct number of results (lets say 100).

When I run

select substring(code, 3), 'N' from table1 where yes is null and no is not null

I get the proper number of results (lets say 6)

If I run

select substring(code, 3), 'Y' from table1 where yes = 'yes' and no is null
UNION
select substring(code, 3), 'N' from table1 where yes is null and no is not null

I get 102 results, with 4 results missing from the yes query. Extracting all the results and comparing in Excel, I can see that there are not any values in the substring results that duplicate for both queries (e.g. each id is in the result set of the yes query once). I can also guarantee there are no overlapping substring(code, 3) values since the spreadsheet is populated from a different system where the values after the two characters are the id column in the table (I also verified the second query run separately returns distinct values compared to the first query).

Running a UNION ALL gives me 106 results that are all unique.

What is going on here? I am so confused why the UNION is dropping unique results.

scottr
  • 65
  • 1
  • 6
  • 2
    This is kind of hard to believe. Can you strip down the table which matches the 'yes' query down to, say, 6 rows which reproduce the issue, then post all 12 rows? – jjanes May 13 '20 at 00:38
  • I've seen unexpected behavior in UNION in other versions of PostgreSQL. I've never bothered to figure out what's going on. I usually just use UNION ALL – bfris May 13 '20 at 00:54
  • I bet there are some duplicate rows in your result that you just didn't notice. Remember that `UNION` treats two NULLs as equal. – Laurenz Albe May 13 '20 at 03:29
  • @LaurenzAlbe If I am guaranteed that only 1 of the columns yes or no is populated, how would the queries above produce a dupe record? The dataset is small enough that I have verified the guarantee (e.g. one of the columns is populated, one is null). – scottr May 13 '20 at 12:39
  • @jjanes that's a good idea. I will pair it down to those records I know are being removed and see if they are still removed when they are the only ones in the first dataset. – scottr May 13 '20 at 12:39
  • Right. One of the two result sets will contain a duplicate. – Laurenz Albe May 13 '20 at 12:41
  • Are you also guaranteed that the code column in unique first 3 characters. For example a row containing "XX1001|yes|null" would produce a duplicate with 'XX1000|yes|null". Since you only select 3 characters (not the full code) both produce "XX1|yes|null" UNION would discard 1 of then but UNION ALL would retain both. – Belayer May 13 '20 at 21:08
  • The root of the confusion / the misunderstanding is most likely this: `UNION` not only eliminates duplicates between the left and right set, but ***also*** duplicates ***within*** each set. – Erwin Brandstetter May 22 '20 at 01:22

1 Answers1

0

According documentation UNION effectively appends the result of query2 to the result of query1 (although there is no guarantee that this is the order in which the rows are actually returned). Furthermore, it eliminates duplicate rows from its result, in the same way as DISTINCT, unless UNION ALL is used.

Slava Rozhnev
  • 9,510
  • 6
  • 23
  • 39