Why query within query results in duplicate records

Question

I am unable to understand this.

SELECT COUNT(*) FROM profiles
WHERE profiles.status IN ('abc', 'man')
  AND profiles.id IN (
                        SELECT artifacts.item_id FROM artifacts
                        WHERE artifacts.deleted_at IS NULL
                          AND artifacts.item_type = 'Profile'
                          AND artifacts.upload_type = 'bill'
                     );
 count 
-------
 12514
(1 row)

Above query counts duplicate records of profiles (for which artifacts have multiple records). When I run the above query with distinct I get correct count which is below.

SELECT COUNT(DISTINCT(id)) FROM profiles
WHERE profiles.status IN ('abc', 'man')
  AND profiles.id IN (
                        SELECT artifacts.item_id FROM artifacts
                        WHERE artifacts.deleted_at IS NULL
                          AND artifacts.item_type = 'Profile'
                          AND artifacts.upload_type = 'bill'
                     );
 count 
-------
 12157
(1 row)

Artifacts can have more than one records for same profile. But as per my understanding IN query will not let any duplicate profiles to come in count. Am I right? or is there any thing I am missing?

UPDATE:

I tried to reduce the query to 2 different filtering conditions. Both conditions works fine. Please find below.

=> SELECT COUNT(*) FROM profiles WHERE profiles.id IN (
            SELECT artifacts.item_id FROM artifacts 
            WHERE artifacts.deleted_at IS NULL 
            AND artifacts.item_type = 'Profile' 
            AND artifacts.upload_type = 'bill');
 count 
-------
 22664
(1 row)

=> SELECT COUNT(DISTINCT(id)) FROM profiles WHERE profiles.id IN (
            SELECT artifacts.item_id FROM artifacts 
            WHERE artifacts.deleted_at IS NULL 
            AND artifacts.item_type = 'Profile' 
            AND artifacts.upload_type = 'bill');
 count 
-------
 22664
(1 row)


=> SELECT COUNT(DISTINCT(id)) FROM profiles 
        WHERE profiles.status IN ('abc', 'man');
 count 
-------
 20109
(1 row)

=> SELECT COUNT(*) FROM profiles 
        WHERE profiles.status IN ('abc', 'man');
 count 
-------
 20109

So duplication occurs when two IN queries used in conjuction. Is any one familiar with such use case.

Here's a stupid question - any chance you have duplicate ids in the profiles table? — paqash, Feb 23 '17 at 09:00
try running `SELECT id FROM profiles... except SELECT distinct id FROM profiles..` to get the list of "not distinct id"?.. — Vao Tsun, Feb 23 '17 at 09:26

score 0 · Answer 1 · answered Feb 23 '17 at 09:16

There are two possibilities:

id is not unique in profiles.

You could run the following query to investigate this:

SELECT profiles.id, count(*) FROM profiles
WHERE profiles.status IN ('abc', 'man')
  AND profiles.id IN (
                        SELECT artifacts.item_id FROM artifacts
                        WHERE artifacts.deleted_at IS NULL
                          AND artifacts.item_type = 'Profile'
                          AND artifacts.upload_type = 'bill'
                     )
GROUP BY profiles.id
HAVING count(*) > 1;

That will return the ids that are duplicate.

Are you missing a UNIQUE or PRIMARY KEY constraint on that column?

If there is a UNIQUE or PRIMARY KEY constraint on id, you are facing data corruption. Look at the query plan – does it use index scans or sequential scans?

If setting enable_indexscan, enable_bitmapscan and enable_indexonlyscan to off fixes the problem, you have a corrupted index. REINDEX TABLE profiles will probably fix the problem.

If the query also returns bad results if only sequential scans are used, you are facing table corruption. Restore from the last good backup.

In any case, if it was data corruption, find its cause and fix it. It might be flawed RAM or storage, or a server crash on storage that does not honor sync requests properly. Read the database logs!

Why query within query results in duplicate records

1 Answers1