Is it possible to find duplicating records in two columns simultaneously in PostgreSQL?

Question

I have the following database schema (oversimplified):

create sequence partners_partner_id_seq;


create table partners
(
    partner_id      integer      default nextval('partners_partner_id_seq'::regclass) not null primary key,
    name            varchar(255) default NULL::character varying,
    company_id      varchar(20)  default NULL::character varying,
    vat_id          varchar(50)  default NULL::character varying,
    is_deleted      boolean      default false not null
);

INSERT INTO partners(name, company_id, vat_id) VALUES('test1','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test2','1010109191191', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test3','3214567890102', 'BG1010109191192');
INSERT INTO partners(name, company_id, vat_id) VALUES('test4','9999999999999', 'GE9999999999999');

I am trying to figure out how to return test1, test2 (because the company_id column value duplicates vertically) and test3 (because the vat_id column value duplicates vertically as well).

To put it in other words - I need to find duplicating company_id and vat_id records and group them together, so that test1, test2 and test3 would be together, because they duplicate by company_id and vat_id.

So far I have the following query:

SELECT *
FROM (
    SELECT *, LEAD(row, 1) OVER () AS nextrow
    FROM (
        SELECT *, ROW_NUMBER() OVER (w) AS row
        FROM partners
        WHERE is_deleted = false
        AND ((company_id != '' AND company_id IS NOT null) OR (vat_id != '' AND vat_id IS NOT NULL))
        WINDOW w AS (PARTITION BY company_id, vat_id ORDER BY partner_id DESC)
    ) x
) y
WHERE (row > 1 OR nextrow > 1) 
AND is_deleted = false

This successfully shows all company_id duplicates, but does not appear to show vat_id ones - test3 row is missing. Is this possible to be done within one query?

Here is a db-fiddle with the schema, data and predefined query reproducing my result.

Mike Organek · Accepted Answer · 2022-09-28T16:55:20.560

You can do this with recursion, but depending on the size of your data you may want to iterate, instead.

The trick is to make the name just another match key instead of treating it differently than the company_id and vat_id:

create table partners (
  partner_id integer generated always as identity primary key,
  name text,
  company_id text,
  vat_id text,
  is_deleted boolean not null default false
);

insert into partners (name, company_id, vat_id) values 
('test1','1010109191191', 'BG1010109191192'),
('test2','1010109191191', 'BG1010109191192'),
('test3','3214567890102', 'BG1010109191192'),
('test4','9999999999999', 'GE9999999999999'),
('test5','3214567890102', 'BG8888888888888'),
('test6','2983489023408', 'BG8888888888888')
;

I added a couple of test cases and left in the lone partner.

with recursive keys as (
  select partner_id, 
         array['n_'||name, 'c_'||company_id, 'v_'||vat_id] as matcher,
         array[partner_id] as matchlist,
         1 as size
    from partners
), matchers as (
  select *
    from keys
  union all
  select p.partner_id, c.matcher,
         p.matchlist||c.partner_id as matchlist,
         p.size + 1
    from matchers p
         join keys c
           on c.matcher && p.matcher
          and not p.matchlist @> array[c.partner_id]
), largest as (
  select distinct sort(matchlist) as matchlist
    from matchers m
   where not exists (select 1
                       from matchers
                      where matchlist @> m.matchlist
                        and size > m.size)
  -- and size > 1
)
select *
  from largest
;

matchlist
{1,2,3,5,6}
{4}

fiddle

EDIT UPDATE

Since recursion did not perform, here is an iterative example in plpgsql that uses a temporary table:

create temporary table match1 (
  partner_id int not null,
  group_id int not null,
  matchkey uuid not null
);

create index on match1 (matchkey);
create index on match1 (group_id);

insert into match1 
select partner_id, partner_id, md5('n_'||name)::uuid from partners
union all
select partner_id, partner_id, md5('c_'||company_id)::uuid from partners
union all
select partner_id, partner_id, md5('v_'||vat_id)::uuid from partners;

do $$
  declare _cnt bigint;
  begin
    loop
      with consolidate as (
        select group_id, 
               min(group_id) over (partition by matchkey) as new_group_id
          from match1
      ), minimize as (
        select group_id, min(new_group_id) as new_group_id
          from consolidate
         group by group_id
      ), doupdate as (
        update match1
           set group_id = m.new_group_id
          from minimize m
         where m.group_id = match1.group_id
           and m.new_group_id != match1.group_id
        returning *
      )
      select count(*) into _cnt from doupdate;
      if _cnt = 0 then 
        exit;
      end if;
    end loop;
  end;
$$;

updated fiddle

Hey thanks for your answer! This query, however, does not execute on a real PostgreSQL server: ```ERROR: function sort(integer[]) does not exist; Hint: No function matches the given name and argument types. You might need to add explicit type casts. Position: 561``` — tftd, Sep 28 '22 at 15:42
@tftd Sorry. Are you able to load the intarray extension? https://www.postgresql.org/docs/13/intarray.html — Mike Organek, Sep 28 '22 at 15:43
Yeah, after doing `CREATE EXTENSION intarray` it executes. Unfortunately, as you already pointed out, this query has very slow performance on about 100k data set. The query I initially posted returned the duplicates within less than a second and this has been running for more than 5 minutes now. This is supposed to show on a web page so offline processing would not be possible. Nevertheless I will upvote this to appreciate your effort! — tftd, Sep 28 '22 at 15:53
@tftd If you cannot add the intarray extension, just add a CTE to sort: https://dbfiddle.uk/yI_dBDwR — Mike Organek, Sep 28 '22 at 15:53
@tftd Here is an iteration example using a temporary table that should help with performance: https://dbfiddle.uk/-ZLk4BAM — Mike Organek, Sep 28 '22 at 16:42

Is it possible to find duplicating records in two columns simultaneously in PostgreSQL?

1 Answers1