Select all rows containing duplicate values in one of two columns from within distinct groups of related records

Question

I'm trying to create a MySQL query that will return all individual rows (not grouped) containing duplicate values from within a group of related records. By 'groups of related records' I mean those with the same account number (per the sample below).

Basically, within each group of related records that share the same distinct account number, select just those rows whose values for the date or amount columns are the same as another row's values within that account's group of records. Values should only be considered duplicate from within that account's group. The sample table and ideal output details below should clear things up.

Also, I'm not concerned with any records with a status of X being returned, even if they have duplicate values.

Small sample table with relevant data:

id   account   invoice   date         amount   status
1    1         1         2012-04-01   0        X
2    1         2         2012-04-01   120      P
3    1         2         2012-05-01   120      U
4    1         3         2012-05-01   117      U
5    2         4         2012-04-01   82       X
6    2         4         2012-05-01   82       U
7    2         5         2012-03-01   81       P
8    2         6         2012-05-01   80       U
9    3         7         2012-03-01   80       P
10   3         8         2012-04-01   79       U
11   3         9         2012-04-01   78       U

Ideal output returned from desired SQL query:

id   account   invoice   date         amount   status
2    1         2         2012-04-01   120      P
3    1         2         2012-05-01   120      U
4    1         3         2012-05-01   117      U
6    2         4         2012-05-01   82       U
8    2         6         2012-05-01   80       U
10   3         8         2012-04-01   79       U
11   3         9         2012-04-01   78       U

Thus, row 7/9 and 8/9 should not both be returned because their duplicate values are not considered duplicate from within the scope of their respective accounts. However, row 8 should be returned because it shares a duplicate value with row 6.

Later, I may want to further hone the selection by grabbing only duplicate rows that have matching statuses, thus row 2 would be excluded because it does't match the other two found within that account's group of records. How much more difficult would that make the query? Would it just be a matter of adding a WHERE or HAVING clause, or is it more complicated than that?

I hope my explanation of what I'm trying to accomplish makes sense. I've tried using INNER JOIN but that returns each desired row more than once. I don't want duplicates of duplicates.

Table Structure and Sample Values:

CREATE TABLE payment (
  id int(11) NOT NULL auto_increment,
  account int(10) NOT NULL default '0',
  invoice int(10) NOT NULL default '0',
  date date NOT NULL default '0000-00-00',
  amount int(10) NOT NULL default '0',
  status char(1) NOT NULL default '',
  PRIMARY KEY  (id)
);

INSERT INTO payment VALUES (1, 1, 1, '2012-04-01', 0, 'X'); 
INSERT INTO payment VALUES (2, 1, 2, '2012-04-01', 120, 'P'); 
INSERT INTO payment VALUES (3, 1, 2, '2012-05-01', 120, 'U'); 
INSERT INTO payment VALUES (4, 1, 3, '2012-05-01', 117, 'U'); 
INSERT INTO payment VALUES (5, 2, 4, '2012-04-01', 82, 'X'); 
INSERT INTO payment VALUES (6, 2, 4, '2012-05-01', 82, 'U'); 
INSERT INTO payment VALUES (7, 2, 5, '2012-03-01', 81, 'p'); 
INSERT INTO payment VALUES (8, 2, 6, '2012-05-01', 80, 'U'); 
INSERT INTO payment VALUES (9, 3, 7, '2012-03-01', 80, 'U'); 
INSERT INTO payment VALUES (10, 3, 8, '2012-04-01', 79, 'U'); 
INSERT INTO payment VALUES (11, 3, 9, '2012-04-01', 78, 'U');

Sure thing, boss. Added above! Also added the sample values below it. :) — purefusion, May 03 '12 at 13:41
Why isn't the row of `id` 1 included in the output? Its date matches another date within its account group. — Matt Fenwick, May 03 '12 at 13:48
Yeah, sorry forgot to mention need to exclude status='X' from the list. Edited. Also fixed a few inconsistencies in the INSERT sample data, so refresh that if you are using it. :) — purefusion, May 03 '12 at 13:51

Matt Fenwick · Accepted Answer · 2012-05-03T13:59:27.873

10

This type of query can be implemented as a semi join.

Semijoins are used to select rows from one of the tables in the join.

For example:

select distinct l.*
from payment l
inner join payment r
on 
  l.id != r.id and l.account = r.account and
  (l.date = r.date or l.amount = r.amount)
where l.status != 'X' and r.status != 'X'
order by l.id asc;

Note the use of distinct, and that I'm only selecting columns from the left table. This ensures that there are no duplicates.

The join condition checks that:

it's not joining a row to itself (l.id != r.id)
rows are in the same account (l.account = r.account)
and either the date or the amount is the same (l.date = r.date or l.amount = r.amount)

For the second part of your question, you would need to update the on clause in the query.

edited May 03 '12 at 13:59

answered May 03 '12 at 13:53

Matt Fenwick

48,199
22
128
192

Nice, now the only issue is that some rows appear out of order. Simple fix via `ORDER BY id ASC`? – purefusion May 03 '12 at 14:01
Your first version worked a treat, when the ORDER BY was added. Now this version fails to return rows 10/11 for some reason. – purefusion May 03 '12 at 14:13
@purefusion are you sure? It returns them on my machine. I simply copied and pasted your `create table` and `inserts`, and this query. – Matt Fenwick May 03 '12 at 14:22
@purefusion actually I left off `TYPE=InnoDB AUTO_INCREMENT=11` because the `type = InnoDB` causes a syntax error on mine (5.5.11). – Matt Fenwick May 03 '12 at 14:28
Hmm, I deleted the table and recreated it using my CREATE/INSERT code above, then copied your query and ran it on that table. The first time it didn't return 10/11, but the second time I recreated the table it worked just fine! Odd stuff... – purefusion May 03 '12 at 14:30
@purefusion out of curiousity, what version of MySQL are you using? – Matt Fenwick May 03 '12 at 14:31
Heh, I just looked and apparently this test server is running 4.0.20 but it's only used for testing. I should advise IT to upgrade it, since all our production servers are running 5.1.61 – purefusion May 03 '12 at 14:37

score 3 · Answer 2 · answered May 03 '12 at 14:07

3

This seems to work

select * from payment p1
join payment p2 on
(p1.id != p2.id 
 and p1.status != 'X'
 and p1.account = p2.account
 and (p1.amount = p2.amount or p1.date = p2.date))
group by p1.id

http://sqlfiddle.com/#!2/a50e9/3

answered May 03 '12 at 14:07

goat

31,486
7
73
96

This returns all columns twice, and also fails to return rows 10/11. :\ – purefusion May 03 '12 at 14:15
It produces the rows for id's 10 and 11. Check the fiddle, and use `p1.*` – goat May 03 '12 at 14:20
Yep, works on my table now, after recreating the table. Must've been a fluke. Well, I definitely appreciate seeing another approach. Thanks! :) – purefusion May 03 '12 at 14:34

Select all rows containing duplicate values in one of two columns from within distinct groups of related records

Small sample table with relevant data:

Ideal output returned from desired SQL query:

Table Structure and Sample Values:

2 Answers2