6

I'm trying to create a MySQL query that will return all individual rows (not grouped) containing duplicate values from within a group of related records. By 'groups of related records' I mean those with the same account number (per the sample below).

Basically, within each group of related records that share the same distinct account number, select just those rows whose values for the date or amount columns are the same as another row's values within that account's group of records. Values should only be considered duplicate from within that account's group. The sample table and ideal output details below should clear things up.

Also, I'm not concerned with any records with a status of X being returned, even if they have duplicate values.

Small sample table with relevant data:

id   account   invoice   date         amount   status
1    1         1         2012-04-01   0        X
2    1         2         2012-04-01   120      P
3    1         2         2012-05-01   120      U
4    1         3         2012-05-01   117      U
5    2         4         2012-04-01   82       X
6    2         4         2012-05-01   82       U
7    2         5         2012-03-01   81       P
8    2         6         2012-05-01   80       U
9    3         7         2012-03-01   80       P
10   3         8         2012-04-01   79       U
11   3         9         2012-04-01   78       U

Ideal output returned from desired SQL query:

id   account   invoice   date         amount   status
2    1         2         2012-04-01   120      P
3    1         2         2012-05-01   120      U
4    1         3         2012-05-01   117      U
6    2         4         2012-05-01   82       U
8    2         6         2012-05-01   80       U
10   3         8         2012-04-01   79       U
11   3         9         2012-04-01   78       U

Thus, row 7/9 and 8/9 should not both be returned because their duplicate values are not considered duplicate from within the scope of their respective accounts. However, row 8 should be returned because it shares a duplicate value with row 6.

Later, I may want to further hone the selection by grabbing only duplicate rows that have matching statuses, thus row 2 would be excluded because it does't match the other two found within that account's group of records. How much more difficult would that make the query? Would it just be a matter of adding a WHERE or HAVING clause, or is it more complicated than that?

I hope my explanation of what I'm trying to accomplish makes sense. I've tried using INNER JOIN but that returns each desired row more than once. I don't want duplicates of duplicates.

Table Structure and Sample Values:

CREATE TABLE payment (
  id int(11) NOT NULL auto_increment,
  account int(10) NOT NULL default '0',
  invoice int(10) NOT NULL default '0',
  date date NOT NULL default '0000-00-00',
  amount int(10) NOT NULL default '0',
  status char(1) NOT NULL default '',
  PRIMARY KEY  (id)
);

INSERT INTO payment VALUES (1, 1, 1, '2012-04-01', 0, 'X'); 
INSERT INTO payment VALUES (2, 1, 2, '2012-04-01', 120, 'P'); 
INSERT INTO payment VALUES (3, 1, 2, '2012-05-01', 120, 'U'); 
INSERT INTO payment VALUES (4, 1, 3, '2012-05-01', 117, 'U'); 
INSERT INTO payment VALUES (5, 2, 4, '2012-04-01', 82, 'X'); 
INSERT INTO payment VALUES (6, 2, 4, '2012-05-01', 82, 'U'); 
INSERT INTO payment VALUES (7, 2, 5, '2012-03-01', 81, 'p'); 
INSERT INTO payment VALUES (8, 2, 6, '2012-05-01', 80, 'U'); 
INSERT INTO payment VALUES (9, 3, 7, '2012-03-01', 80, 'U'); 
INSERT INTO payment VALUES (10, 3, 8, '2012-04-01', 79, 'U'); 
INSERT INTO payment VALUES (11, 3, 9, '2012-04-01', 78, 'U');
purefusion
  • 943
  • 1
  • 15
  • 23

2 Answers2

10

This type of query can be implemented as a semi join.

Semijoins are used to select rows from one of the tables in the join.

For example:

select distinct l.*
from payment l
inner join payment r
on 
  l.id != r.id and l.account = r.account and
  (l.date = r.date or l.amount = r.amount)
where l.status != 'X' and r.status != 'X'
order by l.id asc;

Note the use of distinct, and that I'm only selecting columns from the left table. This ensures that there are no duplicates.

The join condition checks that:

  • it's not joining a row to itself (l.id != r.id)
  • rows are in the same account (l.account = r.account)
  • and either the date or the amount is the same (l.date = r.date or l.amount = r.amount)

For the second part of your question, you would need to update the on clause in the query.

Matt Fenwick
  • 48,199
  • 22
  • 128
  • 192
  • Nice, now the only issue is that some rows appear out of order. Simple fix via `ORDER BY id ASC`? – purefusion May 03 '12 at 14:01
  • Your first version worked a treat, when the ORDER BY was added. Now this version fails to return rows 10/11 for some reason. – purefusion May 03 '12 at 14:13
  • @purefusion are you sure? It returns them on my machine. I simply copied and pasted your `create table` and `inserts`, and this query. – Matt Fenwick May 03 '12 at 14:22
  • @purefusion actually I left off `TYPE=InnoDB AUTO_INCREMENT=11` because the `type = InnoDB` causes a syntax error on mine (5.5.11). – Matt Fenwick May 03 '12 at 14:28
  • Hmm, I deleted the table and recreated it using my CREATE/INSERT code above, then copied your query and ran it on that table. The first time it didn't return 10/11, but the second time I recreated the table it worked just fine! Odd stuff... – purefusion May 03 '12 at 14:30
  • @purefusion out of curiousity, what version of MySQL are you using? – Matt Fenwick May 03 '12 at 14:31
  • Heh, I just looked and apparently this test server is running 4.0.20 but it's only used for testing. I should advise IT to upgrade it, since all our production servers are running 5.1.61 – purefusion May 03 '12 at 14:37
3

This seems to work

select * from payment p1
join payment p2 on
(p1.id != p2.id 
 and p1.status != 'X'
 and p1.account = p2.account
 and (p1.amount = p2.amount or p1.date = p2.date))
group by p1.id

http://sqlfiddle.com/#!2/a50e9/3

goat
  • 31,486
  • 7
  • 73
  • 96