0

I have a table with about 11 fields (columns), and I know for a fact that there are multiple repeating entries, but they are repeating only within few fields. What I want to do is that if two or more rows have identical ProviderName and Address in lowercase, and identical StateID, then I want to delete those extra rows, and leave just one copy of it. I have something like this, but I have no idea how to complete it.

SET SQL_SAFE_UPDATEs=0;
DELETE FROM providers
WHERE LCASE(ProviderName), LCASE(Address), StateID;

UPDATE:

SET SQL_SAFE_UPDATES=0;
DELETE p1.* 
FROM providers AS p1
JOIN providers AS p2 
ON LCASE(p1.ProviderName) = LCASE(p2.ProviderName)
AND LCASE(p1.Address) = LCASE(p2.Address)
AND p1.Zip = p2.Zip
AND p1.StateID = p2.StateID
AND p1.ProviderId > p2.ProviderId
tett
  • 595
  • 3
  • 13
  • 34

2 Answers2

1

This answer assumes you have a unique id field that distinguishes the rows that have duplicate values. It will keep the row with the lowest id.

DELETE p1.*
FROM providers AS p1
JOIN providers AS p2 
ON LCASE(p1.ProviderName) = LCASE(p2.ProviderName)
AND LCASE(p1.Address) = LCASE(p2.Address)
AND p1.StateID = p2.StateID
AND p1.id > p2.id

Note that if your table uses case-insensitive collation, you don't need to call LCASE.

A more efficient query might be:

DELETE p1.*
FROM providers AS p1
JOIN (SELECT LCASE(ProviderName) AS Name, LCASE(Address) AS Addr, StateID, MIN(id) AS id
      FROM providers
      GROUP BY Name, Addr, StateID
      HAVING COUNT(*) > 1) AS p2
ON LCASE(p1.ProviderName) = p2.Name
AND LCASE(p1.Address) = p2.Addr
AND p1.StateID = p2.StateID
AND p1.id > p2.id

The cross-product is much smaller because we first group all the providers with identical values. Also, the HAVING clause filters out all the non-duplicates.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Judging by your experience wouldn't you mind take a look at my [question](http://stackoverflow.com/questions/28614086/mysql-stored-functions-and-groupwise-min)? – RuslanN Feb 20 '15 at 13:43
  • Can you check my updated code, which is yours actually, when I execute it I get the following message: `Error Code: 1205. Lock wait timeout exceeded; try restarting transaction`. I have unique `ProviderId` for each, also please check this screenshot, when I do select using your code: http://prntscr.com/67jer1 – tett Feb 20 '15 at 14:20
  • How big is your table? This query can't be indexed very easily, so it will be very slow if the table is large. – Barmar Feb 20 '15 at 14:22
  • @Barmar I have around 9 million entries in my table. – tett Feb 20 '15 at 14:23
  • It would help a lot if you used a case-insensitive collation, and had a composite index on the columns you're comparing. You can't index a query that compares the results of a function call. – Barmar Feb 20 '15 at 14:29
  • @Barmar Wit the updated query I got an error message: `Error Code: 2013. Lost connection to MySQL server during query`, I guess it is still taking too long. And how will the query look if we ignored case-sensitivity? You said it will be faster, but how will the best version of it look? I want to try it. – tett Feb 20 '15 at 15:55
  • Just take out all the `LCASE` calls. – Barmar Feb 20 '15 at 16:09
  • What indexes do you have on the table? – Barmar Feb 20 '15 at 21:48
  • ProviderId is my unique primary key, and StateID is a foreign key, which comes from my other table. Here you go: http://prntscr.com/67wyqu – tett Feb 21 '15 at 11:05
  • You need indexes on the ProviderName and Address to make this query faster. – Barmar Feb 22 '15 at 12:15
  • What kind of index I can put to them? Can you give some example? – tett Feb 22 '15 at 22:02
  • `CREATE INDEX name_addr ON provider (ProviderName, Address, StateID)` – Barmar Feb 23 '15 at 03:30
  • I got an error message saying: `Error Code: 1170. BLOB/TEXT column 'ProviderName' used in key specification without a key length` – tett Feb 24 '15 at 00:04
  • Do you really need to use `TEXT` datatype for that field? – Barmar Feb 24 '15 at 00:35
  • I changed them from LONGTEXT to VARCHAR(200), and executed the code that you wrote for creating index, but again I got error message `Lost connection to MySQL server during query.`. Though, afterwards when I looked under the index part of the database, the new index was there. Therefore, I tried to execute the query that you give, but I still get the same error: `Lost connection to MySQL server during query. ` – tett Feb 24 '15 at 15:35
  • Sounds like you need to check the server's log, it sounds like it's crashing. You may be running out of memory for the index. – Barmar Feb 24 '15 at 15:56
  • It this possible with a local db? Because I have local db, it is not hosted anywhere. – tett Feb 24 '15 at 23:17
  • Yes. Even if you're not hosted, there's still a `mysqld` daemon process that acts as the local server. It should have a log. I'm not sure where it's located, though. – Barmar Feb 25 '15 at 05:35
  • I noticed that my DBMS connection read time out (in seconds) in MySQL Workbench was set to 600, and I change it to something higher, like 86400. Afterwards, when I ran the command for creating the index, it finished in about 9 minutes. Afterwards, I ran the command you gave about for deleting the duplicates. It has been running for 3 hours now, and it is still not finished, isn't there any way to speed-up this? you can check it out: http://prntscr.com/6a3g8j – tett Feb 26 '15 at 16:04
  • @Barmar In fact, it just finished, with this error message: http://prntscr.com/6a4mp1 – tett Feb 26 '15 at 17:33
  • Is your Search box not working? Just search for the error message and you'll find: http://stackoverflow.com/questions/6901108/the-total-number-of-locks-exceeds-the-lock-table-size – Barmar Feb 26 '15 at 17:36
0

To leave entry with highest id:

SET SQL_SAFE_UPDATES = 0;
DELETE n1 FROM provider n1, provider n2 WHERE n1.id < n2.id AND   
LCASE(n1.ProviderName) = LCASE(n2.ProviderName) and LCASE(n1.Address) =   
LCASE(n2.Address)   
and n1.StateId = n2.StateId
RuslanN
  • 398
  • 3
  • 18