0

I originally asked this question here.

I'm using the following query to return all duplicate records with the same first and last name. The trick is that the contact_id, has to be in descending order.

Problem is that the database has a few million records in the "contacts" table. They queory takes several minutes to complete.

I have the contact_firstName, contact_lastName, contact_client_id, and contact_id all indexed in the database.

Any other ideas on how this query can be optimized a little further?

SELECT c.contact_id, c.contact_purl, c.contact_firstName, c.contact_lastName, c.contact_organization 
FROM (
        SELECT contact_purl, contact_firstName, contact_lastName, MIN(contact_id) AS MinID 
        FROM contacts 
        WHERE contact_client_id = 1 
        GROUP BY contact_purl HAVING COUNT(contact_id) > 1)  t 
INNER JOIN contacts c 
ON t.contact_purl = c.contact_purl  
AND c.contact_client_id = 1
AND t.MinID <> c.contact_id  
ORDER BY contact_id asc 

EXPLAIN: enter image description here

SCHEMA:

    CREATE TABLE IF NOT EXISTS `contacts` (
  `contact_id` int(11) NOT NULL AUTO_INCREMENT,
  `contact_client_id` int(11) DEFAULT NULL,
  `contact_sales_id` int(11) DEFAULT NULL,
  `contact_campaign_id` int(11) DEFAULT NULL,
  `contact_purl` varchar(100) NOT NULL,
  `contact_purl1` varchar(50) DEFAULT NULL,
  `contact_purl2` varchar(50) DEFAULT NULL,
  `contact_firstName` varchar(50) NOT NULL,
  `contact_lastName` varchar(50) NOT NULL,
  `contact_organization` varchar(100) DEFAULT NULL,
  `contact_url_organization` varchar(200) DEFAULT NULL,
  `contact_position` varchar(50) DEFAULT NULL,
  `contact_email` varchar(100) DEFAULT NULL,
  `contact_phone` varchar(20) DEFAULT NULL,
  `contact_fax` varchar(20) NOT NULL,
  `contact_address1` varchar(100) DEFAULT NULL,
  `contact_address2` varchar(100) DEFAULT NULL,
  `contact_city` varchar(100) DEFAULT NULL,
  `contact_state` varchar(20) DEFAULT NULL,
  `contact_zip` varchar(10) DEFAULT NULL,
  `contact_IP` varchar(50) DEFAULT NULL,
  `contact_timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `contact_pw` varchar(200) NOT NULL,
  `contact_subscribed` varchar(1) NOT NULL DEFAULT 'Y',
  `contact_import` varchar(200) DEFAULT NULL,
  `contacts_c_1` varchar(500) DEFAULT NULL,
  `contacts_c_2` varchar(500) DEFAULT NULL,
  `contacts_c_3` varchar(500) DEFAULT NULL,
  `contacts_c_4` varchar(500) DEFAULT NULL,
  `contacts_c_5` varchar(500) DEFAULT NULL,
  `contacts_c_6` varchar(500) DEFAULT NULL,
  `contacts_c_7` varchar(500) DEFAULT NULL,
  `contacts_c_8` varchar(500) DEFAULT NULL,
  `contacts_c_9` varchar(500) DEFAULT NULL,
  `contacts_c_10` varchar(500) DEFAULT NULL,
  `contacts_c_11` varchar(500) DEFAULT NULL,
  `contacts_c_12` varchar(500) DEFAULT NULL,
  `contacts_c_13` varchar(500) DEFAULT NULL,
  `contacts_c_14` varchar(500) DEFAULT NULL,
  `contacts_c_15` varchar(500) DEFAULT NULL,
  `contacts_c_16` varchar(500) DEFAULT NULL,
  `contacts_c_17` varchar(500) DEFAULT NULL,
  `contacts_c_18` varchar(500) DEFAULT NULL,
  `contacts_c_19` varchar(500) DEFAULT NULL,
  `contacts_c_20` varchar(500) DEFAULT NULL,
  `contacts_c_21` varchar(500) DEFAULT NULL,
  `contacts_c_22` varchar(500) DEFAULT NULL,
  `contacts_c_23` varchar(500) DEFAULT NULL,
  `contacts_c_24` varchar(500) DEFAULT NULL,
  `contacts_c_25` varchar(500) DEFAULT NULL,
  `contacts_c_26` varchar(500) DEFAULT NULL,
  `contacts_c_27` varchar(500) DEFAULT NULL,
  `contacts_c_28` varchar(500) DEFAULT NULL,
  `contacts_c_29` varchar(500) DEFAULT NULL,
  `contacts_c_30` varchar(500) DEFAULT NULL,
  `contacts_c_31` varchar(500) DEFAULT NULL,
  `contacts_c_32` varchar(500) DEFAULT NULL,
  `contacts_c_33` varchar(500) DEFAULT NULL,
  `contacts_c_34` varchar(500) DEFAULT NULL,
  `contacts_c_35` varchar(500) DEFAULT NULL,
  `contacts_c_36` varchar(500) DEFAULT NULL,
  `contacts_c_37` varchar(500) DEFAULT NULL,
  `contacts_c_38` varchar(500) DEFAULT NULL,
  `contacts_c_39` varchar(500) DEFAULT NULL,
  `contacts_c_40` varchar(500) DEFAULT NULL,
  `contacts_c_41` varchar(500) DEFAULT NULL,
  `contacts_c_42` varchar(500) DEFAULT NULL,
  `contacts_c_43` varchar(500) DEFAULT NULL,
  `contacts_c_44` varchar(500) DEFAULT NULL,
  `contacts_c_45` varchar(500) DEFAULT NULL,
  `contacts_c_46` varchar(500) DEFAULT NULL,
  `contacts_c_47` varchar(500) DEFAULT NULL,
  `contacts_c_48` varchar(500) DEFAULT NULL,
  `contacts_c_49` varchar(500) DEFAULT NULL,
  `contacts_c_50` varchar(500) DEFAULT NULL,
  `contacts_i_1` varchar(100) DEFAULT NULL,
  `contacts_i_2` varchar(100) DEFAULT NULL,
  `contacts_i_3` varchar(100) DEFAULT NULL,
  `contacts_i_4` varchar(100) DEFAULT NULL,
  `contacts_i_5` varchar(100) DEFAULT NULL,
  `contacts_i_6` varchar(100) DEFAULT NULL,
  `contacts_i_7` varchar(100) DEFAULT NULL,
  `contacts_i_8` varchar(100) DEFAULT NULL,
  `contacts_i_9` varchar(100) DEFAULT NULL,
  `contacts_i_10` varchar(100) DEFAULT NULL,
  `contacts_i_11` varchar(100) DEFAULT NULL,
  `contacts_i_12` varchar(100) DEFAULT NULL,
  `contacts_i_13` varchar(100) DEFAULT NULL,
  `contacts_i_14` varchar(100) DEFAULT NULL,
  `contacts_i_15` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`contact_id`),
  KEY `contact_campaign_id` (`contact_campaign_id`),
  KEY `contact_client_id` (`contact_client_id`),
  KEY `contact_purl2` (`contact_purl2`),
  KEY `contact_purl1` (`contact_purl1`),
  KEY `contact_purl` (`contact_purl`)
)
Community
  • 1
  • 1
Marty Thomas
  • 857
  • 2
  • 9
  • 18
  • 2
    Query plan and schema, please. –  Jun 29 '12 at 18:46
  • 2
    Quick tip: in MySQL, `GROUP BY` automatically applies `ORDER BY` because ... noone knows. To prevent this behavior to speed up your query a bit, add `ORDER BY NULL` to the grouped query. – biziclop Jun 29 '12 at 18:52
  • I added the table schema and query plan (explain) to the question... – Marty Thomas Jul 02 '12 at 17:22

2 Answers2

1

You could do it simply making a join on itself, like this:

SELECT DISTINCT c1.contact_id, c1.contact_firstName, c1.contact_lastName,
       RIGHT(c1.contact_lastName,1) AS nameNum
FROM
      contacts c1 INNER JOIN contacts c2
        ON c1.contact_firstName = c2.contact_firstName
       AND c1.contact_lastName  = c2.contact_lastName
       AND c2.contact_client_id = 1
       AND c1.contact_id <> c2.contact_id
ORDER BY c1.contact_id DESC
Francisco Spaeth
  • 23,493
  • 7
  • 67
  • 106
0

Would it make sense to keep track of the number of contacts for each client separately and store it in another column vs doing that sub-select query every time?

With that number of records, it might be more efficient to store some of those calculations instead of having to query them in real time.

Jason Siffring
  • 1,321
  • 1
  • 9
  • 7