4

I have the following data :

customer
+---------------------------------------------------------------------------------------------+
| id    | email             | firstname     | lastname      | no_pub    | customer_address    |
|---------------------------------------------------------------------------------------------|
| 1     | martin@domain.com | Martin        | Scorcese      | 0         | 4                   |
| 2     | robert@domain.com | Robert        | De Niro       | 0         | 7                   |
| 3     | bruce@domain;com  | Bruce         | Willis        | 0         | 10                  |
+---------------------------------------------------------------------------------------------+

address
+------------------------------------------+
| id    | city             | zipcode       |
|------------------------------------------|
| 4     | Paris            | 75001         |
| 7     | Marseille        | 13000         |
| 10    | Bordeaux         | 33000         |
+------------------------------------------+

sf_geo_cities
+------------------------------------------+
| id    | region_id    | zipcode           |
|------------------------------------------|
| 1     | 1            | 75001             |
| 2     | 2            | 13000             |
| 2     | 3            | 33000             |
+------------------------------------------+

sf_geo_regions
+------------------------------------------+
| id    | name             | zipcode       |
|------------------------------------------|
| 1     | Ile-de-France    | 75001         |
| 2     | Cote d'Azur      | 13000         |
| 2     | Gironde          | 33000         |
+------------------------------------------+

This is a sample, there are obviously way more data in those tables (300k+ in customer, 400k+ in address) At the end, i'd like to get the following data :

+---------------------------------------------------------------------------------------------+
| id    | email             | firstname     | lastname      | city          | region          |
|---------------------------------------------------------------------------------------------|
| 1     | martin@domain.com | Martin        | Scorcese      | Paris         | Ile-de-France   |
| 2     | robert@domain.com | Robert        | De Niro       | Marseille     | Cote d'Azur     |
| 3     | bruce@domain;com  | Bruce         | Willis        | NULL          | NULL            |
+---------------------------------------------------------------------------------------------+

I tried the following SQL query :

SELECT c.id, c.email, c.firstname, c.lastname, gc.name, gr.name
FROM customer c
LEFT JOIN address ad ON ad.id = c.customer_address
JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE no_pub = 0

But it takes ages, I have to shutdown mysql service. There is probably a problem with joins multiplying rows.

Is there a simple query to fetch the data I expect ?

UPDATE #1 : Here is the EXPLAIN asked in a comment below :

id  select_type     table   type    possible_keys                                           key         key_len     ref                         rows    Extra   
1   SIMPLE          c       ref     PRIMARY,UNIQ_81398E097D3656A4,UNIQ_81398E09E7927C7...   no_pub      1           const                       136220  Using where
1   SIMPLE          a       eq_ref  PRIMARY                                                 PRIMARY     8           evotest.c.account           1       Using index
1   SIMPLE          ad      eq_ref  PRIMARY                                                 PRIMARY     8           evotest.c.customer_address  1       NULL
1   SIMPLE          gc      ALL     zipcode                                                 NULL        NULL        NULL                        38194   Range checked for each record (index map: 0x8)
1   SIMPLE          gr      eq_ref  PRIMARY                                                 PRIMARY     4           evotest.gc.region_id        1       NULL

UPDATE #2 : Database sample Here is my database with minimal sample data.

--
-- Structure de la table `account`
--

CREATE TABLE IF NOT EXISTS `account` (
`id` bigint(20) unsigned NOT NULL,
  `identifier` varchar(255) COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=315688 ;

--
-- Structure de la table `customer`
--

CREATE TABLE IF NOT EXISTS `customer` (
`id` bigint(20) unsigned NOT NULL,
  `account` bigint(20) unsigned NOT NULL,
  `customer_address` bigint(20) unsigned DEFAULT NULL,
  `email` varchar(255) COLLATE utf8_bin NOT NULL,
  `lastname` varchar(255) COLLATE utf8_bin NOT NULL,
  `firstname` varchar(255) COLLATE utf8_bin NOT NULL,
  `no_pub` tinyint(1) NOT NULL DEFAULT '0'
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=315224 ;

--
-- Structure de la table `address`
--

CREATE TABLE IF NOT EXISTS `address` (
`id` bigint(20) unsigned NOT NULL,
  `city` varchar(64) COLLATE utf8_bin DEFAULT NULL,
  `street` varchar(255) COLLATE utf8_bin DEFAULT NULL,
  `complement` varchar(128) COLLATE utf8_bin DEFAULT NULL,
  `zipcode` varchar(16) COLLATE utf8_bin DEFAULT NULL,
  `country_id` int(11) DEFAULT NULL,
  `cedex` tinyint(1) NOT NULL DEFAULT '0',
  `abroad` tinyint(1) NOT NULL DEFAULT '0'
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_bin AUTO_INCREMENT=442743 ;

--
-- Structure de la table `sf_geo_cities`
--

CREATE TABLE IF NOT EXISTS `sf_geo_cities` (
`id` int(11) NOT NULL,
  `region_id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL,
  `slug` varchar(255) NOT NULL,
  `zipcode` varchar(5) NOT NULL,
  `insee_code` int(11) NOT NULL,
  `latitude` float NOT NULL,
  `longitude` float NOT NULL
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=38106 ;

CREATE TABLE IF NOT EXISTS `sf_geo_regions` (
`id` int(11) NOT NULL,
  `name` varchar(255) NOT NULL
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=25 ;

CREATE TABLE IF NOT EXISTS `sf_geo_countries` (
`id` int(11) NOT NULL,
  `code` int(11) NOT NULL,
  `alpha2` varchar(2) NOT NULL,
  `alpha3` varchar(3) NOT NULL,
  `name_en` varchar(45) NOT NULL,
  `name_fr` varchar(45) NOT NULL,
  `is_default` tinyint(1) NOT NULL
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 AUTO_INCREMENT=242 ;

CREATE TABLE IF NOT EXISTS `sf_user_data` (
`id` int(11) NOT NULL,
  `user_id` bigint(20) unsigned NOT NULL,
  `main_activity_type_id` int(11) DEFAULT NULL,
  `main_activity_id` int(11) DEFAULT NULL
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=19001 ;


CREATE TABLE IF NOT EXISTS `sf_activity_types` (
`id` int(11) NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `identifier` varchar(255) COLLATE utf8_unicode_ci NOT NULL COMMENT 'Ne doit pas être modifié, il s agit de la clé dans le tableau de constantes $constants stockant les taux dans l entité Calculator'
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=3 ;

CREATE TABLE IF NOT EXISTS `sf_activities` (
`id` int(11) NOT NULL,
  `name` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
  `asks_for_custom` tinyint(1) NOT NULL DEFAULT '0'
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=27 ;

--
-- Index pour les tables exportées
--

ALTER TABLE `account`
 ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `identifier_UNIQUE` (`identifier`);

ALTER TABLE `customer`
 ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `UNIQ_81398E097D3656A4` (`account`), ADD UNIQUE KEY `UNIQ_81398E09E7927C74` (`email`), ADD UNIQUE KEY `UNIQ_81398E091193CB3F` (`customer_address`), ADD KEY `no_pub` (`no_pub`);

ALTER TABLE `address`
 ADD PRIMARY KEY (`id`), ADD KEY `IDX_D4E6F81F92F3E70` (`country_id`), ADD KEY `zipcode` (`zipcode`); 

ALTER TABLE `sf_geo_cities`
 ADD PRIMARY KEY (`id`), ADD KEY `IDX_B56556A198260155` (`region_id`), ADD KEY `zipcode` (`zipcode`);

ALTER TABLE `sf_geo_regions`
 ADD PRIMARY KEY (`id`);

ALTER TABLE `sf_geo_countries`
 ADD PRIMARY KEY (`id`), ADD KEY `IDX_F86325E277153098` (`code`), ADD KEY `IDX_F86325E2B762D672` (`alpha2`), ADD KEY `IDX_F86325E2C065E6E4` (`alpha3`);

ALTER TABLE `sf_user_data`
 ADD PRIMARY KEY (`id`), ADD UNIQUE KEY `UNIQ_E904BFD1A76ED395` (`user_id`), ADD KEY `IDX_E904BFD12E864BE8` (`main_activity_type_id`), ADD KEY `IDX_E904BFD15543A800` (`main_activity_id`);

ALTER TABLE `sf_activity_types`
 ADD PRIMARY KEY (`id`);

ALTER TABLE `sf_activities`
 ADD PRIMARY KEY (`id`); 

--
-- Contraintes pour les tables exportées
--

ALTER TABLE `customer`
ADD CONSTRAINT `FK_81398E091193CB3F` FOREIGN KEY (`customer_address`) REFERENCES `address` (`id`),
ADD CONSTRAINT `FK_81398E097D3656A4` FOREIGN KEY (`account`) REFERENCES `account` (`id`);

ALTER TABLE `address`
ADD CONSTRAINT `FK_D4E6F81F92F3E70` FOREIGN KEY (`country_id`) REFERENCES `sf_geo_countries` (`id`);

ALTER TABLE `sf_geo_cities`
ADD CONSTRAINT `FK_B56556A198260155` FOREIGN KEY (`region_id`) REFERENCES `sf_geo_regions` (`id`);

ALTER TABLE `sf_user_data`
ADD CONSTRAINT `FK_E904BFD12E864BE8` FOREIGN KEY (`main_activity_type_id`) REFERENCES `sf_activity_types` (`id`),
ADD CONSTRAINT `FK_E904BFD15543A800` FOREIGN KEY (`main_activity_id`) REFERENCES `sf_activities` (`id`),
ADD CONSTRAINT `FK_E904BFD1A76ED395` FOREIGN KEY (`user_id`) REFERENCES `account` (`id`);


INSERT INTO `account` (`id`, `identifier`) VALUES ('1', 'martin@domain.com'), ('2', 'robert@domain.com'), ('3', 'bruce@domain.com');

INSERT INTO `sf_geo_countries` (`id`, `code`, `alpha2`, `alpha3`, `name_en`, `name_fr`, `is_default`) VALUES ('1', '1', 'FR', 'FRA', 'France', 'France', '1');

INSERT INTO `address` (`id`, `city`, `street`, `complement`, `zipcode`, `country_id`, `cedex`, `abroad`) VALUES ('1', 'Paris', '1 rue de Paris', NULL, '75001', '1', '0', '0'), ('2', 'Marseille', '1 rue de Marseille', NULL, '13000', '1', '0', '0');

INSERT INTO `customer` (`id`, `account`, `customer_address`, `email`, `lastname`, `firstname`, `no_pub`) VALUES ('1', '1', '1', 'martin@domain.com', 'Scorcese', 'Martin', '0'), ('2', '2', '2', 'robert@domain.com', 'De Niro', 'Robert', '0'), ('3', '3', NULL, 'bruce@domain.com', 'Willis', 'Bruce', '0');

INSERT INTO `sf_activities` (`id`, `name`, `asks_for_custom`) VALUES ('1', 'Activity #1', '0'), ('2', 'Activity #2', '0');

INSERT INTO `sf_activity_types` (`id`, `name`, `identifier`) VALUES ('1', 'Activity Type #1', 'activity-type-1'), ('2', 'Activity Type #2', 'activity-type-2');

INSERT INTO `sf_geo_regions` (`id`, `name`) VALUES ('1', 'Ile-de-France'), ('2', 'Cote d''Azur');

INSERT INTO `sf_geo_cities` (`id`, `region_id`, `name`, `slug`, `zipcode`, `insee_code`, `latitude`, `longitude`) VALUES ('1', '1', 'Paris', 'paris', '75001', '1', '0', '0'), ('2', '2', 'Marseille', 'marseille', '13000', '2', '0', '0');

INSERT INTO `sf_user_data` (`id`, `user_id`, `main_activity_type_id`, `main_activity_id`) VALUES ('1', '1', '1', '1'), ('2', '3', '2', '2');

With this minimal amount of data, I can run the following query that returns me everything I want. But this query runs for ages in my real database with 300k+ customers and 400k+ addresses

SELECT c.id, c.email, c.firstname, c.lastname, acttypes.name AS activity_type, act.name AS activity, gc.name AS city, gr.name AS region
FROM customer c
JOIN account a ON a.id = c.account
LEFT JOIN sf_user_data ud ON ud.user_id = a.id
LEFT JOIN sf_activity_types acttypes ON acttypes.id = ud.main_activity_type_id
LEFT JOIN sf_activities act ON act.id = ud.main_activity_id
LEFT JOIN address ad ON ad.id = c.customer_address
LEFT JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
LEFT JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE no_pub = 0

UPDATE #3 : tracking joins issue

This query runs fast :

SELECT 
    c.id 
    ,c.email 
    ,c.firstname 
    ,c.lastname 
    ,acttypes.name AS activity_type 
    ,act.name AS activity 
    ,ad.zipcode AS address_zipcode
--    ,gc.name AS city 
--    ,gr.name AS region    
FROM  
    customer c 
JOIN account a ON a.id = c.account
    LEFT JOIN sf_user_data ud ON ud.user_id = a.id
    LEFT JOIN sf_activity_types acttypes ON acttypes.id = ud.main_activity_type_id
    LEFT JOIN sf_activities act ON act.id = ud.main_activity_id
    LEFT JOIN address ad ON ad.id = c.customer_address
--    LEFT JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
--    LEFT JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE 
    no_pub = 0 

This query runs fast too :

SELECT 
    c.id 
    ,c.email 
    ,c.firstname 
    ,c.lastname 
--    ,acttypes.name AS activity_type 
--    ,act.name AS activity 
    ,ad.zipcode AS address_zipcode
    ,gc.name AS city 
--    ,gr.name AS region    
FROM  
    customer c 
JOIN account a ON a.id = c.account
--    LEFT JOIN sf_user_data ud ON ud.user_id = a.id
--    LEFT JOIN sf_activity_types acttypes ON acttypes.id = ud.main_activity_type_id
--    LEFT JOIN sf_activities act ON act.id = ud.main_activity_id
    LEFT JOIN address ad ON ad.id = c.customer_address
    LEFT JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
--    LEFT JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE 
    no_pub = 0 

This query runs forever :

SELECT 
    c.id 
    ,c.email 
    ,c.firstname 
    ,c.lastname 
    ,acttypes.name AS activity_type 
    ,act.name AS activity 
    ,ad.zipcode AS address_zipcode
    ,gc.name AS city 
--    ,gr.name AS region    
FROM  
    customer c 
JOIN account a ON a.id = c.account
    LEFT JOIN sf_user_data ud ON ud.user_id = a.id
    LEFT JOIN sf_activity_types acttypes ON acttypes.id = ud.main_activity_type_id
    LEFT JOIN sf_activities act ON act.id = ud.main_activity_id
    LEFT JOIN address ad ON ad.id = c.customer_address
    LEFT JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
--    LEFT JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE 
    no_pub = 0

Joins on activities are ok, joins on addresses are ok, but trying to use all these joins together doesn't work.

VaN
  • 2,180
  • 4
  • 19
  • 43
  • Have you added indexies to your tables? – Jens Jul 28 '15 at 10:43
  • What indexes do you have? – Kickstart Jul 28 '15 at 10:43
  • Show `create table` syntax. Show output of `explain`. – Ulrich Thomas Gabor Jul 28 '15 at 10:43
  • 1
    there are indexes on customer.customer_address and sf_geo_cities.region_id. Should I add an index on address.zipcode and sf_geo_cities.zipcode ? – VaN Jul 28 '15 at 10:48
  • One on sf_geo_cities.zipcode would likely help (probably along with one on no_pub on the customer table). But you appear to be doing INNER JOINs against the result of a LEFT OUTER JOIN, which is probably negating the LEFT OUTER JOIN. – Kickstart Jul 28 '15 at 10:52
  • ok, I added indexes on address.zipcode, sf_geo_cities.zipcode and customer.no_pub, and made all my joins LEFT JOIN. but my query still runs forever. Is it something normal with 300k+ customers, 400k+ addresses and 90k+ cities ? Any solution to make it work ? – VaN Jul 28 '15 at 11:36
  • Can you do an explain on it and post the results of that? – Kickstart Jul 28 '15 at 12:03
  • @Kickstart I edited the oiriginal post with the EXPLAIN. If I add a LIMIT 30 on the query, it runs really fast. If I ommit the LIMIT, it runs for ages.. – VaN Jul 28 '15 at 12:24
  • The biggy appears to be that it has ignored the index on sf_geo_cities.zipcode . Not sure why, but maybe different collation, or it is a character field on one table and a numeric on the other. – Kickstart Jul 28 '15 at 12:38
  • i just checked. both tables are innoDB, both fields are varchar : one is varchar(5) utf8_general_ci, the other is varchar(16) utf8_bin. – VaN Jul 28 '15 at 16:05
  • Strange. Can you add the table declares to the question and I will try and have a play tomorrow. – Kickstart Jul 28 '15 at 16:44
  • @Kickstart Updated my original post with a database structure and minimal sample data. – VaN Jul 29 '15 at 08:09
  • How many records in `Customer` have `no_pub = 0`? If your are returning a large number of records then that could be causing the delay. The network transfer of a lot of data could be adding to the time – Tobsey Jul 29 '15 at 08:17
  • Looking at that I can't see anything obvious. It might be a problem with your actual data (and it might be worth running ANALYZE TABLE .... on each of your tables) that is confusing MySQL. With the test data 2 of the joins do not use indexes, but the amount of rows they are joining is so small that the indexes would probably be worthless. – Kickstart Jul 29 '15 at 09:48

1 Answers1

1

As mentioned in my comment a very important point is how many records are returned. Too many records in one statement could be causing the problem and you may be better off if you page the results, selecting n records at a time.

To troubleshoot the query, build it up join by join to try and determine when the performance falls of the cliff. Start with:

SELECT 
    c.id 
    ,c.email 
    ,c.firstname 
    ,c.lastname 
--    ,acttypes.name AS activity_type 
--    ,act.name AS activity 
--    ,gc.name AS city 
--    ,gr.name AS region    
FROM  
    customer c 
--    JOIN account a ON a.id = c.account
--    LEFT JOIN sf_user_data ud ON ud.user_id = a.id
--    LEFT JOIN sf_activity_types acttypes ON acttypes.id = ud.main_activity_type_id
--    LEFT JOIN sf_activities act ON act.id = ud.main_activity_id
--    LEFT JOIN address ad ON ad.id = c.customer_address
--    LEFT JOIN sf_geo_cities gc ON gc.zipcode = ad.zipcode
--    LEFT JOIN sf_geo_regions gr ON gr.id = gc.region_id
WHERE 
    no_pub = 0

Then un-comment one join at a time until your query performs very slowly. Analyse that query and see if it can be improved with indexes. Then continue un-commenting the joins until you have your full query.

Tobsey
  • 3,390
  • 14
  • 24
  • Trying to limit the joins seems a worthwhile idea. Hopefully will help track down which join is causing the problems. – Kickstart Jul 29 '15 at 09:48
  • To answer your comment, there are 285972 customers with no_pub = 0. I'll try to do your way by uncommenting joins 1 at a time and track performance issue. Will come back here to post what I found. – VaN Jul 29 '15 at 11:04
  • just updated my original post with UPDATE #3 part. joins on activities are ok, joins on addresses are ok, using them both doesn't work. – VaN Jul 29 '15 at 11:20
  • amazing, but re-running the running-forever query now ,works perfectly.. Does running (forever, had to kill mysql) it the first time updated some kind of indexes hidden table ? and made it run fast on second run ? – VaN Jul 29 '15 at 11:23
  • Does sf_geo_cities have multiple entries for each zip code? Similarly can a customer have multiple activities? If so then the number of returned rows could grow dramatically. – Kickstart Jul 29 '15 at 11:26
  • yes, sf_geo_cities have multiple entries for each zipcode. But a customer have 0 or 1 activity. I just added a GROUP BY c.id at the end of the query. That means I mlay have some innacurate data about the city, but it's not very important, cities with common zipcodes are usually very close to each others. – VaN Jul 29 '15 at 11:31