3

I am trying to number some records in MySQL (5.5.44-0 on Ubuntu), grouped by another column (you will see what I mean below). I am adapting the solution described at Running Sums for Multiple Categories in MySQL, except I'm just numbering, not summing.

The tables involved are quite huge with almost 100 columns, so let's simplify the demonstration first by creating derived tables with only the important columns involved. Apologies for not sharing a SQL Fiddle, because it doesn't look like it's replicable unless done with big volume of data, which I cannot share:

Creating the tables:

CREATE TABLE `inquiries_test` (
  `id` int(11) NOT NULL DEFAULT '0',
  `motive` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (`inquiry_id`),
  KEY `motive` (`motive`)
);

insert into inquires_test select id, motive from inquiries;

CREATE TABLE `leads_test` (
  `lead_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `inquiry_id` int(11) DEFAULT NULL,
  KEY `id` (`lead_id`)
);

insert into leads_test select lead_id, created_at, inquiry_id;

CREATE TABLE `lead_inserts` (
  `lead_id` int(11) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `cnt` int(11) DEFAULT NULL
);

You will notice above that data from inquiries_test and leads_test came from the actual production tables. The importance of that will come to play later. Now populating lead_inserts:

playground>insert into lead_inserts (cnt, created_at, lead_id) 
    -> SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
    -> , l.created_at 
    -> , @id := l.lead_id as local_resouce_id
    -> FROM leads_test l join inquiries_test i on (l.inquiry_id=i.id)
    -> CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
    -> where i.motive='real' ORDER BY lead_id, created_at;
Query OK, 2172774 rows affected (14.30 sec)
Records: 2172774  Duplicates: 0  Warnings: 0

playground>select * from lead_inserts where lead_id in (117,118);
+---------+---------------------+------+
| lead_id | created_at          | cnt  |
+---------+---------------------+------+
|     117 | 2012-06-23 00:13:09 |    1 |
|     117 | 2014-09-14 04:30:37 |    2 |
|     117 | 2015-01-27 22:34:41 |    3 |
|     117 | 2015-03-19 19:33:51 |    4 |
|     118 | 2014-12-24 17:47:15 |    1 |
|     118 | 2015-01-23 21:30:09 |    2 |
|     118 | 2015-04-07 21:33:43 |    3 |
|     118 | 2015-04-10 17:00:04 |    4 |
|     118 | 2015-05-12 21:59:49 |    5 |
+---------+---------------------+------+

So far so good - the value of cnt "resets" for every new lead_id. Now given that leads_test and inquiries_tests are basically leads and inquiries with the other columns removed, it makes sense to expect that if I modify the insert statement to use the original tables, the result should be the same, right? But look:

playground>truncate table lead_inserts;
Query OK, 0 rows affected (0.14 sec)

playground>insert into lead_inserts (cnt, created_at, lead_id) 
    -> SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
    -> , l.created_at 
    -> , @id := l.lead_id as local_resouce_id
    -> FROM leads l join inquiries i on (l.inquiry_id=i.id)        
    -> CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
    -> where i.motive='real' ORDER BY lead_id, created_at;
Query OK, 2172774 rows affected (17.25 sec)
Records: 2172774  Duplicates: 0  Warnings: 0

playground>select * from lead_inserts where lead_id in (117,118);
+---------+---------------------+------+
| lead_id | created_at          | cnt  |
+---------+---------------------+------+
|     117 | 2012-06-23 00:13:09 |    1 |
|     117 | 2014-09-14 04:30:37 |    1 |
|     117 | 2015-01-27 22:34:41 |    1 |
|     117 | 2015-03-19 19:33:51 |    1 |
|     118 | 2014-12-24 17:47:15 |    1 |
|     118 | 2015-01-23 21:30:09 |    1 |
|     118 | 2015-04-07 21:33:43 |    1 |
|     118 | 2015-04-10 17:00:04 |    1 |
|     118 | 2015-05-12 21:59:49 |    1 |
+---------+---------------------+------+

What happened to the numbering? Other observations while using the original tables:

  1. If I do not process all records and specify only a few lead_id's, the computation comes out right.
  2. If I remove the INSERT clause and just run it as a select (with a LIMIT clause to just show 50 rows output), the computation comes out right.

So, is this a bug I hit, or am I missing something? In real life I cannot use the process above as a workaround - I really have to use leads and inquiries because there are other columns from those tables that have to be part of lead_inserts.

Thanks!

Community
  • 1
  • 1
Johann Tagle
  • 106
  • 6
  • 1
    What I think is happening is MySQL optimizes your query. It sees no reason to ORDER BY your records as there is no AUTO-INCREMENT column in the destination table. There is also no clustered indexes. MySQL cannot analyse your counter variables and therefore believes that the ORDER BY is not required. I think you either need to use a subquery: `insert into lead_inserts (cnt, created_at, lead_id ) SELECT * FROM (SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt ....) as a` or add a auto-increment column to `lead_inserts` – cha Aug 24 '15 at 06:20
  • Neither one worked :( and also, unless I'm missing something, your proposed solutions don't explain why my statements work for the derived test tables and not the live ones. But thanks for trying to help! – Johann Tagle Aug 24 '15 at 06:56
  • You are right - it's an optimization thing. Adding a FORCE INDEX FOR ORDER BY clause seems to have done the trick. Though your proposed solutions didn't work, the initial comment about optimization pointed me to the right direction. Thanks! – Johann Tagle Aug 24 '15 at 07:12

1 Answers1

0

A Cha pointed out, it looks like it's a MySQL optimization thing where MySQL finds no reason to do an ORDER BY when the end result will just be inserted to a new table. Why it works for the test tables and not for the production ones, when they have the same number of rows, I don't know. But this is how I forced it to sort what will be inserted:

First make sure that there is a concatenated index for the columns I will order by:

CREATE INDEX idx_leads_lead_id_created ON leads(lead_id, created_at);

Then force MySQL to use this index:

insert into lead_inserts (cnt, created_at, lead_id) 
SELECT @cnt := if(@id = l.lead_id,@cnt,0) + 1 as cnt 
, l.created_at 
@id := l.lead_id as local_resouce_id
FROM leads l FORCE INDEX FOR ORDER BY (idx_leads_lead_id_created)
JOIN inquiries i on (l.inquiry_id=i.id)        
CROSS JOIN (select @id := 0, @cnt := 0) as InitVarsAlias 
WHERE i.motive='real' 
ORDER BY lead_id, created_at;
Johann Tagle
  • 106
  • 6