
I need to update about 6M rows in a table that has about 21M rows, and that is taking about 1 hour.

This is part of a data migration and I have full control over the database. Nobody else is using it, and I can stress it as much as I wish; I want to optimize for execution time.

The database is MySQL 5.7 on RDS, running on multicore machines.

The query is quite simple:

UPDATE invoice
    INNER JOIN first_deliveries
        ON invoice.job_id = first_deliveries.job_id
SET
    invoiceable_id = first_deliveries.id,
    invoiceable_type = 'Pooling::Delivery';

EXPLAIN is

mysql> explain update invoice inner join first_deliveries on invoice.job_id = first_deliveries.job_id set invoiceable_id = first_deliveries.id, invoiceable_type = 'Pooling::Delivery';
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
| id | select_type | table            | partitions | type  | possible_keys                    | key                              | key_len | ref                            | rows    | filtered | Extra       |
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
|  1 | SIMPLE      | first_deliveries | NULL       | index | idx_first_deliveries_job_id      | idx_first_deliveries_job_id      | 4       | NULL                           | 3363988 |   100.00 | Using index |
|  1 | UPDATE      | invoice          | NULL       | ref   | index_invoice_on_job_id_and_type | index_invoice_on_job_id_and_type | 5       | stuart.first_deliveries.job_id |       1 |   100.00 | NULL        |
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
2 rows in set (0.06 sec)

and the table definitions are:

CREATE TABLE `first_deliveries` (
`id` int(11) NOT NULL,
`job_id` int(11) NOT NULL,
`origin_place_id` int(11) DEFAULT NULL,
`original_eta_to_origin` datetime DEFAULT NULL,
`eta_to_origin` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_first_deliveries_origin_place_id` (`origin_place_id`),
KEY `idx_first_deliveries_job_id` (`job_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

and

CREATE TABLE `invoice` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_amount_cents` int(11) NOT NULL,
`job_id` int(11) DEFAULT NULL,
`invoiceable_id` int(11) DEFAULT NULL,
`invoiceable_type` varchar(191) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
-- ... 60 other columns
PRIMARY KEY (`id`),
KEY `index_invoice_on_job_id_and_type` (`job_id`,`type`),
KEY `index_invoice_on_invoiceable_type_and_invoiceable_id` (`invoiceable_type`,`invoiceable_id`),
KEY `index_invoice_on_type_and_invoiceable_type_and_invoiceable_id` (`type`,`invoiceable_type`,`invoiceable_id`),
-- ... a bunch of other indexes for other columns
) ENGINE=InnoDB AUTO_INCREMENT=... DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

I would expect a single SQL statement to be the most efficient approach, because it is the most concise way to tell the server what I want, and I expect the server to do it in the most performant way possible. But maybe it has to make compromises, and perhaps I can help speed it up somehow?

I tried partitioning the IDs into 10 disjoint sets and issuing 10 disjoint UPDATEs from 10 threads, but I got lock timeouts. Why did I get lock issues for disjoint row sets?

Any other technique?

Xavier Noria
  • You might want to consider SET GLOBAL innodb_change_buffer_max_size=50; prior to your massive update, and then change it back to 25 or whatever you typically use on a normal day. SELECT @@innodb_change_buffer_max_size; will tell you the value before or after you change it. A change buffer of 50 will have a positive impact on a long-running update. – Wilson Hauck Aug 16 '19 at 22:12

3 Answers


There are only two viable execution plans for your update: either it goes through first_deliveries and updates the corresponding invoice, or it goes through invoice and checks if there is a corresponding entry in first_deliveries.

Which one is faster depends on the numbers. If first_deliveries had only one entry, the first path would clearly be faster; if you needed to update every invoice, the second path would be significantly faster (less obviously so, for technical reasons). MySQL decided on the first one because it expects about 3 million updates out of 21 million rows.

This may or may not be the correct decision, but you can test it by forcing MySQL onto the other path with STRAIGHT_JOIN:

UPDATE invoice
STRAIGHT_JOIN first_deliveries
ON invoice.job_id = first_deliveries.job_id
SET invoiceable_id = first_deliveries.id,
    invoiceable_type = 'Pooling::Delivery';

which should give you an execution plan similar to

+-----+--------------+-------------------+-------+------+------------------------------+-----+
| id  | select_type  |      table        | type  | ...  |             key              | ... |
+-----+--------------+-------------------+-------+------+------------------------------+-----+
|  1  | UPDATE       | invoice           | ALL   | ...  |                              | ... |
|  1  | SIMPLE       | first_deliveries  | ref   | ...  | idx_first_deliveries_job_id  | ... |
+-----+--------------+-------------------+-------+------+------------------------------+-----+

My guess is that this could be faster, but you will only know if you run it with your data.

Using STRAIGHT_JOIN prevents MySQL from choosing any other, potentially better execution plan (e.g. when you run this query regularly, with potentially smaller tables). Since this seems to be a one-time migration, you probably don't have to worry about other scenarios, though.

Apart from that, a significant bottleneck is the update itself, i.e. writing to disk, no matter which way MySQL takes to find the rows to update. The second execution plan should be a bit faster in that regard, but overall this may be what limits your execution speed. You can try increasing your buffers to reduce disk writes (and you should revert the changes afterwards).
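
For instance, these are the settings that typically matter most for bulk writes; the suggested values are only a rough assumption for a disposable migration window, and on RDS they are usually changed through the DB parameter group rather than with SET GLOBAL:

-- Inspect the current write-related settings.
SELECT @@innodb_buffer_pool_size,
       @@innodb_log_file_size,
       @@innodb_flush_log_at_trx_commit,
       @@sync_binlog;

-- Rough values for the duration of the migration (revert afterwards):
--   innodb_buffer_pool_size        -> most of the available RAM
--   innodb_log_file_size           -> 1-2 GB, so checkpoints happen less often
--   innodb_flush_log_at_trx_commit -> 2 (fsync the redo log ~once per second)
--   sync_binlog                    -> 0, or disable binary logging if unused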

For a similar reason, splitting the job into several parallel tasks usually isn't useful, as the bottleneck will still be disk access. A different approach is to split the work into smaller batches that you run one after the other (e.g. invoice id 1 to 100k, then 100k to 200k, ...), as, among other things, the logs stay smaller. Since this is basically linear, it also lets you extrapolate how long 21 million rows will take after running the first 5 batches or so.
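
A minimal sketch of such batches, assuming invoice.id is the chunking column and 100k is the step (both are placeholders):

-- Batch 1: the first 100k invoice ids.
UPDATE invoice
    INNER JOIN first_deliveries
        ON invoice.job_id = first_deliveries.job_id
SET
    invoiceable_id = first_deliveries.id,
    invoiceable_type = 'Pooling::Delivery'
WHERE invoice.id BETWEEN 1 AND 100000;

-- Batch 2: ids 100001 to 200000, and so on up to MAX(invoice.id).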

Solarflare

Is this a one-time task? By the time we can provide a faster way to do it, the original slow way will have finished. But I will discuss it anyway.

Updating 6M rows in a single transaction is costly because of the need to build (and eventually toss) the undo information. Breaking the task into smaller transactions (1K rows each) may be noticeably more efficient.

Chunk by the PK (a sketch of one way to do that is below).

Multi-threading should help some, but don't have more threads than CPUs. Even at that point, you will probably be I/O-bound.

21M rows with invoiceable_type varchar(191) COLLATE utf8mb4_unicode_ci take a lot of space. That column (and probably others) should be 'normalized'. That would decrease the average from roughly 17 bytes to 1, which adds up to a few GB of disk space.
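
A minimal sketch of what that normalization could look like; the lookup table and the new column are hypothetical, and the real migration would also have to backfill the new column and adjust the indexes:

-- Hypothetical lookup table holding each distinct type string once.
CREATE TABLE invoiceable_types (
    id   TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
    name VARCHAR(191) COLLATE utf8mb4_unicode_ci NOT NULL,
    PRIMARY KEY (id),
    UNIQUE KEY idx_invoiceable_types_name (name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

INSERT INTO invoiceable_types (name)
SELECT DISTINCT invoiceable_type
FROM invoice
WHERE invoiceable_type IS NOT NULL;

-- invoice would then store a 1-byte id instead of the repeated string.
ALTER TABLE invoice ADD COLUMN invoiceable_type_id TINYINT UNSIGNED DEFAULT NULL;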

The locks you got may come from locks at the boundaries between otherwise disjoint sets. Also, using only 10 sets means the Optimizer may have chosen table scans rather than the index. I would go with at least 30 sets. I suggest chunking at 1K rows; anything beyond that is mostly diminishing returns.
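
Something along these lines, sketched as a stored procedure; the procedure name and the 1000-id step are only illustrative, and since it walks invoice.id in fixed windows, sparse id ranges simply produce smaller chunks:

DELIMITER //
CREATE PROCEDURE migrate_invoiceable_in_chunks()
BEGIN
    DECLARE last_id INT DEFAULT 0;
    DECLARE max_id  INT;

    SELECT MAX(id) INTO max_id FROM invoice;

    WHILE last_id < max_id DO
        -- With autocommit on, each UPDATE is its own small transaction,
        -- so undo stays small and row locks are released quickly.
        UPDATE invoice
            INNER JOIN first_deliveries
                ON invoice.job_id = first_deliveries.job_id
        SET
            invoiceable_id = first_deliveries.id,
            invoiceable_type = 'Pooling::Delivery'
        WHERE invoice.id > last_id
          AND invoice.id <= last_id + 1000;

        SET last_id = last_id + 1000;
    END WHILE;
END//
DELIMITER ;

CALL migrate_invoiceable_in_chunks();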

Rick James

@WilsonHauck This project has been going through a lot of optimizations. There are several tables to be migrated and many more millions of records. I have experimented with many things like different buffer sizes, using MEMORY engines, etc. I have benchmarks in place, and those changes didn't help for my use case.

@RickJames This particular UPDATE was the last statement that resisted optimization. Indeed, working on small ranges has been the key: I got a 5x speed-up in my test environment for this particular statement. The compromise has been ranges of 5K and a pool of 20 threads (there are other threads doing other work in parallel). The test machine has 8 cores, but the production machine has 48, so I expect the speed-up to be even greater.

I would like to understand the lock errors I was getting when the ranges were on the order of hundreds of thousands (that is, to actually know what they were rather than conjecture, and so to understand why they do not appear with small ranges), and also to understand why I need to hand-code a more performant version of the update.

But that is just to better understand the details; this 5x speed-up is incredible and enough for my purposes.

BTW, I believe an I/O-bound task can precisely use more threads than cores, because the CPU wait times of one thread can be used by the others. It is with CPU-bound tasks that you won't squeeze out more performance with more threads.

@Solarflare Since the multithreaded approach is what I was looking for, I didn't experiment with STRAIGHT_JOIN, but in the new approach the cardinalities are reversed and MySQL starts with invoice now. Perhaps we got an extra boost also from starting there, as per your remark.

Xavier Noria
  • Additional information request. Post on pastebin.com and share the links. Text results of: B) SHOW GLOBAL STATUS; after minimum 24 hours uptime, C) SHOW GLOBAL VARIABLES;, D) SHOW FULL PROCESSLIST;, E) complete MySQLTuner report. Optional but very helpful information, if available: htop OR top OR mytop for the most active apps, ulimit -a for a linux/unix list of limits, iostat -xm 5 3 for IOPS by device and core/cpu count, free -h for linux/unix Total Used Free Mem: and Swap:, for server workload tuning analysis to find MORE speed for you. – Wilson Hauck Aug 20 '19 at 12:30