I need to update about 6M rows in a table that has about 21M rows, and the update takes about 1 hour.
This is part of a data migration and I have full control over the database. Nobody else is using it, and I can stress it as much as I wish; I want to optimize purely for execution time.
The database is MySQL 5.7 on RDS, running on a multicore instance.
The query is quite simple:
UPDATE invoice
INNER JOIN first_deliveries
        ON invoice.job_id = first_deliveries.job_id
SET invoiceable_id = first_deliveries.id,
    invoiceable_type = 'Pooling::Delivery';
The EXPLAIN output is:
mysql> explain update invoice inner join first_deliveries on invoice.job_id = first_deliveries.job_id set invoiceable_id = first_deliveries.id, invoiceable_type = 'Pooling::Delivery';
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
| 1 | SIMPLE | first_deliveries | NULL | index | idx_first_deliveries_job_id | idx_first_deliveries_job_id | 4 | NULL | 3363988 | 100.00 | Using index |
| 1 | UPDATE | invoice | NULL | ref | index_invoice_on_job_id_and_type | index_invoice_on_job_id_and_type | 5 | stuart.first_deliveries.job_id | 1 | 100.00 | NULL |
+----+-------------+------------------+------------+-------+----------------------------------+----------------------------------+---------+--------------------------------+---------+----------+-------------+
2 rows in set (0.06 sec)
and the table definitions are:
CREATE TABLE `first_deliveries` (
`id` int(11) NOT NULL,
`job_id` int(11) NOT NULL,
`origin_place_id` int(11) DEFAULT NULL,
`original_eta_to_origin` datetime DEFAULT NULL,
`eta_to_origin` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_first_deliveries_origin_place_id` (`origin_place_id`),
KEY `idx_first_deliveries_job_id` (`job_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
and
CREATE TABLE `invoice` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`base_amount_cents` int(11) NOT NULL,
`job_id` int(11) DEFAULT NULL,
`invoiceable_id` int(11) DEFAULT NULL,
`invoiceable_type` varchar(191) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
-- ... 60 other columns
PRIMARY KEY (`id`),
KEY `index_invoice_on_job_id_and_type` (`job_id`,`type`),
KEY `index_invoice_on_invoiceable_type_and_invoiceable_id` (`invoiceable_type`,`invoiceable_id`),
KEY `index_invoice_on_type_and_invoiceable_type_and_invoiceable_id` (`type`,`invoiceable_type`,`invoiceable_id`),
-- ... a bunch of other indexes for other columns
) ENGINE=InnoDB AUTO_INCREMENT=... DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
I would expect a single SQL statement to be the most efficient approach, because it is the most concise way to tell the server what I want, and I would expect the server to execute it in the most performant way possible. But maybe the server has to make compromises, and I could help speed it up somehow?
I also tried partitioning the IDs into 10 disjoint sets and issuing 10 disjoint UPDATEs from 10 threads, roughly as sketched below, but I got lock wait timeouts. Why did I get lock contention for disjoint row sets?
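Each of the 10 statements had roughly this shape (sketched here with a modulo partition on first_deliveries.id; that particular split is just for illustration, any 10-way disjoint split of the IDs behaves the same way):

UPDATE invoice
INNER JOIN first_deliveries
        ON invoice.job_id = first_deliveries.job_id
SET invoiceable_id = first_deliveries.id,
    invoiceable_type = 'Pooling::Delivery'
WHERE first_deliveries.id % 10 = 0; -- thread N uses = N, for N in 0..9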
Any other technique?