
I installed MySQL 8.0.25 on Ubuntu 20.04, running on a c5.2xlarge EC2 instance.

Then I ran a script that fills 10 tables with data. The test took exactly 2 hours, during which it created 123146.5MB of data.

That means that on average, 17.1MB/s were written to the database. However, atop reports something odd: while it shows disk activity around 18-19MB/s, it also shows that the mysqld process wrote 864MB in a 10-second sample - which translates to 86.4MB/s, about 5 times the amount of data actually committed to the database.

Why such a discrepancy?

iotop also consistently shows MySQL writing at about 5x the committed rate.

Same for pidstat.

I also tried pt-diskstats from the Percona Toolkit, but it didn't show anything...

I also reproduced the issue on RDS. In both cases (EC2 and RDS), the CloudWatch statistics also show 5x writes...

The database has 10 tables that were filled. 5 of them have this definition:

CREATE TABLE `shark` (
  `height` int DEFAULT NULL,
  `weight` int DEFAULT NULL,
  `name` mediumtext,
  `shark_id` bigint NOT NULL,
  `owner_id` bigint DEFAULT NULL,
  PRIMARY KEY (`shark_id`),
  KEY `owner_id` (`owner_id`),
  CONSTRAINT `shark_ibfk_1` FOREIGN KEY (`owner_id`) REFERENCES `shark_owners` (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 

The other 5 tables have this definition:

CREATE TABLE `shark_owners` (
  `name` mediumtext,
  `owner_id` bigint NOT NULL,
  PRIMARY KEY (`owner_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

I could understand if the difference were about 2x - data is first written to a transaction log and then committed to the database - but 5x? Is this normal behavior for MySQL, or is something in my tables triggering it? And why are there so many "cancelled writes" - about 12%?

Gromit
  • How much memory did you allocate to MySQL? Is it swapping to disk? – doublesharp Jun 18 '21 at 16:16
  • We need to see your script that performs the insertions. What is it doing _exactly_ for those 2 hours? Also, I don't think your idea of extrapolating averages like that is statistically sound - computer disk IO writes are rarely uniform. – Dai Jun 18 '21 at 16:17
  • I would expect additional writes to the disk for indexes too. – Marc Baumbach Jun 18 '21 at 16:18
  • @MarcBaumbach Good catch - [MySQL implicitly creates indexes for all foreign-keys](https://dev.mysql.com/doc/refman/8.0/en/constraint-foreign-key.html) - so if there's 5 tables with foreign-key references to `shark_owners` then that's 5 FK indexes and 5 PK indexes to maintain. – Dai Jun 18 '21 at 16:19
  • @doublesharp - the EC2 and RDS instances have 16GB RAM. On the EC2 instance, "top" reports that over 90% of the memory is used for "buff/cache". – Gromit Jun 18 '21 at 16:21
  • @MarcBaumbach and Dai - how many additional writes would indexes create? Could it be the extra 300%? – Gromit Jun 18 '21 at 16:24
  • Not just indexes, but `mediumtext` itself only gets the first 768 bytes stored inline (as well as in the b-tree index, if that's in play); everything else is shoved into overflow pages with pointers out to the overflow, and that's just the tip of the iceberg regarding database storage methodology. You give data to a database and its whole job is to make that data retrievable quickly, and that means it's not going to be 1:1. – JNevill Jun 18 '21 at 16:25
  • @Dai - the script is only doing inserts, committing every 50 inserts. As for extrapolating averages, the CloudWatch metrics (which are measured over a 1-minute period) showed very low deviation between the measurements, typically within 1-3%. – Gromit Jun 18 '21 at 16:30
  • If the 300% extra writes are caused by the table's structure, can someone suggest a table definition that wouldn't result in more than twice (log + DB) the data being written? – Gromit Jun 18 '21 at 16:45
  • @JNevill - even if mediumtext is not stored inline - why would the same data be written 5 times instead of 2 (log + DB)? Does MySQL duplicate data in the overflow pages? – Gromit Jun 18 '21 at 17:03
  • @Dai - I have re-created the tables without the foreign key - same behavior, 5x writes – Gromit Jun 18 '21 at 17:15
  • Additional writes could be caused by the [binary log](https://dev.mysql.com/doc/refman/8.0/en/binary-log.html), the [doublewrite buffer](https://dev.mysql.com/doc/refman/8.0/en/innodb-doublewrite-buffer.html), or the different [query logs](https://dev.mysql.com/doc/refman/8.0/en/server-logs.html). – Bill Karwin Jun 18 '21 at 17:25
  • @BillKarwin - Thank you! I was unable to disable the binary log in the cnf, but I did disable the doublewrite buffer. The test completed in 30 minutes instead of 2 hours! 123098.9MB were written to the DB, for an average of 68.4MB/s! Now the disk writes were steady at 250MB/s - limited by the bandwidth of the GP2 volume. So the ratio of disk writes to DB data committed is now much lower - ~3.66. How can I disable the binary log permanently in v8? Please post your previous suggestion as an answer, so we can discuss it more. – Gromit Jun 18 '21 at 19:07

2 Answers

  • LOAD DATA runs very fast, with minimal I/O
  • Bulk INSERT with at least 100 rows per query runs 10 times as fast as single-row INSERTs (see the sketch after this list).
  • autocommit causes at least one extra I/O after each SQL statement (for transactional integrity).
  • 50 single-row INSERTs followed by a COMMIT is something of a compromise.
  • FOREIGN KEY requires checking the other table.
  • If innodb_buffer_pool_size is too small, there will be disk churn.
  • owner_id is a "secondary index". Maintaining it is semi-optimized (via InnoDB's change buffer), but may involve both reads and writes, depending on a variety of things.
  • The tables would be smaller if you could use smaller datatypes. (E.g., BIGINT takes 8 bytes and is usually overkill.) Smaller would lead to less I/O.
  • How big is name? What ROW_FORMAT is used? They conspire to lead to more or less "off-record" storage, hence disk I/O.
  • Were you using multiple threads when doing the Inserts?

In other words, a lot more details are needed in order to analyze your problem.
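
To illustrate the bulk-INSERT bullet above: a multi-row INSERT pushes many rows through one statement (and, inside a transaction, one commit), so the per-statement and per-commit I/O is amortized. A minimal sketch against the question's shark table - the values are invented, and the owner_id values are assumed to already exist in shark_owners (or the FK to have been dropped, as the asker later did):

-- 100 separate statements like this pay the per-statement cost 100 times:
INSERT INTO shark (shark_id, owner_id, height, weight, name) VALUES (1, 1, 120, 250, 'Bruce');

-- One multi-row statement pays it once:
INSERT INTO shark (shark_id, owner_id, height, weight, name) VALUES
  (2, 1, 95, 180, 'Anchor'),
  (3, 2, 110, 210, 'Chum'),
  (4, 2, 130, 270, 'Lenny');
-- ...continue up to a few hundred rows per statement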

Rick James
  • I recreated the tables without the foreign key - same issue. "innodb_buffer_pool_size" is 134217728. "Name" is 32k characters. "ROW_FORMAT" is "dynamic". 5 MYSQL connections were used to fill the data - each connection filling 2 DBs. – Gromit Jun 18 '21 at 19:10
  • 128M is an old, terribly low, default. Set it to about 70% of _available_ RAM. Was the data read from disk? (That would count as more I/O.) – Rick James Jun 18 '21 at 21:14
  • There are no reads, just writes. I will set it to 10GB and retest. – Gromit Jun 18 '21 at 22:45

MySQL writes data several times when you use InnoDB tables. Mostly this is worth it to prevent data loss or corruption, but if you need greater throughput you may need to reduce the durability.

If you don't need durability at all, another solution is to use the MEMORY storage engine. That would eliminate all writes except the binary log and query logs.

You already mentioned the InnoDB redo log (aka transaction log). This cannot be disabled, but you can reduce the number of file sync operations. Read https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit for details.

innodb_flush_log_at_trx_commit = 0
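
This variable is dynamic, so as a sketch (assuming an account with the SUPER or SYSTEM_VARIABLES_ADMIN privilege) you can also change it at runtime instead of editing the config file and restarting:

SET GLOBAL innodb_flush_log_at_trx_commit = 0;
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';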

You might reduce the number of page flushes, or help MySQL consolidate page flushes, by increasing the RAM allocated to the InnoDB buffer pool. Do not overallocate this, because other processes need RAM too.

innodb_buffer_pool_size = XXX
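
As a rough sketch for a dedicated 16 GB instance like the one in the question (a common rule of thumb is 50-70% of available RAM on a machine that only runs MySQL):

innodb_buffer_pool_size = 11G

In MySQL 8.0 this variable can also be resized online with SET GLOBAL innodb_buffer_pool_size = ...; the resize proceeds in chunks in the background.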

The binary log is a record of all committed changes. You can reduce the number of file syncs. See https://www.percona.com/blog/2018/05/04/how-binary-logs-and-filesystems-affect-mysql-performance/ for a description of how this impacts performance.

sync_binlog = 0

You can also disable the binary log completely, if you don't care about replication or point-in-time recovery. Turn off the binary log by commenting out the directive:

# log_bin

Or in MySQL 8.0, they finally have a directive to explicitly disable it:

skip_log_bin

Or

disable_log_bin

See https://dev.mysql.com/doc/refman/8.0/en/replication-options-binary-log.html#option_mysqld_log-bin for details.
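
Whichever directive you use, it only takes effect at server startup, so after restarting it's worth confirming that binary logging is actually off:

SHOW VARIABLES LIKE 'log_bin';
-- Expect: log_bin | OFF

A later log_bin line elsewhere in the config file can silently re-enable it.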

The doublewrite buffer is used to protect against database corruption if your MySQL Server crashes during a page write. Think twice before disabling this, but it can give you some performance boost if you disable it. See https://www.percona.com/blog/2006/08/04/innodb-double-write/ for discussion.

innodb_doublewrite = 0

MySQL also has two query logs: the general query log and the slow query log. Either of these causes some overhead, so disable the query logs if you need top performance. https://www.percona.com/blog/2009/02/10/impact-of-logging-on-mysql’s-performance/

There are ways to keep the slow query log enabled but only if queries take longer than N seconds. This reduces the overhead, but still allows you to keep a log of the slowest queries you may want to know about.

long_query_time = 10
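
Putting the query-log settings together, a minimal my.cnf sketch (standard MySQL 8.0 variable names; the 10-second threshold is just the example above):

general_log = 0
slow_query_log = 1
long_query_time = 10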

Another strategy is to forget about optimizing the number of writes, and just let them happen. But use faster storage. In an AWS environment, this means using instance storage instead of EBS storage. This comes with the risk that the whole database may be lost if the instance is terminated, so you should maintain good backups or replicas.

Bill Karwin
  • I was unable to disable the binary log in the cnf - commenting "log_bin" out didn't disable it - but I did disable the doublewrite buffer. The test completed in 30 minutes instead of 2 hours! 123098.9MB were written to the DB, for an average of 68.4MB/s! Now the disk writes were steady at 250MB/s - limited by the bandwidth of the GP2 volume. So the ratio of disk writes to DB data committed is now much lower - ~3.66. – Gromit Jun 19 '21 at 00:43
  • Now I have set some other suggested variables: `innodb_doublewrite` is OFF, `innodb_buffer_pool_size` is 13958643712, `sync_binlog` is 0, `innodb_flush_log_at_trx_commit` is 0. Something weird happened - the test completed even faster - 27m vs 30m, but only 62800MB were written, instead of 123098.9MB - about half as much. Also, while total disk IO was 250MB/s, after the first couple of minutes a lot of reads started to happen, sometimes exceeding writes. – Gromit Jun 19 '21 at 00:44
  • You can clearly see in the CloudWatch statistics of the previous run that writes were ~100% - [https://i.imgur.com/jzcQpWX.jpeg](https://i.imgur.com/jzcQpWX.jpeg) Now it's much different - [https://i.imgur.com/wZYf4GF.jpeg](https://i.imgur.com/wZYf4GF.jpeg) – Gromit Jun 19 '21 at 00:45
  • Remember that a line for `log_bin` may occur more than once in your config file, and the latter occurrence takes priority. – Bill Karwin Jun 19 '21 at 01:21
  • There is only 1 `log_bin` - [https://justpaste.it/8w305](https://justpaste.it/8w305) – Gromit Jun 19 '21 at 10:14
  • I have retested with `innodb_doublewrite=OFF` and storage provisioned for 500MB/s, to avoid the 250MB/s cap. Now the test has completed in 20 minutes - 6 times faster than the initial test! Thank you! – Gromit Jun 22 '21 at 01:41
  • Why would the doublewrite make such a big difference? In [your article](https://www.percona.com/blog/2006/08/04/innodb-double-write/) you estimated that it should only have a 5-10% impact. – Gromit Jun 22 '21 at 01:42
  • I suppose it depends on the rate of page flushes. If you're doing this during an aggressive bulk data load, you're probably forcing through a lot more page flushes than a typical OLTP application. So the difference in performance between doublewrite on and doublewrite off is counted more frequently. – Bill Karwin Jun 22 '21 at 02:17