
I've been looking at this for a while now and things aren't lining up with my expectations, but I don't know if it's because something is off, or if my expectations are wrong.

So, I've got a system with over 100GB of RAM, and I've set my dirty_background_bytes to 9663676416 (9 GiB) and dirty_bytes to twice that (19327352832, or 18 GiB).

In my mind, this should let me write up to 9GB into a file that just sits in memory without ever needing to hit disk. My dirty_expire_centisecs is the default of 3000 (30 seconds).
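For reference, these knobs can be read straight out of /proc/sys. This is a minimal sketch; the values in the comments mirror the setup described above, and the defaults on another system may differ:

```shell
# Read the writeback tunables (annotated values mirror this question's setup).
cat /proc/sys/vm/dirty_background_bytes   # 9663676416 -> background writeback starts at 9 GiB dirty
cat /proc/sys/vm/dirty_bytes              # 19327352832 -> writers are throttled at 18 GiB dirty
cat /proc/sys/vm/dirty_expire_centisecs   # 3000 -> dirty data older than 30 s becomes eligible for writeout
```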

So when I run:

# dd if=/dev/zero of=/data/disk_test bs=1M count=2000

and, alongside it, run:

# while sleep 5; do egrep 'Dirty|Writeback' /proc/meminfo | awk '{print $2;}' | xargs; done

(printing Dirty, Writeback, and WritebackTmp from /proc/meminfo, in kB, at 5-second intervals)

I would have expected to see it dump 2GB into the page cache, sit there for 30 seconds, and then start writing the data out to disk (since it never went above the 9GB dirty_background_bytes threshold).

Instead what I saw was:

3716 0 0
4948 0 0
3536 0 0
1801912 18492 0
558664 31860 0
7244 0 0
8404 0 0

As soon as the dirty count jumped, writeback was already underway, and it continued until we were back down to where we started.

What I'm actually trying to determine is whether my process's bottleneck is disk IO or some other factor, but along the way this behaviour confused me. I figure that as long as the process stays within the buffer zone, disk write performance shouldn't really matter, since it should just be dumping to memory.

So, am I misunderstanding the way these features are supposed to work, or is something strange going on?

psycotica0
  • This can be a side-effect of your `dd` command unlinking and creating a new `disk_test` at each iteration. Try to first create a target file with a *single* `dd if=/dev/zero of=/data/disk_test bs=1M count=2000` command, then run your loop with `dd if=/dev/zero of=/data/disk_test bs=1M count=2000 conv=notrunc,nocreat` command. – shodanshok Feb 07 '18 at 18:58
  • Hmm. That still wasn't 100% in line with my expectations, but it does seem a bit closer... I got `3744, 2052996, 2053988, 1932948, 771472, 5532` one of the times I ran it, which had periods where it just held some data (and the data rate reported by `dd` was much higher). So it still didn't last for 30 seconds, but why would it make a difference? – psycotica0 Feb 07 '18 at 20:20
  • When I wrote 4GB instead of 2GB, and then wrote 2 or 3 GB after that with `notrunc`, it did what I expected: it held the contents of the files fully in the dirty pages for 30 seconds before writing them out. Is there something that behaves differently when adding to the end of a file? – psycotica0 Feb 07 '18 at 20:34
  • `notrunc` made a difference because, in the past, a heuristic was added to keep applications that do replace-via-rename or replace-via-truncate and then crash immediately afterwards from corrupting their data. This heuristic basically force-flushes data belonging to open->written->truncated files. I'll write an answer, feel free to accept it. – shodanshok Feb 07 '18 at 21:06

1 Answer


This can be a side-effect of your `dd` command unlinking and creating a new `disk_test` at each iteration.

First create the target file with a single `dd if=/dev/zero of=/data/disk_test bs=1M count=2000` command, then run your loop with `dd if=/dev/zero of=/data/disk_test bs=1M count=2000 conv=notrunc,nocreat`.

Explanation: `notrunc` makes a difference because, in the past, a heuristic was added to keep applications that do replace-via-rename or replace-via-truncate and then crash immediately afterwards from corrupting their data. This heuristic basically force-flushes data belonging to open->written->truncated files.
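As a rough sketch of the two behaviours (the path `/tmp/disk_test_demo` and the 64 MiB size are illustrative, chosen so this runs quickly anywhere; on ext4 with the heuristic active you would watch the Dirty counter drain immediately after the first command but linger after the second):

```shell
# Illustrative comparison of truncating vs. in-place overwrite with dd.
TARGET=/tmp/disk_test_demo

# Recreating/truncating the file each run: on ext4 with auto_da_alloc,
# the replace-via-truncate heuristic force-flushes the old contents.
dd if=/dev/zero of="$TARGET" bs=1M count=64

# Overwriting in place (no truncate, no create): no heuristic fires, so
# the data may sit in dirty pages until dirty_expire_centisecs elapses.
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=notrunc,nocreat

# Observe the counters while the commands above run:
grep -E 'Dirty|Writeback' /proc/meminfo
```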

From mount man page:

auto_da_alloc|noauto_da_alloc

Many broken applications don't use fsync() when replacing existing files via patterns such as

fd = open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo")

or worse yet

fd = open("foo", O_TRUNC)/write(fd,..)/close(fd).

If auto_da_alloc is enabled, ext4 will detect the replace-via-rename and replace-via-truncate patterns and force that any delayed allocation blocks are allocated such that at the next journal commit, in the default data=ordered mode, the data blocks of the new file are forced to disk before the rename() operation is committed. This provides roughly the same level of guarantees as ext3, and avoids the "zero-length" problem that can happen when a system crashes before the delayed allocation blocks are forced to disk.
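For contrast, the replace-via-rename pattern done safely looks like the sketch below: the new file's data is flushed before the rename, so a crash can never leave a zero-length `foo`. The paths are illustrative, and `sync FILE` (fsync a single file) requires coreutils 8.24 or later:

```shell
# Safe replace-via-rename: write the new version, flush it, then rename.
printf 'new contents\n' > /tmp/foo.new
sync /tmp/foo.new            # flush just this file's data (fsync equivalent)
mv /tmp/foo.new /tmp/foo     # rename() atomically replaces the old file
cat /tmp/foo                 # -> new contents
```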

Also give a look at the XFS FAQ.

shodanshok
  • Yup, I've confirmed that after remounting with `noauto_da_alloc`, `dd` with no special options behaves the way I expected from the start. Thanks for teaching me something! – psycotica0 Feb 12 '18 at 15:53