0

I couldn't find a definitive answer to my question, so I might as well ask you guys!

Long story short: we need to perform an UPDATE command on roughly 400M rows. I know the command could be modified to work in batches, but that is a different topic. Our problem is that the WAL gets too big and we run out of disk space. I'm wondering how the checkpoint intervals interact with the different WAL levels. To put it simply, the documentation says that a longer checkpoint interval "triggers" fewer full-page writes, which results in a smaller WAL. What I can't find is how this behaves with different wal_level settings.

DB version: PostgreSQL 14.4

1. Does it have any relevance with a minimal wal_level setting? (Considering it removes almost all logging.)

2. Does it break the replicas when wal_level is set to replica or higher? (It isn't obvious to me from the various articles and the documentation, but I assume the replicas should be fine, since all the changes are still logged despite fewer full-page/block writes, and it can also be beneficial, e.g. a smaller WAL.)

We are in a position where a full backup and a shutdown of the related application are possible, so the minimal wal_level setting could work, but I'm interested in other solutions as well; feel free to share your thoughts.

Cheers!

Bylaw
  • 5
  • 6

3 Answers

1

wal_level = minimal won't make a difference. As long as you don't set it to logical, PostgreSQL should produce about the same amount of WAL. It will break replication if you set wal_level to something lower than replica.

The obvious solution is to add more disk space. If the problem is WAL archiving, you can disable archive_mode. If the problem is that checkpoints take too long to complete, you could run a manual CHECKPOINT command.

Increase max_wal_size to reduce the amount of WAL written. Yes, I know that sounds strange, but max_wal_size does not limit the size of pg_wal; it determines when checkpoints are triggered, and more frequent checkpoints increase the number of full-page images written.
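For example, both parameters can be raised without a restart (the values below are only illustrative; pick sizes that actually fit your disk):

```sql
-- Illustrative values only; choose sizes that fit the available disk space.
ALTER SYSTEM SET max_wal_size = '32GB';
ALTER SYSTEM SET checkpoint_timeout = '30min';
SELECT pg_reload_conf();  -- both parameters only need a reload, not a restart

-- If checkpoints can't finish in time, you can also force one manually:
CHECKPOINT;
```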

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • Thank you for your answer! It does make sense to increase max_wal_size; we'll try that out! It basically does the same thing as increasing checkpoint_timeout, right? – Bylaw Mar 07 '23 at 16:18
  • In a way. High `checkpoint_timeout` won't have an effect unless you increase `max_wal_size`. But yes, both work in the direction of having fewer checkpoints. – Laurenz Albe Mar 07 '23 at 16:26
  • We tried a bigger max_wal_size as well, but the WAL again grew unexpectedly large. As @jjanes mentioned, probably the whole update command, with all its parts, goes into the WAL and stays there until the transaction is committed. I had to choose the obvious solution, i.e. slicing the update into smaller batches. That way the WAL has time to empty itself and/or be cleared by vacuum. Thank you for your answer though, it was useful! – Bylaw Mar 09 '23 at 12:20
  • I am still not quite certain where the problem is. WAL gets removed during the next checkpoint, no matter how much you generate. Or are you talking about archived WAL? – Laurenz Albe Mar 09 '23 at 12:54
  • No, archiving was set to off. I assume it is due to the size of the update query (~380M affected rows). The WAL seems to keep it until the transaction finishes. Maybe this overrides the checkpoint logic, I'm not sure, but that is what we saw. With all the settings you and I mentioned, WAL still kept rising heavily. – Bylaw Mar 09 '23 at 13:20
  • No, WAL is absolutely **not** kept until the transaction finishes, unless you are using logical decoding. – Laurenz Albe Mar 09 '23 at 13:22
  • 1
    Weird! I believe you, but then I have no clue what happened. Like I said, the only thing I can think of is that PITR was somehow affecting the WAL. I didn't have more time to try, so I went with a DO block with a LOOP, which updates in smaller batches (see the sketch after these comments). This immediately solved the issue; now the WAL size stays very low. – Bylaw Mar 09 '23 at 13:31
  • One more question: the script updating in smaller batches runs pretty fast, updating 10k rows in roughly 1 sec, but the WAL size is increasing rapidly again until vacuum does its job. It is pretty scary though. Am I right that a shorter checkpoint_timeout would now be better, since the updates are coming in batches instead of one big command? – Bylaw Mar 13 '23 at 15:33
  • I cannot tell from here what exactly you are observing. Perhaps `VACUUM` running in the background generates WAL too. – Laurenz Albe Mar 13 '23 at 17:42
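For reference, a minimal sketch of the kind of batched DO block described in the comments above. The table name (my_table), column (some_column), and the assumption of an integer primary key id are all hypothetical:

```sql
-- Hypothetical table/column names; assumes an integer primary key "id".
DO $$
DECLARE
    batch_size CONSTANT bigint := 10000;
    last_id    bigint := 0;
    max_id     bigint;
BEGIN
    SELECT max(id) INTO max_id FROM my_table;
    WHILE last_id < max_id LOOP
        UPDATE my_table
           SET some_column = 'new value'
         WHERE id > last_id
           AND id <= last_id + batch_size;
        last_id := last_id + batch_size;
        COMMIT;  -- closing each batch lets old WAL be recycled at the next checkpoint
    END LOOP;
END;
$$;
```

Transaction control (COMMIT) inside a DO block is available since PostgreSQL 11, provided the block is not run inside an outer transaction; each committed batch lets the following checkpoint recycle the WAL it produced.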
0

Does it have any relevance with a minimal wal_level setting? (Considering it removes almost all logging.)

It doesn't. With minimal, you only skip WAL logging of a few things, like COPY into a table which was created or truncated in the same transaction, or the creation of indexes. Those special cases wouldn't apply to a bulk UPDATE.
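To illustrate the kind of case that does benefit (table and file names below are made up): a COPY into a table created in the same transaction can skip WAL under wal_level = minimal, whereas an UPDATE of an existing table is always fully logged:

```sql
-- Hypothetical table and file; this pattern can skip WAL with wal_level = minimal,
-- because the table is new in the same transaction.
BEGIN;
CREATE TABLE staging_copy (id bigint, payload text);
COPY staging_copy FROM '/tmp/data.csv' WITH (FORMAT csv);
COMMIT;

-- An UPDATE on a pre-existing table is fully WAL-logged regardless of wal_level.
```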

To solve the problem, you first need to figure out what the root problem is. Are you so close to the out-of-space condition under normal conditions that any stress at all can push you over? Do you have replication slots whose standbys can't keep up? Do you have an archive_command that can't keep up? Is your I/O system so overwhelmed that the checkpoints can't finish in time despite trying as fast as they can? Is your max_wal_size writing checks your hard drive can't cash?
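A few standard catalog queries can help narrow that down, for example:

```sql
-- Is a replication slot holding WAL back?
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM   pg_replication_slots;

-- Is archiving failing or lagging?
SELECT archived_count, failed_count, last_failed_wal
FROM   pg_stat_archiver;

-- How big is pg_wal right now?
SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();
```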

jjanes
  • 37,812
  • 5
  • 27
  • 34
  • I did not know that about the minimal wal_level. That is consistent with the results we got while trying different settings without success. There are no replicas, and archiving is turned off as well. One more thing we can think of is that something triggers the PITR mechanism in the background and makes the logging much more detailed. Regardless, I wrote a script which updates in batches, so the WAL will have time and space to get emptied. – Bylaw Mar 09 '23 at 12:26
0

Make sure of a few points:

  • that checkpoint_timeout and max_wal_size are not too small.
  • that your archive_command is working.

These points are important to avoid flooding the I/O system.
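For example, both can be checked with built-in commands and views (queries below only read existing settings and statistics):

```sql
-- Current checkpoint-related settings:
SHOW checkpoint_timeout;
SHOW max_wal_size;

-- A non-zero failed_count means archive_command is failing:
SELECT failed_count, last_failed_wal, last_failed_time
FROM   pg_stat_archiver;
```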

Jessé Filho
  • 41
  • 1
  • 7