
Summary

I am using Postgres UPSERTs in our ETLs and am seeing fragmentation and bloat on the tables I write to, which is slowing down all operations, including reads.

Context

I have hourly batch ETLs upserting into tables (the tables hold tens of millions of rows; each run upserts tens of thousands of rows), and autovacuum thresholds are tuned on AWS.
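For reference, the upserts look roughly like this (the table and column names here are simplified placeholders, not our real schema):

    -- Simplified placeholder table; the real tables are much wider.
    CREATE TABLE IF NOT EXISTS core_metrics (
        id          bigint PRIMARY KEY,
        customer_id bigint,
        amount      numeric,
        updated_at  timestamptz
    );

    -- Each hourly batch upserts tens of thousands of rows with this shape
    -- (in practice via a staging table / multi-row VALUES).
    INSERT INTO core_metrics (id, customer_id, amount, updated_at)
    VALUES (1, 42, 100.00, now())
    ON CONFLICT (id)
    DO UPDATE SET customer_id = EXCLUDED.customer_id,
                  amount      = EXCLUDED.amount,
                  updated_at  = EXCLUDED.updated_at;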

I have had to run VACUUM FULL to reclaim the space and prevent processes from hanging. This has been exacerbated now that the frequency of one of our ETLs has increased; it populates some core tables which are the source for a number of denormalised views. It seems that the tables don't have a chance to be vacuumed before the next ETL run, creating a spiral which eventually leads to a complete slow-down.

Question!

Does UPSERT fundamentally have a negative impact on fragmentation, and if so, what are other people using instead? I am keen to introduce some materialised views and move most of our indexes onto them, retaining only the PK index on the tables we write to, but I'm not confident that this will resolve the bloat I'm seeing.
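Roughly what I have in mind, with placeholder view and column names (the real aggregations are more involved):

    -- Move the read-side aggregation and its indexes onto a materialised view,
    -- keeping only the PK index on the table the ETL writes to.
    CREATE MATERIALIZED VIEW mv_daily_metrics AS
    SELECT customer_id,
           date_trunc('day', updated_at) AS metric_day,
           sum(amount)                   AS total_amount
    FROM   core_metrics
    GROUP  BY customer_id, date_trunc('day', updated_at);

    -- A unique index is required for REFRESH ... CONCURRENTLY, and the
    -- reporting indexes would live here rather than on core_metrics.
    CREATE UNIQUE INDEX ON mv_daily_metrics (customer_id, metric_day);

    -- Refreshed after each ETL run; CONCURRENTLY avoids blocking readers.
    REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_metrics;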

I've done a bit of reading on the issue but haven't found anything conclusive; for example: https://www.targeted.org/articles/databases/fragmentation.html

Thanks for your help

Rob
  • @LaurenzAlbe gives good advice below. Also, make sure that there are no long running transactions that would prevent autovacuum from cleaning up the tables. – Jeremy Jan 14 '20 at 13:17
  • "auto vacuums set to thresholds" What does that mean? – jjanes Jan 14 '20 at 15:05
  • @jjanes I mean to say, vacuum thresholds are set appropriately in AWS. Large tables with very low scale factors, with a reasonable threshold to avoid having to vacuum the small tables. – Rob Jan 14 '20 at 16:26

2 Answers


It depends. If the row is inserted without a conflict, INSERT ... ON CONFLICT won't cause any bloat. If it takes the update path instead, it produces a dead row, just like any other UPDATE.

The measures you can take:

  • set autovacuum_vacuum_cost_delay = 0 for faster autovacuum

  • use a fillfactor somewhat less than 100 and keep the updated columns out of any index, so that you can get HOT updates, which make autovacuum largely unnecessary (both settings are sketched below)
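For example, per-table settings, using the table name from the question (adjust the fillfactor to your update pattern; on RDS the global autovacuum_vacuum_cost_delay can also be changed in the parameter group):

    -- Let autovacuum run at full speed on this table, and leave free space
    -- in each page so updates can be HOT.
    ALTER TABLE core_metrics SET (
        autovacuum_vacuum_cost_delay = 0,
        fillfactor                   = 90
    );

    -- Note: the new fillfactor only applies to pages written afterwards;
    -- existing pages need a one-off rewrite (e.g. VACUUM FULL or pg_repack).
    -- HOT updates also require that none of the updated columns are indexed.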

Laurenz Albe

It is not clear what you are actually seeing. Can you turn track_io_timing on, and then do an EXPLAIN (ANALYZE, BUFFERS) for the query that you think has been slowed down by bloat?
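Something like this (the query is just a stand-in for whichever one you think is affected; track_io_timing needs superuser rights, or a parameter-group change on RDS):

    -- With track_io_timing on, the BUFFERS output also shows I/O read/write
    -- times, which tells you whether the slowdown really comes from reading
    -- extra pages (bloat) or from something else.
    SET track_io_timing = on;

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*)
    FROM   core_metrics
    WHERE  updated_at >= now() - interval '1 day';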

Bloat and fragmentation aren't the same thing. Fragmentation is more an issue with indexes under some conditions, not the tables themselves.

"It seems like what is happening is that tables don't have a chance to be vacuumed before the next ETL run"

This one could be very easy to fix. Run a "manual" VACUUM (not VACUUM FULL) at the end or the beginning of each ETL run. Since you have a well-defined workflow, there is no need to try to get autovacuum to do the right thing; it should be very easy to inject manual vacuums into your workflow. Or do you think that one VACUUM per ETL run is overkill?
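For example, as the last step of each ETL run (table name taken from the question):

    -- Plain VACUUM marks dead tuples reusable without taking an exclusive
    -- lock (unlike VACUUM FULL); ANALYZE refreshes planner statistics after
    -- the batch of upserts.
    VACUUM (ANALYZE) core_metrics;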

jjanes