Before starting, I want to put a few things at the top:
checkpoint_timeout setting from its default of 5 minutes. Users who read tuning advice on the internet (or get bad AI advice) about increasing this setting usually don’t understand the risk and significance of trading for RTO/availability. Let’s be honest: many of us don’t scrutinize RTO until we have a real incident and suddenly realize that there is no way to get our application back online. Turns out it does matter to your boss if you’re down! Increasing this parameter can turn a short outage into a long multi-hour outage.log_checkpoints is enabled. This has been the default since Postgres v15.max_wal_size if checkpoints are frequent. Remember that it’s ok if there are short occasional bursts of write activity. (It’s also completely ok – even healthy – if disks have bursts that hit the peak IOPS.) The thing to watch for is real user & application impact due to extended throttling.Checkpoint is the heart of your database. It’s buried deep inside. It’s not something everyone talks about, like well-tuned autovacuum or fast queries. But if checkpointer stops beating, then you’re dead.
In addition to its well-understood job of getting dirty pages written from cache to disk in the background, it also has many smaller jobs that are less widely known. Management of a few shared-memory config settings like sync_standby_names and full_page_writes. Fsync batching. Deferred file unlinks. Enforcement of archive_timeout.
A few years ago, I added a happiness hint to have an alarm on the “time since latest checkpoint”. This was partly due to an incident I saw many years ago but which I never managed to blog about. I saw another checkpoint related incident recently, so I thought I’d gather these thoughts together. Both incidents reveal an important lesson in hindsight: it’s dangerous to restart a database when there are checkpoint problems.
Lets see what I can remember about that original incident. It was an ugly 40 hour production outage that happened back in the postgres version 12 era. Someone started getting errors and restarted their postgres database to try to remediate, and the database simply never came back up.
I remember that there were four different things which all combined to make this incident so bad:
DROP <relation> statements, the system accumulated 27 million files on the filesystem until it ran out of inodes (before the database was restarted).The reason the database never came back up was that WAL replay proceeded at a crawl (due to buffer cache scans), and after running for many hours it would inevitably cause the system to run out of inodes again. It would crash before it could complete recovery. When it restarted again, it lost all progress and went back to the beginning.
Eventually we shrank the buffer cache to speed up replay, but recovery was still incredibly slow – and at the end of the day, as stated above, this ended up being almost two days of production downtime.
At the root of this was a severely lagging checkpointer.
The next incident was more recent and involved a hot standby instance, and a multi-hour outage for that hot standby instance. This time it was a CloudNativePG database running on Kubernetes – but similarly to before, the application was experiencing problems and a database restart was triggered as an early remediation. The hot standby instance never came back up.
Again, a few different factors combined to cause the incident:
Continue reading
Recent Comments