>
Home

Latest Post

Happiness Hint: Alarm on Checkpoint Time

Before starting, I want to put a few things at the top:

  1. You probably should not change the checkpoint_timeout setting from its default of 5 minutes. Users who read tuning advice on the internet (or get bad AI advice) about increasing this setting usually don’t understand the risk and significance of trading for RTO/availability. Let’s be honest: many of us don’t scrutinize RTO until we have a real incident and suddenly realize that there is no way to get our application back online. Turns out it does matter to your boss if you’re down! Increasing this parameter can turn a short outage into a long multi-hour outage.
  2. Always make sure log_checkpoints is enabled. This has been the default since Postgres v15.
  3. With a default checkpoint_timeout setting of 5 minutes, I’m leaning toward a default alarming threshold of 15 minutes on total checkpoint time. You may need to increase disk hardware specs or optimize database workload, if checkpoints take too long.
    • There are two ways this alarm can be built – either by parsing checkpoint messages after they appear in the log, or by looking at the current time and checking how long since the last checkpoint message appeared. The latter is a more robust solution because it will alarm even if no checkpoint message is ever produced.
  4. Don’t forget to also monitor for “checkpoints are occurring too frequently” messages in the Postgres log. In some cases, you might consider increasing max_wal_size if checkpoints are frequent. Remember that it’s ok if there are short occasional bursts of write activity. (It’s also completely ok – even healthy – if disks have bursts that hit the peak IOPS.) The thing to watch for is real user & application impact due to extended throttling.

Checkpoint is the heart of your database. It’s buried deep inside. It’s not something everyone talks about, like well-tuned autovacuum or fast queries. But if checkpointer stops beating, then you’re dead.

In addition to its well-understood job of getting dirty pages written from cache to disk in the background, it also has many smaller jobs that are less widely known. Management of a few shared-memory config settings like sync_standby_names and full_page_writes. Fsync batching. Deferred file unlinks. Enforcement of archive_timeout.

A few years ago, I added a happiness hint to have an alarm on the “time since latest checkpoint”. This was partly due to an incident I saw many years ago but which I never managed to blog about. I saw another checkpoint related incident recently, so I thought I’d gather these thoughts together. Both incidents reveal an important lesson in hindsight: it’s dangerous to restart a database when there are checkpoint problems.

The First Incident

Lets see what I can remember about that original incident. It was an ugly 40 hour production outage that happened back in the postgres version 12 era. Someone started getting errors and restarted their postgres database to try to remediate, and the database simply never came back up.

I remember that there were four different things which all combined to make this incident so bad:

  1. The workload made heavy use of unlogged tables. It created and dropped tables at a high rate.
  2. Somehow there had not been a successful checkpoint on this database for a week. There was insufficient data to figure out why, after we finally got the system back online.
  3. Remember that one job of checkpointer is to handle deferred file unlinks. Over the course of the week, while checkpointer was stalled and never unlinked any of the underlying files for the DROP <relation> statements, the system accumulated 27 million files on the filesystem until it ran out of inodes (before the database was restarted).
  4. The database was on a large server and had a reasonably large buffer cache. In Postgres, any time a relation is dropped, the entire buffer cache is scanned to remove blocks belonging to that relation. (Some other databases maintain an index type of structure for this but Postgres does not.) This includes unlogged tables and their indexes.
  5. When Postgres starts, the first thing it needs to do is replay write-ahead logs starting at the last successful checkpoint. This database needed to replay more than twenty-five thousand WAL files.

The reason the database never came back up was that WAL replay proceeded at a crawl (due to buffer cache scans), and after running for many hours it would inevitably cause the system to run out of inodes again. It would crash before it could complete recovery. When it restarted again, it lost all progress and went back to the beginning.

Eventually we shrank the buffer cache to speed up replay, but recovery was still incredibly slow – and at the end of the day, as stated above, this ended up being almost two days of production downtime.

At the root of this was a severely lagging checkpointer.

The Second Incident

The next incident was more recent and involved a hot standby instance, and a multi-hour outage for that hot standby instance. This time it was a CloudNativePG database running on Kubernetes – but similarly to before, the application was experiencing problems and a database restart was triggered as an early remediation. The hot standby instance never came back up.

Again, a few different factors combined to cause the incident:

Continue reading

What is Ardent?

ADJECTIVE:
1. Warmth of feeling; passionate
2. Strong enthusiasm or devotion; fervent
3. Burning/fiery or glowing/shining
(American Heritage Dictionary)

Social

As of 2025: I'm on LinkedIn most. Also Slack and Discord but don't have Discord invite links handy. I check Twitter/X on occasion. Haven't been on IRC regularly since the old days, before the PG folks moved to Libera. I've de-supported all other (old) social accounts listed here, but I'll keep them handy for the Zombie Apocalypse.

LinkedIn: linkedin.com/in/ardentperf/
Slack: jer_s@pgtreats.info/slack-invite

Twitter/X: jer_s
IRC: jer_s@FreeNode (#postgresql, #ansible, #oracle, ##oracledb)
AIM, MSN, Google: jeremy.schneider@ardentperf.com
Yahoo: ardentperf
ICQ: 614052660

Disclaimer

This is my personal website. The views expressed here are mine alone and may not reflect the views of my employer.

contact: 312-725-9249 or schneider @ ardentperf.com


https://about.me/jeremy_schneider

oaktableocmaceracattack

(a)

Enter your email address to receive notifications of new posts by email.

Join 75 other subscribers