Ardent Performance Computing

PGConf.dev 2026 Trip Summary

Jeremy — Tue, 26 May 2026 15:46:04 +0000

I’m back home from Vancouver. What a great week – in every way. I’ll try to share a few highlights here.

Updated Happiness Hints

First and foremost: after many years, the Happiness Hints have received a major update! Before the conference, I updated the hints based on all the feedback I’ve collected over the past few years. Then the hints were updated into a poster format and we printed it as part of the pgconf.dev poster session. Throughout the week, I continued collecting more feedback. I used a sharpie during the conference and marked up the poster with ideas. Special thanks to Laurenz Albe, David Rader, Sami Imseih, Ryan Booz and Nik Samokhvalov (Nik you weren’t at the conference but a happiness hint resulted from other discussions we’ve had). Of course I’m forgetting more people who gave feedback making the happiness hints better. After coming home from the conference, I incorporated all the notes I had – and the version that’s now published here at ardentperf.com is the latest & best version I’ve assembled so far.

Physical Replication and Postgres High Availability

An extraordinary number of postgres users rely on physical replication for high availability. It’s been around for a long time and it works well. Nonetheless, there are a few rough edges and over the years there have been various mailing list threads that haven’t fully been resolved.

I proposed a Friday unconference session on this topic, and the topic received enough votes to be selected. Notes from the unconference are available on the Postgres wiki. But the discussion extended far beyond the unconference; there were also hallway discussions over coffee (thanks Thomas Munro) and then continuing discussions over dinner at Joey Burrard and beers at Steamworks (thanks Ants Aasma).

The first question that everybody asks is “should postgres have more HA capabilities in core?” A discussion along these lines consumed the first half of the unconference.

But I thought the most interesting train of thought was something more incremental – an idea that Postgres is missing a fundamental/overarching concept or first principle which could make a lot of problems easier to solve – a concept around cluster topology. There are a few ways this could look. A function or a view like pg_nodes or something? The ability on a hot standby to query for all of the replicas in the topology? How about a function that could be called on a hot standby when the primary is unreachable and return a list of potential candidates for promotion to be a new primary?
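To make the "list of promotion candidates" idea a little more concrete, here is a minimal sketch in Python. Nothing here exists in Postgres today: the `Node` record and `promotion_candidates()` are hypothetical names, and the sketch assumes each node's reachability and last replayed LSN are already known somehow. The only real detail is the textual LSN format (two hexadecimal halves separated by a slash).

```python
from dataclasses import dataclass


def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to an integer byte position."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass(frozen=True)
class Node:
    # Hypothetical topology record; not an existing Postgres catalog or view.
    name: str
    reachable: bool
    replay_lsn: str  # textual LSN, in the form pg_last_wal_replay_lsn() prints


def promotion_candidates(nodes):
    """Return reachable nodes, most advanced replay position first."""
    live = [n for n in nodes if n.reachable]
    return sorted(live, key=lambda n: parse_lsn(n.replay_lsn), reverse=True)


nodes = [
    Node("standby-a", True, "16/B374D848"),
    Node("standby-b", True, "16/B3750000"),
    Node("standby-c", False, "16/B3760000"),  # furthest ahead, but unreachable
]
candidates = promotion_candidates(nodes)
print([n.name for n in candidates])
```

A real implementation would also have to decide what "reachable" means from the point of view of each standby, which is exactly the kind of question a shared topology concept would need to answer.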

My own idea is to consider the set of nodes in synchronous_standby_names as the “cluster” or “herd” of instances. (Jeff didn’t like the name “herd” but the word “cluster” already means something else in postgres…) Maybe we can let people set the number to “0” if they want a cluster with async replication. Which brings us to another challenge – managing changes to this parameter. First, how do we know the exact moment when every single connection and session is aware of a new set of cluster members? Remember that individual connections are responsible to ensure transactions are replicated before acknowledging commits to clients. Second, how could we ensure that all of the replicas know about changes when adding or removing cluster members?
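As a rough illustration of treating synchronous_standby_names as a membership list, the sketch below parses the three documented forms of the setting (`FIRST n (...)`, `ANY n (...)`, and the plain comma-separated list) and computes which members a change adds or removes. The class and function names are hypothetical, quoted names containing commas are not handled, and the "cluster" interpretation is only the idea described above, not current Postgres behavior.

```python
import re
from dataclasses import dataclass

_PATTERN = re.compile(r"^\s*(?:(FIRST|ANY)\s+)?(\d+)\s*\((.*)\)\s*$", re.IGNORECASE)


@dataclass(frozen=True)
class SyncStandbyConfig:
    method: str      # "FIRST" or "ANY"
    num_sync: int
    members: tuple   # standby names in listed order


def parse_sync_standby_names(value: str) -> SyncStandbyConfig:
    """Parse a simplified synchronous_standby_names value."""
    match = _PATTERN.match(value)
    if match:
        method = (match.group(1) or "FIRST").upper()
        num_sync = int(match.group(2))
        names = match.group(3)
    else:
        # The plain list form is documented as equivalent to FIRST 1 (...).
        method, num_sync, names = "FIRST", 1, value
    members = tuple(n.strip().strip('"') for n in names.split(",") if n.strip())
    return SyncStandbyConfig(method, num_sync, members)


def membership_change(old: str, new: str):
    """Return (added, removed) member names between two settings."""
    before = set(parse_sync_standby_names(old).members)
    after = set(parse_sync_standby_names(new).members)
    return sorted(after - before), sorted(before - after)


added, removed = membership_change("ANY 2 (s1, s2, s3)", "ANY 2 (s1, s3, s4)")
print(added, removed)
```

Computing the difference is the easy part. The hard questions raised above are about when every backend and every replica has actually observed the new membership.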

There are also challenges around logical replication slots (like losing them after two failovers in a row, or the inability to replicate them at all if decoding from standbys) – could a new cluster concept help? A new cluster concept also might help around managing backups of WAL across a cluster. Lots of interesting ideas!

Wait Events and Physical Reads and pg_stat_statements

A handful of short discussions with Sami Imseih and Lukas Fittl. First off, Sami has some patches for pg_stat_statements that I’m pretty excited about. Improving concurrency around the LWLock and looking for ways to optimize the situation with the query text file.

Second, I had a few chats around physical reads. Right now I’m using the pg_stat_kcache extension to get data on physical reads. Postgres itself only reports reads that it requests from the OS, and many of those are served from the OS page cache rather than from disk. There’s ongoing work around direct IO, and also Postgres 18 will get a new AIO feature… and I’m curious if pg_stat_kcache will be able to get data about the background IO workers in pg18. There was some concern around the overhead of calling getrusage() too frequently; some benchmarking would be good, to determine if the overhead is too high to get per-query physical reads from AIO workers. (I’m expecting io_uring to be unavailable in many containerized environments; I think that GKE servers and also Docker’s default seccomp profile disable it.) I wonder if some users will want to disable the IO workers purely so they can continue getting physical read stats.

Third were a few small side conversations on better observability around Wait Events and Locks. For wait events, I think that we should be able to add counters to keep track of the number of times every wait event is called and the total duration for each wait event. But what about LWLock Wait Events? Too much overhead? It turns out that LWLocks don’t register wait events if they can quickly acquire a lock. They only register a wait if they actually relinquish the CPU to wait on a semaphore – so I think the overhead of maintaining counters on waits might be acceptable (and essential for debugging). Separately from this, I think that we also might be able to find a way to count the total number of times each LWLock is acquired – but it would need to be very efficient to be enabled all the time. (Postgres has LWLOCK_STATS already as a build flag but it’s not typically enabled.) I suspect we might want counters that are local to each process, and only aggregate them to central stats at some conservative interval.
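The "local counters, aggregated at a conservative interval" idea can be sketched in a few lines of Python. This is only an illustration of the pattern: `LocalWaitCounters` is a hypothetical name, the shared stats are a plain dict rather than shared memory, and the sketch is single-threaded, so it says nothing about the real cost inside a Postgres backend.

```python
from collections import defaultdict


class LocalWaitCounters:
    """Per-process wait counters that are merged into shared stats only occasionally."""

    def __init__(self, flush_interval_s: float):
        self.flush_interval_s = flush_interval_s
        self._pending = defaultdict(lambda: [0, 0.0])  # event -> [count, total seconds]
        self._last_flush = 0.0

    def record(self, event: str, duration_s: float) -> None:
        # Hot path: touches process-local memory only.
        entry = self._pending[event]
        entry[0] += 1
        entry[1] += duration_s

    def maybe_flush(self, shared: dict, now: float) -> bool:
        # Cold path: merge into the shared totals at most once per interval.
        if now - self._last_flush < self.flush_interval_s:
            return False
        for event, (count, total) in self._pending.items():
            agg = shared.setdefault(event, [0, 0.0])
            agg[0] += count
            agg[1] += total
        self._pending.clear()
        self._last_flush = now
        return True


shared_stats = {}
counters = LocalWaitCounters(flush_interval_s=1.0)
counters.record("LWLock:WALWrite", 0.002)
counters.record("LWLock:WALWrite", 0.003)
flushed_early = counters.maybe_flush(shared_stats, now=0.5)  # too soon, nothing merged
flushed_late = counters.maybe_flush(shared_stats, now=1.5)   # interval elapsed, merged
print(shared_stats)
```

The trade-off is staleness: the central view can lag by up to one flush interval, which seems acceptable for counters that are mainly read by monitoring tools.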

Collation

It would hardly be a postgres conference if Jeff Davis and I didn’t have at least one conversation about Collation where we both insist that we’re now retired from collation work, then spend an hour debating how to best move Postgres forward.

Amazing work was done. But also, there is still more work to do.

The big problem nobody’s talking about is that language changes. ICU needs to be upgraded. Linguistic sort order is like time zones. It’s rare, but it changes – and when the sort order changes, all your indexes become invalid. Postgres does not have any good story yet for ICU upgrades.

Postgres now has a builtin stable code-point-order collation (pg_c_utf8). It’s possible to set this as the database default and do your linguistic sorting at the expression or column level. (Which you should! And it’s a happiness hint!) But let’s be real: users in non-English languages don’t want to go through their entire schema or application adding COLLATE "fr_FR.utf8" everywhere.

The million-dollar question is “what do users really want?”

Can we come up with some limited “client locale” concept that gives users default behavior according to their client locale, while the database itself (and all indexes) operate with pg_c_utf8 collation? Maybe users only really care about ordering of results? The ORDER BY matters to them, but they actually might not really expect or care about the less-than operator? I think the ideal behavior is somehow that indexes are always created with pg_c_utf8 collation, while users can have a good experience that doesn’t require adding COLLATE clauses everywhere. The challenge is how to figure a way that pg_c_utf8 indexes can be used most of the time.
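Here is a toy Python sketch of that split: lookups go through a structure kept in code point order, and linguistic ordering is applied only to the final result. The `linguistic_key()` function is a deliberately crude stand-in (strip accents, fold case) and is not how ICU computes sort keys; the sorted list plays the role of an index. All names are illustrative.

```python
import bisect
import unicodedata


def linguistic_key(text: str):
    """Crude stand-in for a linguistic sort key: ignore accents and case, then tie-break."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)


raw = ["Zoé", "eve", "Émile", "zack", "Adam", "étienne"]
names = [unicodedata.normalize("NFC", n) for n in raw]

# The "index": plain code point order, which never changes with locale data.
index = sorted(names)


def index_prefix_scan(idx, prefix):
    """Range scan on the code point index for values starting with prefix."""
    lo = bisect.bisect_left(idx, prefix)
    hi = bisect.bisect_left(idx, prefix + "\U0010FFFF")
    return idx[lo:hi]


# Lookup uses the stable index; only the final ORDER BY step is linguistic.
matches = index_prefix_scan(index, "z")
ordered = sorted(names, key=linguistic_key)
print(matches)
print(ordered)
```

Equality and prefix lookups work fine on the code point index. The hard case is a range predicate such as a less-than comparison under linguistic rules, which the code point index cannot answer directly.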

FWIW, Oracle takes a very interesting (if pragmatic) approach here – they just list all the operators, and some default to binary/codepoint collation while others default to the client locale. (Of course collation can always be explicitly specified; this is just for defaults.) Indexes are always created binary/codepoint and indexes are generally used by queries. In Postgres, could the bttextcmp() function be tweaked somehow so that it can use pg_c_utf8 indexes by default even when the user requests linguistic collation? Or could we look at query execution plans and only apply linguistic collation to top-level nodes somehow? Crazy ideas, not sure any of it works, we’re still brainstorming.

Lightning Talks and Dinner Groups

Two final brief mentions. This year, Masahiko Sawada and I organized the Lightning Talks. First time I’ve done it. We mostly just followed the same process which had been used last year – there was a very helpful google doc which we followed (and updated). From 29 total submissions, we randomly chose 12. Four people used green cards to indicate a new/inexperienced speaker and we made sure that 2 of those were included. Every speaker gets 5 minutes max!

I learned that originally, Lightning Talks at pgcon were first-come, first-served. As the conference grew, the Lightning Talks switched to random selection. Submissions were done at the conference by putting a note card into a box with your name and topic. I overheard a little discussion on Friday around whether lightning talks should move to a model of online submissions in the future, maybe ahead of time, more like a real CFP with a selection process instead of purely random selection.

I’m new here, but I do think there’s something that feels a little more authentic when it’s a physical submission at the conference and a random selection. Fits with the theme of this conference – ample time for hallway discussions and impromptu topics. Online submission feels a bit different; there are pros and cons both ways.

One final thing this year was that Paul Ramsey organized dinner groups on Tuesday and Thursday. What a fantastic idea! I was part of dinner groups on both days and really enjoyed meeting new people and having some great conversations. I forgot to get a picture on Thursday, but here’s the group from Tuesday.

I wasn’t originally planning to attend this conference – and I’m very glad that I decided to go. I hope I’m able to attend another pgconf.dev in the future!

Zero autovacuum_vacuum_cost_delay, Write Storms, and You

Jeremy — Mon, 13 Apr 2026 05:10:35 +0000

A few days ago, Shaun Thomas published an article over on the pgEdge blog called [Checkpoints, Write Storms, and You]. I left a few comments [on LinkedIn]. This article is a great technical read about an important and overlooked topic – and I always love seeing real test results to illustrate the details.

I don’t have any reproducible real test results today. But I have a good story and a little real data.

Vacuum tuning in Postgres is considered by some to be a dark art. Few confidently say: “Yes I know the right value for autovacuum_vacuum_cost_delay.” The documentation gives guidance, blog posts give opinions. Eventually I thought, “Ok what’s the worst that could happen if this one was set to zero?”

The story starts with some unexplained, intermittent application performance problems. We were doing some internal benchmarking to see just how far we could push a particular stack and see how much throughput a specific application could get. Everything hums along fine until suddenly – latency would spike across the board and the application would choke, causing backlogs and work queues to blow up throughout the system.

Where do you start when you have application performance problems? Wait Events and Top SQL – always! I’m far from the first person to evangelize this idea; I’ve said many times that wait events and top SQL are almost always the fastest way to discover where the bottlenecks are when you see unexpected performance problems. My [2024 SCaLE talk about wait events] gets into this.

So naturally I dug into the wait events and top SQL – and I noticed these slowdowns lined up perfectly with spikes in COMMIT statements on IPC:SyncRep waits. This wait event is not well understood. Last October I published an article [Explaining IPC:SyncRep – Postgres Sync Replication is Not Actually Sync Replication] with more explanation – but essentially it means the replicas were lagging behind and the primary was blocking on commit acknowledgments.

Notice how there are periodic spikes of hundreds of connections waiting on IPC:SyncRep for this system during the test runs: (nb. the plain colon represents CPU time)

That led me to check network traffic, which showed corresponding bursts of traffic between the primary and replicas. Something was periodically creating giant spikes of WAL.

So, I went hunting in the WAL itself. Using pg_walinspect on Postgres 16, I broke down records by resource manager and found massive surges from XLOG; specifically from full-page image (FPI) writes. These weren’t steady; they came in waves and caused serious commit latency waiting for downstream replication.

Here’s a graph of the record_size and fpi_size bytes per resource type during two benchmark runs:

I dumped the WAL and in the first sample I see it’s dominated by FPI_FOR_HINT blocks in sequential order from a specific 40GB toast table. I only see INSERT in pg_stat_statements for this table.

This confused me. Looking through Postgres source code, two possible sources I saw were log_newpage*() and MarkBufferDirtyHint() and I thought: where are these hint updates coming from? A Postgres SELECT can dirty pages by setting tuple hint bits when it reads rows whose inserting or deleting transaction has committed but whose visibility status has not yet been cached in the tuple header, which commonly happens after recent inserts, updates, deletes, or other write activity. Some napkin math suggested that 20,000 tuple ins/upd/del per second can dirty 10GB in one minute with hints. (We were close to 20k tuples/s in the run on the left side; second workload is over 30k/s.) Maybe spikes in dirty buffers were triggering forced checkpoints?
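The napkin math can be written out explicitly. This is a worst-case sketch: it assumes the default 8 kB block size and that every modified tuple lands on a different page, which overstates things for rows that share a page.

```python
BLOCK_SIZE_BYTES = 8192  # default Postgres block size


def dirty_bytes_per_window(tuples_per_second: int, seconds: int) -> int:
    """Worst case: each modified tuple dirties its own page via a hint bit update."""
    pages = tuples_per_second * seconds
    return pages * BLOCK_SIZE_BYTES


estimate = dirty_bytes_per_window(20_000, 60)
print(f"{estimate / 1e9:.1f} GB dirtied in one minute")
```

At 20,000 tuples per second that is 1.2 million pages in a minute, or a little under 10 GB.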

But after taking a look, the problem here wasn’t checkpoints themselves. A lot more checkpoints were happening than WAL spikes and there was no correlation between the timings. (You should still read Shaun’s blog though!)

But this is where the trail leads me to start thinking about autovacuum. From autovacuum logs (always enable these) I could see that the timing of autovacuum runs aligned perfectly with each WAL storm.

And then I had another realization: autovacuum_vacuum_cost_delay was set to 0 on this system. Crazy theory number two: vacuum is setting hint bits on an append only table very fast, and maybe the longer the gap between checkpoint and vacuum, the worse the damage? Remember that the first time a page is modified after checkpoint, the full page is written into the WAL to protect from torn writes during system failures (because the OS block size usually doesn’t align with the database block size). Even if the update is just setting a hint bit – the WAL record can include the full 8k database block.

Without any cost throttling, autovacuum was racing through large tables at full speed, dirtying pages with hint bit updates – writing the full pages to the WAL faster than the system could replicate them. That triggered bursts of WAL traffic, replication lag, and the intermittent major performance hiccups that had started this whole chase.

We reverted autovacuum_vacuum_cost_delay to its default 2ms, reran the workload, and everything smoothed out beautifully. You can still see the XLOG records generated by autovacuum, but they were more spread out. The WAL volume didn’t swing as wildly, replication didn’t crash as dramatically, and application latency spikes no longer overwhelmed backlogs and work queues. There was still variance in the performance – but we could tune the application to handle it, and we got much higher overall throughput without tipping everything over.
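For a feel of how much the cost delay limits a single worker, here is a back-of-the-envelope model in Python. It assumes what I understand to be the stock settings on recent versions (vacuum_cost_limit 200, vacuum_cost_page_miss 2, vacuum_cost_page_dirty 20), that every page is both read from outside shared buffers and dirtied, and that each dirtied page produces a full 8 kB page image. It ignores the time spent doing the actual work and the balancing of the limit across multiple workers, so treat the result as a loose ceiling and not a prediction.

```python
BLOCK_SIZE_BYTES = 8192


def max_vacuum_pages_per_second(cost_limit=200, cost_delay_ms=2.0,
                                page_miss=2, page_dirty=20):
    """Upper bound on pages per second when vacuum sleeps after each cost_limit of work."""
    if cost_delay_ms <= 0:
        return float("inf")  # throttling disabled: no ceiling from the cost model
    cost_per_page = page_miss + page_dirty
    pages_per_round = cost_limit / cost_per_page
    rounds_per_second = 1000.0 / cost_delay_ms
    return pages_per_round * rounds_per_second


throttled = max_vacuum_pages_per_second()
unthrottled = max_vacuum_pages_per_second(cost_delay_ms=0)
print(f"2ms delay: about {throttled:.0f} pages/s, "
      f"{throttled * BLOCK_SIZE_BYTES / 1e6:.1f} MB/s of page images at most")
print(f"0ms delay: {unthrottled} pages/s")
```

Under these assumptions the 2ms default caps a worker at a few tens of megabytes per second of full-page images, while a delay of zero removes the cap entirely and leaves the hardware as the only limit.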

In hindsight, I remember seeing that setting early on and thinking,

“it’s a big server & workload, cost delay 0 won’t do anything that bad, won’t completely burn down the server, so I can probably let that one stay where it is”

I was completely wrong.

Moral of the story:
Vacuum tuning may feel like a dark art, but the defaults exist for good reason. Even one millisecond of cost delay keeps autovacuum from overwhelming the system and flooding WAL. Checkpoints and pg_repack and materialized view refreshes aren’t the only things that cause write storms; autovacuum can cause them too.

In other words: resist the temptation to go full throttle – your replicas and your applications and your future-self will thank you.

Database Schema Migrations in 2026 – Survey

Jeremy — Thu, 26 Mar 2026 05:31:12 +0000

What is the best way to manage database schema migrations in 2026?

Since this sort of thing is getting easier with AI tooling, I spent some time doing a survey across a bunch of recognizable multi-contributor open source projects to see how they do database schema change management.

Biggest takeaway: the framework provided by your programming language is the most common pattern. After that seems to be custom project-specific code. Even while Pramod Sadalage and Martin Fowler’s twenty-year-old general evolutionary pattern is followed, I was surprised to see very few occurrences of the specific tools they listed in their 2016 article about Evolutionary Database Design. Those tools might be used behind some corporate firewalls, but they aren’t showing up in collaborative open source projects.

Second takeaway: it should be obvious that we still have schema migrations with document databases and distributed NoSQL databases; but lots of interesting illustrations here of what it looks like in practice to deal with document models and NoSQL schemas as they change over time. My recent comment on an Adam Jacob LinkedIn post: “life is great as long as changing your schema can remain avoidable (ie. requiring some kind of migration).”

What about the method of triggering the schema migrations? The most common pattern is that the application process itself triggers schema migration. After that we have kubernetes jobs.
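As a minimal illustration of the "application migrates itself on startup" pattern, here is a Python sketch using SQLite's `PRAGMA user_version` as the version marker and an ordered list of statements as the migration chain. The table and the `migrate_on_startup()` name are made up for the example, and it is not taken from any of the surveyed projects. It also has no locking, so it is only safe with a single application instance; several of the projects below add a lock or move the step into a dedicated job for exactly that reason.

```python
import sqlite3

# Ordered migration chain; position N upgrades the schema from version N to N + 1.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]


def migrate_on_startup(conn: sqlite3.Connection) -> int:
    """Apply any migrations newer than the stored schema version; return the new version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version in range(current, len(MIGRATIONS)):
        conn.execute(MIGRATIONS[version])
        conn.execute(f"PRAGMA user_version = {version + 1}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
first_boot = migrate_on_startup(conn)   # applies both migrations
second_boot = migrate_on_startup(conn)  # nothing left to apply
print(first_boot, second_boot)
```

Calling the function on every boot is what makes the pattern convenient: an already-current database simply skips the loop.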

The rest of this blog post is the supporting data I generated with some AI tooling. I made sure to include links to source code, for verifying accuracy. I spot checked a few and they were all accurate – but I didn’t go through every single project.

If you spot errors, please let me know!! I’ll update the blog.

Update Mar 26: On LinkedIn, Elizabeth Christensen mentioned last year’s virtual meetup about this topic [recording available on YouTube]. And I hadn’t originally mentioned it in this post, but probably worth pointing out that I think the three broad categories of schema change management tools are: (1) app frameworks [which this blog is focused on], (2) DB-agnostic [liquibase, flyway, sqitch, atlasgo, etc] and (3) DB-specific [pgroll, oracle edition based redefinition, etc] – it’s a very interesting landscape!

A survey of how major open-source projects handle database schema migrations. Each project includes a real code example and how migrations are triggered during upgrades.

Kubernetes Migration Trigger Methods

Projects with no official Helm chart or k8s support (Mastodon, Discourse, Sentry, Zulip, NetBox, Metabase, Lemmy, MediaWiki, Matrix Synapse†, CHT Core, Signal Server, Firefox, Chromium, Signal Desktop, FDB Record Layer, RxDB) are omitted.

Trigger Method	Projects
Dedicated k8s Job (Helm hook)	GitLab (post-deploy), Airflow (post-install/upgrade), Superset (post-install/upgrade), Temporal (pre-deploy), Kong (pre-install), Jaeger (pre-deploy), ThingsBoard (install only; upgrades require a separate manual pod)
Init container in pod spec	Gitea (official chart runs `gitea migrate` in init container before main container starts)
App process migrates on pod startup	Ghost, Backstage, Keycloak, Grafana, Mattermost, Odoo, Parse Server, Appsmith, Rocket.Chat, Graylog
Triggered by action against running process	WordPress (first admin HTTP request), Kubernetes (`StorageVersionMigration` CRD triggers in-cluster controller), Dgraph (`POST /admin` API call; async index rebuild)
Manual operator action	Calico (`calico-upgrade` CLI), Neo4j-Migrations (`neo4j-migrations migrate` CLI), Nextcloud (`occ upgrade` via exec or Job), Zipkin (SQL DDL applied before deploy), APISIX (no tooling; manual etcd data transformation)
No migration needed	Cortex (schema versioned in YAML config; new period appended and deployed, old data untouched)

† Matrix Synapse has no official Helm chart from Element; the widely-used community chart (ananace/matrix-synapse) relies on in-process startup migration.

Part 1: Relational

1A. External Migration Frameworks

Projects	Language	Migration Framework	Trigger
GitLab, Mastodon, Discourse	Ruby	Rails ActiveRecord	GitLab: dedicated k8s Job (Helm). Mastodon: manual two-phase CLI; no official Helm chart. Discourse: launcher script runs `rake db:migrate` during rebuild; no official Helm chart.
Sentry, Zulip, NetBox	Python	Django Migrations	Sentry: `sentry upgrade` CLI (acquires distributed lock; post-deployment migrations must be run separately); official self-hosted is docker-compose only, no official Helm chart. Zulip: `scripts/upgrade-zulip` script; no official Helm chart, typically deployed on VMs. NetBox: container entrypoint script runs `manage.py migrate` on container start (netbox-docker); no official Helm chart.
Airflow, Superset	Python	Alembic	Both: dedicated k8s Job as Helm post-install/post-upgrade hook.
Ghost, Backstage	JavaScript, TypeScript	Knex.js	Both: app code calls migration runner on startup. Both have official Helm charts (Bitnami for Ghost, backstage/charts for Backstage); migrations run in-process at pod startup, no separate job.
Keycloak, Metabase	Java, Clojure	Liquibase	Both: app code calls Liquibase on startup. Keycloak: `DefaultJpaConnectionProviderFactory`; official Helm chart (Bitnami) and k8s Operator exist, auto-migrates at pod startup. Metabase: `setup-db!` (custom Clojure macros wrap Liquibase changesets); no official Helm chart.
Lemmy	Rust	Diesel	App code calls `run_pending_migrations()` on startup (before pool is returned). No official Helm chart; typically deployed via docker-compose.
Gitea	Go	XORM	Official Helm chart exists; init container explicitly runs `gitea migrate` before the main container starts (not relying on auto-migration). `AUTO_MIGRATION=false` can disable the in-process fallback.
Nextcloud	PHP	Doctrine DBAL	`occ upgrade` CLI or web-based updater; not automatic. Official Helm chart exists (nextcloud/helm); init containers only wait for DB readiness. `occ upgrade` must be run manually (e.g., exec into pod).

1B. Custom Migration Systems

Projects	Language	Migration Approach	Trigger
Grafana, Mattermost	Go	Custom Go	Both: app code calls migration runner on startup. Both have official Helm charts (grafana-community/helm-charts, mattermost/mattermost-helm); migrations run in-process at pod startup, no separate job. Mattermost also has an offline `mattermost db migrate` CLI with `--dry-run`.
WordPress, MediaWiki	PHP	Custom PHP	WordPress: app code runs on first admin page HTTP request after update; Bitnami Helm chart exists, auto-migration works in k8s. MediaWiki: manual `php maintenance/update.php`; no official Helm chart (Wikimedia uses an internal helmfile).
Odoo, Parse Server	Python, JavaScript	Declarative + scripts	Odoo: `odoo -u` CLI; Bitnami Helm chart exists, migration triggered at pod startup via env var. Parse Server: app code runs schema reconciliation on startup; Bitnami chart available, auto-migrates at pod startup.
Matrix Synapse	Python	Custom Python	App code applies delta scripts on startup (main process only; worker processes refuse to start if schema is behind). No official Helm chart from Element; the widely-used community chart (ananace/matrix-synapse) also relies on in-process startup migration.
Temporal, Kong	Go, Lua	Custom multi-DB	Both: dedicated CLI tool (`temporal-sql-tool update-schema`, `kong migrations up`) + dedicated k8s Job in Helm chart.
Zipkin, ThingsBoard	Java	Custom multi-DB	Zipkin: manual SQL file application before starting the server; Bitnami Helm chart exists but provides no migration automation. ThingsBoard: official Helm chart with a dedicated `initializedb` k8s Job for fresh installs; upgrades require running a separate pod with `UPGRADE_TB=true`.
Firefox, Chromium, Signal Desktop	C++, C++, TypeScript	Desktop SQLite	App code runs sequential version chain on startup. Desktop applications; Kubernetes not applicable.

Part 2: Non-Relational Only

2A. External Migration Frameworks

Projects	Language	Migration Framework	Trigger
Appsmith (server-side)	Java	Mongock (MongoDB)	App code runs migrations on startup via Spring Boot auto-configuration (`MongockInitializingBeanRunner`). Official Helm chart exists; init containers only wait for dependencies (MongoDB, Redis) to be ready — migrations run in the main application process.
FDB Record Layer	Java	(framework itself, powers iCloud)	Programmatic — `FDBRecordStore.open()` checks stored metadata version on each store open. A library, not a deployable service; Kubernetes not applicable.
Neo4j-Migrations	Java	(tool itself)	`neo4j-migrations migrate` CLI, or app code on startup via Spring Boot `InitializingBean`. Neo4j has an official Helm chart (neo4j/helm-charts); this tool is not bundled in it and must be run separately (e.g., as a k8s Job).

2B. Custom Migration Systems

Projects	Language	Migration Approach	Trigger
Kubernetes, Calico, Vitess, APISIX	Go, Go, Go, Lua	etcd / protobuf	K8s: `StorageVersionMigration` CRD triggers an in-cluster controller. Calico: operator-run `calico-upgrade` CLI tool. Vitess: topology schema evolves via additive protobuf field changes — no migration tooling needed. APISIX: fully manual, no tooling provided.
Jaeger	Go	Cassandra CQL	Dedicated k8s Job using `jaeger-cassandra-schema` Docker image, run before deploying Jaeger.
Rocket.Chat, Appsmith (DSL), Graylog	TypeScript, TypeScript, Java	MongoDB	Rocket.Chat/Graylog: app code runs migrations on startup; both have official Helm charts (RocketChat/helm-charts, Graylog2/graylog-helm), migrations run in-process at pod startup. Appsmith DSL: browser-side on every page load (migrations never written back to server); Kubernetes not applicable.
RxDB, CHT Core	TypeScript, JavaScript	CouchDB / offline-first	RxDB: client-side library; runs migrations per-device when a collection is opened; Kubernetes not applicable. CHT: app-level migrations run as part of API server startup; cluster-level migration requires a separate manual Docker tool; no official k8s support, deployed via docker-compose.
Cortex, Signal Server	Go, Java	DynamoDB	Cortex: time-partitioned schema config — no data rewrite, old tables coexist; official Helm chart exists (cortex-helm-chart), schema changes are config-file updates with no migration job. Signal Server: no explicit migration; new fields written into JSON blob on next update; tables provisioned via IaC; not publicly self-hosted, no Helm chart.
Dgraph	Go	Graph DB	`POST /admin` endpoint or `dgraph live --schema`; async background goroutine reindexes affected predicates. Official Helm chart exists (dgraph-io/charts); schema updates are pushed to running pods via Admin API.

1A. Relational Database Support: Using External Migration Frameworks

Rails ActiveRecord Migrations

GitLab

Background migrations, batched migrations, migration helpers, extensive written policies for safe migrations at scale. Uses a custom Gitlab::Database::Migration base class rather than stock ActiveRecord.

Trigger: Dedicated Kubernetes Job in the official Helm chart (charts/gitlab/charts/migrations/). The job runs /scripts/db-migrate and must complete before web/Sidekiq pods roll out. Migrations never run automatically on app startup. (Helm chart)

Example — creating a partitioned table with sparse indexes (source):

class CreateWorkItemTransitions < Gitlab::Database::Migration[2.3]
  milestone '18.3'

  def up
    create_table :work_item_transitions, id: false do |t|
      t.bigint :work_item_id, primary_key: true, default: nil
      t.bigint :namespace_id, null: false
      t.bigint :moved_to_id, null: true
      t.index :moved_to_id, where: 'moved_to_id IS NOT NULL',
        name: 'index_work_item_transitions_on_moved_to_id'
    end
  end
end

Mastodon

Federated: thousands of independently-upgraded instances, so migrations must be safe across version skew.

Trigger: NOT automatic. Admins run migrations manually in two phases — SKIP_POST_DEPLOYMENT_MIGRATIONS=true rails db:migrate (before restart) then rails db:migrate (after). No official Helm chart.

Example — data migration translating theme settings to new key/value pairs (source):

class MigrateUserTheme < ActiveRecord::Migration[8.0]
  disable_ddl_transaction!
  class User < ApplicationRecord; end

  def up
    User.where.not(settings: nil).find_each do |user|
      settings = JSON.parse(user.attributes_before_type_cast['settings'])
      case settings['theme']
      when 'default'
        settings['web.color_scheme'] = 'dark'
      when 'mastodon-light'
        settings['web.color_scheme'] = 'light'
      end
      user.update_column('settings', JSON.generate(settings))
    end
  end
end

Discourse

Mature self-hosted forum with a disciplined migration history over 10+ years.

Trigger: discourse_docker launcher runs bundle exec rake db:migrate during ./launcher rebuild app. Auto-migration on every boot is off by default, controlled by MIGRATE_ON_BOOT env var.

Example — DDL + data backfill in one migration (source):

class CreateCategoryApprovalGroups < ActiveRecord::Migration[8.0]
  def up
    create_table :category_posting_review_groups do |t|
      t.integer :post_type, null: false
      t.integer :permission, null: false # referenced by the backfill INSERT below
      t.integer :category_id, null: false
      t.integer :group_id, null: false
      t.timestamps null: false
    end
    # Backfill from existing category_settings
    execute(<<~SQL)
      INSERT INTO category_posting_review_groups (post_type, permission, category_id, group_id, created_at, updated_at)
      SELECT 0, 1, cs.category_id, 0, NOW(), NOW()
      FROM category_settings cs WHERE cs.require_topic_approval = true
    SQL
  end
end

Django Migrations

Sentry

Operates at massive scale; has written extensively about the pain of running Django migrations on huge tables.

Trigger: sentry upgrade CLI (called via docker compose run --rm web upgrade in self-hosted). Acquires a distributed lock. Migrations marked is_post_deployment = True are skipped and must be run manually in a separate step.

Example — post-deployment data backfill with Redis progress checkpointing (source):

class Migration(CheckedMigration):
    is_post_deployment = True  # won't auto-run during deploy

    operations = [
        migrations.RunPython(
            backfill_group_open_periods,
            migrations.RunPython.noop,
            hints={"tables": ["sentry_groupopenperiod"]},
        ),
    ]

Zulip

Well-engineered chat server, known for thoughtful code quality and contributor documentation.

Trigger: scripts/upgrade-zulip → upgrade-zulip-stage-3 stops the server, runs manage.py migrate --noinput, then restarts. Not automatic on startup.

Example — data-repair migration using raw SQL lateral join across JSONB audit log (source):

class Migration(migrations.Migration):
    atomic = False  # outside a transaction for large repair work

    operations = [
        migrations.RunPython(
            recreate_missing_realmemoji,
            elidable=True,
        ),
    ]

NetBox

Network infrastructure management tool widely used by network teams. Large plugin ecosystem extends the schema.

Trigger: Auto on container startup. netbox-docker entrypoint checks manage.py migrate --check and runs manage.py migrate --no-input if unapplied migrations exist.

Example — AddField + RunPython backfill + RemoveField in one migration (source):

operations = [
    migrations.AddField(
        model_name='vminterface', name='primary_mac_address',
        field=models.OneToOneField(null=True, on_delete=models.SET_NULL, to='dcim.macaddress'),
    ),
    migrations.RunPython(code=populate_mac_addresses, reverse_code=migrations.RunPython.noop),
    migrations.RemoveField(model_name='vminterface', name='mac_address'),
]

Alembic (SQLAlchemy)

Airflow

Apache project, widely deployed in very different operator environments.

Trigger: Dedicated Kubernetes Job (migrate-database-job.yaml) as a Helm post-install/post-upgrade hook running airflow db migrate. All other pods have a wait-for-airflow-migrations init container that blocks until the job completes. (Helm chart)

Example (source):

revision = "53ff648b8a26"
down_revision = "a5a3e5eb9b8d"

def upgrade():
    op.create_table(
        "revoked_token",
        sa.Column("jti", sa.String(32), primary_key=True, nullable=False),
        sa.Column("exp", UtcDateTime, nullable=False, index=True),
    )

def downgrade():
    op.drop_table("revoked_token")
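
The "block until the job completes" behaviour of the init container can be approximated with a polling loop. This is a sketch under assumptions: `wait_for_migrations` and its parameters are hypothetical names, not Airflow's API, and the database is simulated with an iterator. The revision IDs are the ones from the example above.

```python
import time


def wait_for_migrations(get_current_revision, expected_revision,
                        timeout_s=60.0, poll_s=1.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Block until the database reports the expected revision.

    Returns True once the revisions match, False if the timeout expires.
    """
    deadline = clock() + timeout_s
    while True:
        if get_current_revision() == expected_revision:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Simulated database: the migration job finishes on the third poll.
revisions = iter(["a5a3e5eb9b8d", "a5a3e5eb9b8d", "53ff648b8a26"])
ready = wait_for_migrations(lambda: next(revisions), "53ff648b8a26",
                            timeout_s=5.0, poll_s=0.0)
print(ready)
```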

Apache Superset

Apache data visualization platform, large contributor base.

Trigger: Dedicated Kubernetes Job (init-job.yaml) as a Helm post-install/post-upgrade hook running superset db upgrade. (Helm chart)

Example (source):

def upgrade():
    op.add_column("dbs", sa.Column("password", sa.LargeBinary(), nullable=True))

def downgrade():
    op.drop_column("dbs", "password")

Knex.js Migrations

Ghost

Popular blogging platform. Uses knex-migrator (a wrapper around Knex).

Trigger: Auto on every startup. boot.js → DatabaseStateManager checks knexMigrator.isDatabaseOK() and calls knexMigrator.migrate() if needed. (source)

Example (source):

const {addTable} = require('../../utils');

module.exports = addTable('members_created_events', {
    id:               {type: 'string', maxlength: 24, nullable: false, primary: true},
    created_at:       {type: 'dateTime', nullable: false},
    member_id:        {type: 'string', maxlength: 24, nullable: false,
                       references: 'members.id', cascadeDelete: true},
    attribution_id:   {type: 'string', maxlength: 24, nullable: true},
    source:           {type: 'string', maxlength: 50, nullable: false}
});

Backstage

Spotify-created developer portal. Plugin architecture means migrations come from many independent teams.

Trigger: Auto on startup. Each plugin’s CatalogBuilder.build() calls applyDatabaseMigrations(dbClient) → knex.migrate.latest() before any database objects are constructed. (source)

Example (source):

exports.up = async function up(knex) {
  await knex.schema.createTable('entities_relations', table => {
    table.comment('All relations between entities in the catalog');
    table.uuid('originating_entity_id')
      .references('id').inTable('entities').onDelete('CASCADE').notNullable();
    table.string('source_full_name').notNullable(); // column referenced by the primary key below
    table.string('type').notNullable();
    table.string('target_full_name').notNullable();
    table.primary(['source_full_name', 'type', 'target_full_name']);
  });
};

Liquibase (XML/YAML Changelogs)

Keycloak

Red Hat-backed identity server. Liquibase changelogs declared in XML.

Trigger: Auto on startup. DefaultJpaConnectionProviderFactory calls LiquibaseJpaUpdaterProvider.update() → liquibase.update(). No special Helm/Operator handling — pods auto-migrate. (source)

Example (source):

Metabase

BI tool written in Clojure. Uses Liquibase under the hood with custom Clojure macros (define-migration, define-reversible-migration) layered on top.

Trigger: Auto on startup. setup-db! → run-schema-migrations! → Liquibase’s migrate-up-if-needed!. (source)

Example — custom Clojure migration macro wrapping Liquibase (source):

(define-migration DeleteAbandonmentEmailTask
  (custom-migrations.util/with-temp-schedule! [scheduler]
    (qs/delete-trigger scheduler
      (triggers/key "metabase.task.abandonment-emails.trigger"))
    (qs/delete-job scheduler
      (jobs/key "metabase.task.abandonment-emails.job"))))

Diesel ORM Migrations (Rust)

Lemmy

Federated Reddit alternative. Diesel generates migration SQL files.

Trigger: Auto on every startup. build_db_pool() calls run_pending_migrations() synchronously before the pool is returned. A standalone lemmy_diesel_utils binary also allows running migrations offline. (source)

Example (source):

CREATE TABLE private_message (
    id serial PRIMARY KEY,
    creator_id int REFERENCES user_ ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
    recipient_id int REFERENCES user_ ON UPDATE CASCADE ON DELETE CASCADE NOT NULL,
    content text NOT NULL,
    deleted boolean DEFAULT FALSE NOT NULL,
    read boolean DEFAULT FALSE NOT NULL,
    published timestamp NOT NULL DEFAULT now(),
    updated timestamp
);

XORM-Based Migrations (Go)

Gitea

Git hosting platform. Migrations are Go functions using XORM for cross-database compatibility.

Trigger: Auto on startup (unless AUTO_MIGRATION=false in app.ini). InitDBEngine() → Migrate() compares the Version table against ExpectedDBVersion(). (source)

Example — typical pattern: define minimal struct, call SyncWithOptions (source):

func AddExclusiveOrderColumnToLabelTable(x *xorm.Engine) error {
    type Label struct {
        ExclusiveOrder int `xorm:"DEFAULT 0"`
    }
    _, err := x.SyncWithOptions(xorm.SyncOptions{
        IgnoreConstrains: true,
        IgnoreIndices:    true,
    }, new(Label))
    return err
}

Doctrine DBAL-Based Migrations (PHP)

Nextcloud

Huge self-hosted user base. Plugin ecosystem means third-party apps also run their own migrations.

Trigger: occ upgrade CLI command or web-based updater. Updater::doUpgrade() instantiates MigrationService and calls migrate(). Not automatic on every page load. (source)

Example — using Doctrine’s schema abstraction to change a column type (source):

class Version34000Date20260318095645 extends SimpleMigrationStep {
    public function changeSchema(IOutput $output, Closure $schemaClosure, array $options): ?ISchemaWrapper {
        $schema = $schemaClosure();
        if ($schema->hasTable('jobs')) {
            $table = $schema->getTable('jobs');
            $argumentColumn = $table->getColumn('argument');
            if ($argumentColumn->getType() !== Type::getType(Types::TEXT)) {
                $argumentColumn->setType(Type::getType(Types::TEXT));
                return $schema;
            }
        }
        return null; // idempotency guard — no change needed
    }
}

1B. Relational Database Support: Custom / In-House Migration Systems

Custom Go Migration Systems

Grafana

Must keep migrations working across 3 DB backends (SQLite, PostgreSQL, MySQL). Uses a Go DSL with per-dialect SQL dispatch.

Trigger: Auto on every startup. ProvideService() calls s.Migrate() synchronously — the server won’t start if migration fails. Official Helm chart has no migration job; relies entirely on auto-migration. (source)

Example — per-dialect SQL in one migration (source):

mg.AddMigration("Update uid column values in alert_notification", new(RawSQLMigration).
    SQLite("UPDATE alert_notification SET uid=printf('%09d',id) WHERE uid IS NULL;").
    Postgres("UPDATE alert_notification SET uid=lpad('' || id::text,9,'0') WHERE uid IS NULL;").
    Mysql("UPDATE alert_notification SET uid=lpad(id,9,'0') WHERE uid IS NULL;"))
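
For readers who don't write Go, the per-dialect dispatch idea looks roughly like this in Python. The class below is a hypothetical stand-in, not Grafana's migrator; the dialect keys are assumptions, and the SQL strings are copied from the example above.

```python
class RawSQLMigration:
    """Holds one SQL statement per dialect, built with chained calls."""

    def __init__(self):
        self._sql = {}

    def sqlite(self, sql):
        self._sql["sqlite3"] = sql
        return self

    def postgres(self, sql):
        self._sql["postgres"] = sql
        return self

    def mysql(self, sql):
        self._sql["mysql"] = sql
        return self

    def sql_for(self, dialect):
        """Return the statement for the connected backend, or fail loudly."""
        try:
            return self._sql[dialect]
        except KeyError:
            raise ValueError(f"no SQL registered for dialect {dialect!r}") from None


migration = (
    RawSQLMigration()
    .sqlite("UPDATE alert_notification SET uid=printf('%09d',id) WHERE uid IS NULL;")
    .postgres("UPDATE alert_notification SET uid=lpad('' || id::text,9,'0') WHERE uid IS NULL;")
    .mysql("UPDATE alert_notification SET uid=lpad(id,9,'0') WHERE uid IS NULL;")
)
print(migration.sql_for("postgres"))
```

Failing on an unregistered dialect (instead of silently skipping) is a design choice in this sketch: a migration that quietly does nothing on one backend is hard to notice later.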

Mattermost

Enterprise messaging. Switched from a custom Go DSL to morph (their own migration engine) with embedded .up.sql/.down.sql files.

Trigger: Auto on startup by default. sqlstore.New() calls store.migrate(). Also has an offline mattermost db migrate CLI with --dry-run and --save-plan flags for zero-downtime deploys. Helm chart has no migration job. (source)

Example (source):

UPDATE AccessControlPolicies AS p
SET Name = LEFT(p.Name, 128 - LENGTH(' (' || p.ID || ')')) || ' (' || p.ID || ')'
FROM (
    SELECT ID, Name, ROW_NUMBER() OVER (PARTITION BY Name ORDER BY CreateAt ASC) AS rn
    FROM AccessControlPolicies WHERE Type = 'parent'
) AS dupes
WHERE p.ID = dupes.ID AND dupes.rn > 1;

CREATE UNIQUE INDEX IF NOT EXISTS idx_accesscontrolpolicies_name_type
    ON AccessControlPolicies (Name, Type) WHERE Type = 'parent';

Custom PHP Systems

WordPress

Famously has no migration framework. Uses dbDelta() for schema and version-numbered upgrade functions for data. 40%+ of the web runs on this.

Trigger: Auto on first admin page load after update. wp-admin/admin.php checks get_option('db_version') vs $wp_db_version; if they differ, redirects to upgrade.php which calls each version-specific upgrade_NNN() function.

Example — version-specific data migration (source):

function upgrade_700() {
    global $wp_current_db_version, $wpdb;
    if ( $wp_current_db_version < 61644 ) {
        $wpdb->update(
            $wpdb->usermeta,
            array( 'meta_value' => 'modern' ),
            array( 'meta_key' => 'admin_color', 'meta_value' => 'fresh' )
        );
    }
}

MediaWiki

Powers Wikipedia. Has its own maintenance script system with separate SQL files per database engine.

Trigger: php maintenance/update.php must be run manually after deploying new code. Wikimedia runs this as a k8s Job in their deployment pipeline via the scap tool. Never auto-runs on web requests.

Example — batch data migration merging a temp table into the main table (source):

protected function doDBUpdates() {
    $dbw = $this->getDB( DB_PRIMARY );
    if ( !$dbw->tableExists( 'revision_comment_temp', __METHOD__ ) ) {
        $this->output( "revision_comment_temp does not exist, nothing to do.\n" );
        return true;
    }
    // Excerpt: the batch loop that reads each $row from revision_comment_temp is elided.
    // For every row it copies revcomment_comment_id → rev_comment_id:
    $dbw->newUpdateQueryBuilder()
        ->update( 'revision' )
        ->set( [ 'rev_comment_id' => $row->revcomment_comment_id ] )
        ->where( [ 'rev_id' => $row->rev_id ] )
        ->caller( __METHOD__ )->execute();
}

Declarative / ORM-Diffing + Explicit Migration Scripts

Odoo

ERP with 10,000+ modules. ORM handles additive changes declaratively; renames, transforms, and restructuring require explicit pre/post/end migration scripts. Major version upgrades use Odoo SA’s proprietary upgrade service or the community OpenUpgrade project (~120 scripts per major version).

Trigger: odoo -u <module> or -u all. MigrationManager, instantiated in loading.py, discovers migrations/<version>/pre-*.py and post-*.py files via glob and exec_module()s each script’s migrate(cr, version) function.

Example — pre-migrate script changing FK constraints (source):

def migrate(cr, version):
    cr.execute("""
        SELECT value::int FROM ir_config_parameter WHERE key = 'analytic.project_plan'
    """)
    [project_plan_id] = cr.fetchone()
    cr.execute("SELECT id FROM account_analytic_plan WHERE id != %s AND parent_id IS NULL",
               [project_plan_id])
    plan_ids = [r[0] for r in cr.fetchall()]
    for column in [f"x_plan{id_}_id" for id_ in plan_ids]:
        sql.drop_constraint(cr, 'account_analytic_line', f'account_analytic_line_{column}_fkey')
        sql.add_foreign_key(cr, 'account_analytic_line', column,
                            'account_analytic_account', 'id', 'restrict')
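
The glob-and-exec_module discovery described in the trigger can be sketched with the standard library alone. This is not Odoo's code: `run_stage`, the version string `17.0.1.1`, the script names, and the list standing in for the cursor are all made up for the demo.

```python
import glob
import importlib.util
import os
import tempfile


def run_stage(migrations_dir, version, stage, cr):
    """Load every <stage>-*.py under migrations_dir/version and call migrate()."""
    pattern = os.path.join(migrations_dir, version, f"{stage}-*.py")
    executed = []
    for path in sorted(glob.glob(pattern)):
        name = os.path.splitext(os.path.basename(path))[0].replace("-", "_")
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the script's top-level code
        module.migrate(cr, version)
        executed.append(os.path.basename(path))
    return executed


# Demo with a throwaway directory and a list standing in for the cursor.
with tempfile.TemporaryDirectory() as root:
    version_dir = os.path.join(root, "17.0.1.1")
    os.makedirs(version_dir)
    for filename in ("pre-rename.py", "post-backfill.py"):
        with open(os.path.join(version_dir, filename), "w") as f:
            f.write("def migrate(cr, version):\n"
                    f"    cr.append(({filename!r}, version))\n")
    fake_cr = []
    ran_pre = run_stage(root, "17.0.1.1", "pre", fake_cr)
    ran_post = run_stage(root, "17.0.1.1", "post", fake_cr)
    print(ran_pre, ran_post)
```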

Parse Server

Backend-as-a-Service (originally Facebook, 21k stars). No numbered migration scripts — declarative schema reconciliation at startup.

Trigger: Auto on startup. ParseServer.start() adds new DefinedSchemas(schema, config).execute() to startupPromises. Server won’t accept traffic until reconciliation completes (or process.exit(1) in production on failure). (source)

Example — schema reconciliation engine (source):

async executeMigrations() {
  await this.createDeleteSession();
  const schemaController = await this.config.database.loadSchema();
  this.allCloudSchemas = await schemaController.getAllClasses();
  await Promise.all(
    this.localSchemas.map(async localSchema => this.saveOrUpdate(localSchema))
  );
  this.checkForMissingSchemas();
  await this.enforceCLPForNonProvidedClass();
}
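
To make "declarative reconciliation" concrete, here is a minimal sketch of the diffing step: compare the schema declared in code against what the database reports and derive the operations. The function name, the plan format, and the sample classes are assumptions for illustration; Parse Server's own logic also handles indexes and class-level permissions, which this ignores.

```python
def plan_reconciliation(desired, existing):
    """Compare desired vs. stored schemas; return per-class operations."""
    plan = {}
    for class_name, fields in desired.items():
        if class_name not in existing:
            plan[class_name] = {"create": dict(fields)}
            continue
        current = existing[class_name]
        to_add = {k: v for k, v in fields.items() if k not in current}
        to_drop = sorted(k for k in current if k not in fields)
        if to_add or to_drop:
            plan[class_name] = {"add": to_add, "drop": to_drop}
    return plan


# Hypothetical schemas: class name -> {field name: type}.
desired = {
    "Post": {"title": "String", "views": "Number"},
    "Tag": {"name": "String"},
}
existing = {"Post": {"title": "String", "legacy": "String"}}
plan = plan_reconciliation(desired, existing)
print(plan)
```

Because the plan is derived from current state every time, there is no migration history to track, which is also why renames and data transforms don't fit this model well.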

Custom Python (Delta Scripts)

Matrix Synapse

Federated messaging server. Numbered SQL/Python delta scripts. Federation means different homeservers run different versions simultaneously.

Trigger: Auto on startup (main process only). prepare_database() reads schema_version and applies all pending delta scripts. Worker processes refuse to start if schema is unmigrated — only the main process is permitted to apply changes. (source)

Example (source):

CREATE TABLE sliding_sync_connection_lazy_members (
    connection_key BIGINT NOT NULL
        REFERENCES sliding_sync_connections(connection_key) ON DELETE CASCADE,
    room_id TEXT NOT NULL,
    user_id TEXT NOT NULL,
    last_seen_ts BIGINT NOT NULL
);

CREATE UNIQUE INDEX sliding_sync_connection_lazy_members_idx
    ON sliding_sync_connection_lazy_members (connection_key, room_id, user_id);
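
The "only the main process migrates, workers refuse to start" rule can be sketched like this. Only the names `prepare_database` and `schema_version` come from the trigger description; the signature, the in-memory `state` dict, and the version numbers are hypothetical.

```python
class UnmigratedSchemaError(RuntimeError):
    pass


def prepare_database(state, deltas, is_main_process):
    """Apply pending deltas on the main process; refuse to start elsewhere.

    `state` is a dict with a "schema_version" key; `deltas` maps a version
    number to a callable that upgrades the database to that version.
    """
    pending = sorted(v for v in deltas if v > state["schema_version"])
    if not pending:
        return []
    if not is_main_process:
        raise UnmigratedSchemaError(
            f"schema is at {state['schema_version']}, expected {pending[-1]}")
    for version in pending:
        deltas[version](state)
        state["schema_version"] = version
    return pending


state = {"schema_version": 91, "tables": []}
deltas = {
    92: lambda s: s["tables"].append("sliding_sync_connection_lazy_members"),
}
try:
    prepare_database(state, deltas, is_main_process=False)
except UnmigratedSchemaError as exc:
    print("worker refused to start:", exc)
print(prepare_database(state, deltas, is_main_process=True))
```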

Custom Versioned Scripts (Multi-DB, including non-relational)

Temporal

Workflow orchestration engine. Versioned SQL scripts per database backend.

Trigger: temporal-sql-tool update-schema CLI. In k8s, runs as a dedicated Kubernetes Job in the official Helm chart (charts/temporal/templates/server-job.yaml). (Helm chart)

Example (source):

CREATE TABLE visibility_tasks(
  shard_id INTEGER NOT NULL,
  task_id BIGINT NOT NULL,
  data BYTEA NOT NULL,
  data_encoding VARCHAR(16) NOT NULL,
  PRIMARY KEY (shard_id, task_id)
);

Kong

Popular API gateway (43k stars). Custom Lua migration framework. Deprecated Cassandra in 2.7 and removed it in 3.4. Now PostgreSQL-only.

Trigger: kong migrations bootstrap (fresh install) / kong migrations up + kong migrations finish (upgrades). In k8s, runs as a dedicated Kubernetes Job in the official Helm chart. (Helm chart)

Example (source):

return {
  postgres = {
    up = [[
      DO $$
      BEGIN
        ALTER TABLE IF EXISTS ONLY "plugins" ADD "protocols" TEXT[];
      EXCEPTION WHEN DUPLICATE_COLUMN THEN
        -- Do nothing, accept existing state
      END;
      $$;

      CREATE TABLE IF NOT EXISTS "tags" (
        entity_id    UUID    PRIMARY KEY,
        entity_name  TEXT,
        tags         TEXT[]
      );
    ]],
  },
}

Zipkin

The original distributed tracing system (17k stars, since 2012). Bundles versioned CQL and SQL schema files.

Trigger: Schema must be applied manually before running Zipkin — mysql < mysql.sql. Zipkin does not auto-apply schema on startup; it introspects existing tables but does not create or alter them. (docs)

Example (source):

CREATE TABLE IF NOT EXISTS zipkin_spans (
  `trace_id_high` BIGINT NOT NULL DEFAULT 0,
  `trace_id`      BIGINT NOT NULL,
  `id`            BIGINT NOT NULL,
  `name`          VARCHAR(255) NOT NULL,
  `start_ts`      BIGINT,
  `duration`      BIGINT,
  PRIMARY KEY (`trace_id_high`, `trace_id`, `id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED CHARACTER SET=utf8;

ThingsBoard

IoT platform (21k stars). Uses Cassandra for time-series telemetry, PostgreSQL for relational data.

Trigger: upgrade.sh script invokes ThingsboardInstallApplication (a separate Spring Boot entry point, not the normal server) with --fromVersion flag. Docker: docker compose run --rm -e UPGRADE_TB=true. (source)

Example (source):

ALTER TABLE calculated_field
  ADD COLUMN IF NOT EXISTS additional_info varchar;

Desktop SQLite Migrations

These projects run migrations on end-user machines — across hundreds of millions of installations, with no DBA watching, no rollback capability, and users who may skip many versions between upgrades.

Firefox

Migrates bookmarks, history, cookies, permissions databases in C++/Rust.

Trigger: Auto on startup. InitSchema() reads GetSchemaVersion() and runs sequential MigrateVNUp() functions inside a transaction. Failure prevents Places from loading.

Example — adding a column and backfilling it (source):

nsresult Database::MigrateV54Up() {
  nsCOMPtr<mozIStorageStatement> stmt;
  nsresult rv = mMainConn->CreateStatement(
      "SELECT expire_ms FROM moz_icons_to_pages"_ns, getter_AddRefs(stmt));
  if (NS_FAILED(rv)) {
    rv = mMainConn->ExecuteSimpleSQL(
        "ALTER TABLE moz_icons_to_pages "
        "ADD COLUMN expire_ms INTEGER NOT NULL DEFAULT 0 "_ns);
    NS_ENSURE_SUCCESS(rv, rv);
  }
  rv = mMainConn->ExecuteSimpleSQL(
      "UPDATE moz_icons_to_pages SET expire_ms = "
      "strftime('%s','now','localtime','start of day','utc') * 1000 "
      "WHERE expire_ms = 0 "_ns);
  NS_ENSURE_SUCCESS(rv, rv);
  return NS_OK;
}
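
The same "read the stored version, run sequential steps in one transaction" pattern is easy to try with Python's built-in sqlite3 module and `PRAGMA user_version`. This is a sketch of the pattern only, not Places code: `init_schema`, the `MIGRATIONS` table, and the two-column demo table are assumptions.

```python
import sqlite3


def add_expire_ms(conn):
    conn.execute("ALTER TABLE moz_icons_to_pages "
                 "ADD COLUMN expire_ms INTEGER NOT NULL DEFAULT 0")


MIGRATIONS = {1: add_expire_ms}  # target version -> migration step


def init_schema(conn, migrations=MIGRATIONS):
    """Run pending steps in order; roll everything back if one fails."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    conn.execute("BEGIN")
    try:
        for version in sorted(v for v in migrations if v > current):
            migrations[version](conn)
            conn.execute(f"PRAGMA user_version = {int(version)}")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return conn.execute("PRAGMA user_version").fetchone()[0]


# isolation_level=None: we issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE moz_icons_to_pages (page_id INTEGER, icon_id INTEGER)")
print(init_schema(conn))
```

Keeping the version bump inside the same transaction as the schema change means a crash mid-upgrade leaves the database at the old version with the old schema, so the next startup simply retries.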

Chromium

Same problem as Firefox, different implementation. Sequential if (cur_version == N) blocks.

Trigger: Auto on startup. HistoryDatabase::Init() → EnsureCurrentVersion() runs each version block up to the current version (70+). Version too new → INIT_TOO_NEW; migration failure → INIT_FAILURE.

Example (source):

if (cur_version == 15) {
  if (!db_.Execute("DROP TABLE starred") || !DropStarredIDFromURLs())
    return LogMigrationFailure(15);
  ++cur_version;
  std::ignore = meta_table_.SetVersionNumber(cur_version);
  std::ignore = meta_table_.SetCompatibleVersionNumber(
      std::min(cur_version, kCompatibleVersionNumber));
}

Signal Desktop

Encrypted SQLite database (SQLCipher). Migrations in TypeScript.

Trigger: Auto on startup. ts/sql/Server.node.ts opens the encrypted DB, then calls updateSchema(db, logger) which iterates SCHEMA_VERSIONS and applies each pending migration in a transaction. Only the primary worker runs migrations. (source)

Example (source):

import type { Database } from '@signalapp/sqlcipher';

export default function updateToSchemaVersion1090(db: Database): void {
  db.exec(`
    CREATE INDEX reactions_messageId ON reactions (messageId);
    CREATE INDEX storyReads_storyId ON storyReads (storyId);
  `);
}

2A. No Relational Database Support: Using External Migration Frameworks

Very few non-relational projects use an external migration framework — the ecosystem of reusable tooling is much thinner than in the relational world.

Mongock (MongoDB Migration Framework for Java)

Appsmith (server-side)

Low-code platform. Server-side uses Mongock with @ChangeUnit annotations. Also has a separate client-side DSL migration system (see 2B).

Trigger: Auto on Spring Boot startup. Mongock runs as a MongockInitializingBeanRunner bean, scanning for @ChangeUnit classes and executing them in order. Helm chart has no init container for migrations — they run inside the main app container. (source)

Example — converting a policies array to a keyed policyMap across 22 collections (source):

@ChangeUnit(order = "059", id = "policy-set-to-policy-map")
public class Migration059PolicySetToPolicyMap {
    private final ReactiveMongoTemplate mongoTemplate;

    @Execution
    public void execute() {
        Mono.whenDelayError(CE_COLLECTION_NAMES.stream()
                .map(c -> executeForCollection(mongoTemplate, c))
                .toList())
            .block();
    }
    // Uses ArrayToObject aggregation to transform policies[] → policyMap{}
}

FoundationDB Record Layer (Protobuf Schema Evolution)

FoundationDB Record Layer

Apple’s Java library powering iCloud/CloudKit — billions of independent databases sharing thousands of schemas. SIGMOD 2019 paper.

Trigger: Programmatic. Library consumers call FDBRecordStore.Builder#open() or #checkVersion(). A UserVersionChecker callback compares the stored metadata version in the database header against the current code’s metadata version and decides how to proceed. (source)

Example — adding a field to a record type via MetaDataProtoEditor (source):

public static void addField(@Nonnull RecordMetaDataProto.MetaData.Builder metaDataBuilder,
                            @Nonnull String recordType,
                            @Nonnull DescriptorProtos.FieldDescriptorProto field) {
    DescriptorProtos.DescriptorProto.Builder messageType =
        findMessageTypeByName(metaDataBuilder.getRecordsBuilder(), recordType);
    if (messageType == null) {
        throw new MetaDataException("Record type " + recordType + " does not exist");
    }
    messageType.addField(field);
}

And the evolution validator (source):

public void validate(@Nonnull RecordMetaData oldMetaData, @Nonnull RecordMetaData newMetaData) {
    if (oldMetaData.getVersion() > newMetaData.getVersion()) {
        throw new MetaDataException("new meta-data does not have newer version");
    }
    validateUnion(oldMetaData.getUnionDescriptor(), newMetaData.getUnionDescriptor());
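    // typeRenames comes from getTypeRenames(...); its arguments are elided in this excerpt
    Map<String, String> typeRenames = getTypeRenames(...);
    validateRecordTypes(oldMetaData, newMetaData, typeRenames);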
    validateCurrentAndFormerIndexes(oldMetaData, newMetaData, typeRenames);
}

Neo4j-Migrations (Flyway-inspired for Graph DBs)

Neo4j-Migrations

Canonical migration tool for the Neo4j ecosystem. Migrations are Cypher scripts or Java classes.

Trigger: Two paths: (1) neo4j-migrations migrate CLI, (2) Spring Boot auto-configuration — MigrationsInitializer implements InitializingBean and calls migrations.apply(true) in afterPropertiesSet(). (source)

Example — Cypher migration file, Flyway naming convention (source):

MATCH (n:BrokenData) DETACH DELETE n;

The migration runner (source):

private void apply0(List<Migration> migrations) {
    MigrationChain chain = this.chainBuilder.buildChain(this.context, migrations);
    for (Migration migration : IterableMigrations.of(this.config, migrations, optionalStop)) {
        migration.apply(this.context);
        recordApplication(chain.getUsername(), previousVersion, migration, executionTime);
    }
}

2B. No Relational Database Support: Custom / In-House Migration Systems

Almost every non-relational project has built its own migration infrastructure.

KV Store / Protobuf Schema Evolution (etcd)

Kubernetes

Objects stored as protobufs in etcd. When the storage version for a resource type changes, existing objects need re-encoding.

Trigger: Create a StorageVersionMigration custom resource. The kube-storage-version-migrator controller watches for these objects and does a paginated no-op PUT on every object of the targeted resource type, causing the API server to re-serialize each one in the new storage version. Deployed as a standalone in-cluster controller.

Example — API version conversion function for Deployments (source):

func Convert_v1_Deployment_To_apps_Deployment(in *appsv1.Deployment, out *apps.Deployment, s conversion.Scope) error {
    if err := autoConvert_v1_Deployment_To_apps_Deployment(in, out, s); err != nil {
        return err
    }
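    // The deprecated rollbackTo field round-trips through an annotation; restore it here
    if revision := in.Annotations[appsv1.DeprecatedRollbackTo]; revision != "" {
        revision64, err := strconv.ParseInt(revision, 10, 64)
        if err != nil {
            return fmt.Errorf("failed to parse annotation[%s]=%s as int64: %v", appsv1.DeprecatedRollbackTo, revision, err)
        }
        out.Spec.RollbackTo = &apps.RollbackConfig{Revision: revision64}
        delete(out.Annotations, appsv1.DeprecatedRollbackTo)
    }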
    return nil
}
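
The "paginated no-op PUT" idea from the trigger is simple enough to simulate. Everything here is a toy: `FakeStore` stands in for the API server plus etcd, and the continue token is just the next object name, which is not how the real API encodes it.

```python
class FakeStore:
    """In-memory stand-in for an API server backed by etcd."""

    def __init__(self, objects, target_version):
        self.objects = dict(objects)  # name -> {"storedAs": ..., "spec": ...}
        self.target_version = target_version

    def list_page(self, limit, continue_token=None):
        names = sorted(self.objects)
        start = names.index(continue_token) if continue_token else 0
        page = names[start:start + limit]
        next_token = names[start + limit] if start + limit < len(names) else None
        return [(n, self.objects[n]) for n in page], next_token

    def put(self, name, obj):
        # Writing re-encodes the object in the current storage version.
        self.objects[name] = {**obj, "storedAs": self.target_version}


def migrate_storage_version(store, page_size=2):
    """Write every object back unchanged, one page at a time."""
    token, rewritten = None, 0
    while True:
        page, token = store.list_page(page_size, token)
        for name, obj in page:
            store.put(name, obj)  # no-op update: same content, new encoding
            rewritten += 1
        if token is None:
            return rewritten


store = FakeStore(
    {f"deploy-{i}": {"storedAs": "v1beta1", "spec": {"replicas": i}} for i in range(5)},
    target_version="v1",
)
rewritten = migrate_storage_version(store)
print(rewritten, {o["storedAs"] for o in store.objects.values()})
```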

Calico

Underwent a major data model overhaul from v2 to v3. Built a dedicated calico-upgrade migration tool.

Trigger: calico-upgrade start CLI. Operator-initiated one-time migration with four phases: dry-run, start (pauses networking, converts all v1 objects to v3), complete, abort. (source)

Example — policy name conversion for etcd storage (source):

func convertPolicyNameForStorage(name string) string {
    if strings.HasPrefix(name, "knp.") {
        return name // Kubernetes-native policies keep their prefix
    }
    return "default." + name // Calico policies stored under "default" tier
}

Vitess

CNCF Graduated MySQL clustering system (powers PlanetScale, Slack, GitHub). Stores topology metadata (keyspaces, shards, tablets, routing rules) as proto3 binary blobs in etcd. Schema evolution happens via standard protobuf rules — fields are only added, never removed or reordered — so stored objects remain readable across versions without any migration step. The topo2topo tool exists to copy topology between different backends (e.g., ZooKeeper → etcd) but this is a backend replacement, not a schema migration.

Trigger: No migration tooling needed. The protobuf encoding is forward- and backward-compatible by construction.

Example — protobuf-encoded topology object read from etcd (source):

func CopyKeyspaces(ctx context.Context, fromTS, toTS *topo.Server, parser *sqlparser.Parser) error {
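    keyspaces, err := fromTS.GetKeyspaces(ctx)
    if err != nil {
        return fmt.Errorf("GetKeyspaces: %w", err)
    }
    for _, keyspace := range keyspaces {
        ki, err := fromTS.GetKeyspace(ctx, keyspace)
        if err != nil {
            return fmt.Errorf("GetKeyspace(%v): %w", keyspace, err)
        }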
        if err := toTS.CreateKeyspace(ctx, keyspace, ki.Keyspace); err != nil {
            if topo.IsErrType(err, topo.NodeExists) {
                log.Warn(fmt.Sprintf("keyspace %v already exists", keyspace))
            }
        }
    }
    return nil
}

Apache APISIX

Cloud-native API gateway. All dynamic runtime config (routes, upstreams, plugins, SSL certs) stored in etcd; static node config (listen ports, worker processes) remains in config.yaml on disk. The 2.x → 3.0 upgrade had incompatible etcd data structure changes with no automated migration.

Trigger: Entirely manual. etcdctl snapshot save, then either write custom scripts to transform JSON values in-place, or reconfigure from scratch via the 3.0 Admin API. No migration tooling provided. (docs)

Example — the breaking disable field relocation:

// 2.15.x — "disable" is top-level in each plugin
{ "plugins": { "limit-count": { "count": 2, "disable": true } } }

// 3.0.0 — "disable" must be nested under "_meta"
{ "plugins": { "limit-count": { "count": 2, "_meta": { "disable": true } } } }
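
Since no tooling is provided, the "custom script" route for this particular field might look like the sketch below. It only handles the `disable` relocation shown above, and only on a `plugins` object you have already read out of etcd and decoded; `relocate_disable` is a hypothetical helper, and the other 3.0 breaking changes are not covered.

```python
import copy
import json


def relocate_disable(plugins):
    """Return a copy of a 2.x `plugins` object with `disable` moved under `_meta`."""
    migrated = copy.deepcopy(plugins)
    for conf in migrated.values():
        if isinstance(conf, dict) and "disable" in conf:
            conf.setdefault("_meta", {})["disable"] = conf.pop("disable")
    return migrated


old = json.loads('{"plugins": {"limit-count": {"count": 2, "disable": true}}}')
new = {**old, "plugins": relocate_disable(old["plugins"])}
print(json.dumps(new))
```

Working on a deep copy keeps the original value around for comparison, which is handy when you are transforming production config by hand.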

Cassandra Schema Migrations (Cassandra-only projects)

Jaeger

CNCF Graduated distributed tracing. Versioned CQL templates parameterized by environment variables.

Trigger: create.sh shell script performs variable substitution and pipes CQL to cqlsh. In k8s, runs as a one-time Kubernetes Job using the jaegertracing/jaeger-cassandra-schema Docker image before deploying Jaeger. (k8s manifest)

Example (source):

CREATE TYPE IF NOT EXISTS ${keyspace}.keyvalue (
    key          text,
    value_type   text,
    value_string text,
    value_bool   boolean,
    value_long   bigint,
    value_double double,
    value_binary blob
);

CREATE TABLE IF NOT EXISTS ${keyspace}.traces (
    trace_id        blob,
    span_id         bigint,
    span_hash       bigint,
    operation_name  text,
    start_time      bigint,
    duration        bigint,
    PRIMARY KEY (trace_id, span_id, span_hash)
);

MongoDB Document Migrations

Rocket.Chat

Team chat platform (45k stars). 300+ migrations. Control document tracks version + lock state.

Trigger: Auto on every startup. xrun.ts calls performMigrationProcedure() → migrateDatabase('latest'). All versioned migration modules (v293–v335) are imported at startup. (source)

Example (source):

import { Settings } from '@rocket.chat/models';
import { addMigration } from '../../lib/migrations';

addMigration({
    version: 309,
    name: 'Remove unused UI_Click_Direct_Message setting',
    async up() {
        await Settings.removeById('UI_Click_Direct_Message');
    },
});
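
A control document that tracks version plus lock state can be sketched in a few lines. This is an in-memory illustration with hypothetical names, not Rocket.Chat's implementation; a real one would need the lock acquisition to be an atomic update in MongoDB, and may handle failures differently from the `finally` used here.

```python
class MigrationLocked(RuntimeError):
    pass


def migrate_to_latest(control, migrations):
    """Apply pending migrations, guarded by a lock flag in the control doc.

    `control` is a dict like {"version": 308, "locked": False}; `migrations`
    maps a version number to a zero-argument `up` callable.
    """
    if control["locked"]:
        raise MigrationLocked("another instance is migrating")
    control["locked"] = True
    try:
        for version in sorted(v for v in migrations if v > control["version"]):
            migrations[version]()
            control["version"] = version  # record progress after each step
    finally:
        control["locked"] = False
    return control["version"]


removed = []
migrations = {
    308: lambda: removed.append("older change"),
    309: lambda: removed.append("UI_Click_Direct_Message"),
}
control = {"version": 308, "locked": False}
print(migrate_to_latest(control, migrations))
```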

Appsmith (client-side DSL)

Per-document version stamps. 94 sequential migration functions for widget DSL. Runs in the browser, not the server.

Trigger: On every page load. extractCurrentDSL() calls migrateDSL(currentDSL), which runs every if (version === N) block from the stored version up through 94. The upgraded DSL is never written back — migrations re-execute on every load. (source)

Example — migrating legacy styling enums to CSS tokens (source):

enum ButtonBorderRadiusTypes { SHARP = "SHARP", ROUNDED = "ROUNDED", CIRCLE = "CIRCLE" }
const THEMING_BORDER_RADIUS = { none: "0px", rounded: "0.375rem", circle: "9999px" };

export const migrateStylingPropertiesForTheming = (currentDSL: DSLWidget) => {
  // walks every widget, rewrites legacy enum-style borderRadius / boxShadow
  // to CSS token strings used by the theming system
};

Graylog

Log management (since 2010). 91 timestamped Java migration classes. Leader-gated.

Trigger: Auto on startup via ServerBootstrap.runMigrations(). Only runs on the leader node (checked via configuration.isLeader()). Three phases: PREFLIGHT, STANDARD, and ENFORCED_ON_ALL_NODES. No separate k8s job. (source)

Example (source):

public class V20190705071400_AddEventIndexSetsMigration extends Migration {
    @Override
    public ZonedDateTime createdAt() {
        return ZonedDateTime.parse("2019-07-05T07:14:00Z");
    }

    @Override
    public void upgrade() {
        ensureEventsStreamAndIndexSet("Events",
            "Stores events created by event definitions.",
            elasticsearchConfiguration.getDefaultEventsIndexPrefix(),
            Stream.DEFAULT_EVENTS_STREAM_ID, "All events");
    }
}

CouchDB / Offline-First Migrations

RxDB

Reactive JavaScript database for client-side apps. Each collection carries a schema version with migrationStrategies functions.

Trigger: Auto when a collection is opened (if autoMigrate: true, the default). createRxCollection() detects a lower stored schema version and calls migratePromise(). Runs in the browser per-device; awaits leader election in multi-instance databases. (source)

Example — migration strategies and the core iteration loop (source):

// Defining strategies at collection creation
migrationStrategies: {
  1: function(oldDoc) {
    oldDoc.time = new Date(oldDoc.time).getTime(); // string → unix
    return oldDoc;
  },
  2: function(oldDoc) {
    if (oldDoc.time < 1486940585) return null; // deletes document
    return oldDoc;
  }
}

// Core iteration in migration-helpers.ts
let nextVersion = docSchemaVersion + 1;
while (nextVersion <= collection.schema.version) {
    const version = nextVersion; // capture this iteration's value for the async callback
    currentPromise = currentPromise.then(docOrNull =>
        runStrategyIfNotNull(collection, version, docOrNull));
    nextVersion++;
}
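
Stripped of promises, the per-document chain is a plain loop in which a `None` result means "delete this document" and stops further strategies. The sketch below mirrors the two example strategies in Python; the function names are hypothetical, and unlike the JavaScript above it converts to epoch seconds so that the cutoff comparison uses consistent units.

```python
from datetime import datetime, timezone


def migrate_document(doc, doc_version, target_version, strategies):
    """Run strategies doc_version+1..target_version; None means 'delete'."""
    for version in range(doc_version + 1, target_version + 1):
        if doc is None:
            break
        doc = strategies[version](dict(doc))  # copy so the input stays untouched
    return doc


def to_unix(doc):
    parsed = datetime.fromisoformat(doc["time"]).replace(tzinfo=timezone.utc)
    doc["time"] = int(parsed.timestamp())  # ISO string -> epoch seconds (UTC assumed)
    return doc


def drop_old(doc):
    return None if doc["time"] < 1486940585 else doc


strategies = {1: to_unix, 2: drop_old}
kept = migrate_document({"time": "2020-01-01T00:00:00"}, 0, 2, strategies)
dropped = migrate_document({"time": "2016-01-01T00:00:00"}, 0, 2, strategies)
print(kept, dropped)
```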

CHT Core (Community Health Toolkit)

CouchDB-based offline-first health apps used by tens of thousands of health workers in dozens of countries.

Trigger: Two-track. App-level migrations (in api/src/migrations/) auto-run on API startup — the server is unavailable (502) until complete. Cluster-level migrations (3.x → 4.x) require manually running the couchdb-migration Docker tool before upgrading. (docs)

Example — removing a field from CouchDB documents via bulkDocs (source):

module.exports = {
  name: 'remove-enabled-from-translation-docs',
  created: new Date('2025-09-01'),
  run: async () => {
    const translationDocs = await translations.getTranslationDocs();
    translationDocs.forEach(doc => delete doc.enabled);
    await db.medic.bulkDocs(translationDocs);
  }
};

DynamoDB Schema Evolution

Cortex

CNCF Prometheus long-term storage. Time-partitioned schema versioning — you never migrate old data.

Trigger: No data migration. Append a new PeriodConfig block to the YAML config with a future from: date and new schema: version. At runtime, SchemaForTime(timestamp) selects the correct config for each chunk. Old and new schema tables coexist indefinitely. (original PR)

Example — the schema dispatch function (now maintained in Grafana Loki, same code) (source):

type PeriodConfig struct {
    From   DayTime `yaml:"from"`
    Schema string  `yaml:"schema"` // e.g. "v10", "v11"
}

func (cfg SchemaConfig) SchemaForTime(t model.Time) (PeriodConfig, error) {
    for i := range cfg.Configs {
        if t >= cfg.Configs[i].From.Time &&
            (i+1 == len(cfg.Configs) || t < cfg.Configs[i+1].From.Time) {
            return cfg.Configs[i], nil
        }
    }
    return PeriodConfig{}, fmt.Errorf("no schema config found for time %v", t)
}
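
A small worked example helps show why appending a period never touches old data. This is a Python rendering of the same lookup using `bisect`; the dates and the dict-based config are made up for illustration.

```python
import bisect
from datetime import date


def schema_for_time(configs, day):
    """Pick the config whose `from` date is the latest one not after `day`.

    `configs` must be sorted by their "from" date, oldest first.
    """
    starts = [c["from"] for c in configs]
    index = bisect.bisect_right(starts, day) - 1
    if index < 0:
        raise LookupError(f"no schema config found for {day}")
    return configs[index]


configs = [
    {"from": date(2019, 7, 1), "schema": "v10"},
    {"from": date(2020, 10, 1), "schema": "v11"},  # appended later
]
print(schema_for_time(configs, date(2020, 1, 1))["schema"])   # chunk written before the switch
print(schema_for_time(configs, date(2020, 10, 1))["schema"])  # chunk written on the switch date
```

Chunks are always read with the schema that was in force when they were written, so the "migration" is just waiting for old periods to age out of retention.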

Signal Server

Backend for Signal Private Messenger. Uses DynamoDB as primary store. Schema evolution is implicit — most data lives inside a JSON blob attribute.

Trigger: No schema migration. New fields are added to the Account POJO and written into the D (data) attribute on next update. A per-item V (version) attribute provides optimistic locking. Table/GSI changes are provisioned externally via infrastructure-as-code, not application code. (source)
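
The combination described here (new fields ride along inside the `D` blob, while `V` guards against lost updates) can be simulated without DynamoDB. `FakeTable` and the sample account are hypothetical; the real service expresses the same check as a DynamoDB condition expression.

```python
class ConditionalCheckFailed(Exception):
    pass


class FakeTable:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self.items = {}

    def update(self, key, data, expected_version):
        item = self.items.get(key)
        if item is None or item["V"] != expected_version:
            raise ConditionalCheckFailed(key)
        # Condition held: store the new blob and bump the version together.
        self.items[key] = {"D": data, "V": expected_version + 1}


table = FakeTable()
table.items["uuid-1"] = {"D": {"number": "+15550100"}, "V": 0}

stale_version = table.items["uuid-1"]["V"]  # two writers both read V=0
table.update("uuid-1", {"number": "+15550100", "newField": True}, stale_version)

try:
    table.update("uuid-1", {"number": "+15550100"}, stale_version)
    lost_update = True
except ConditionalCheckFailed:
    lost_update = False  # the second writer must re-read and retry

print(table.items["uuid-1"], lost_update)
```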

Example — optimistic locking on DynamoDB writes (source):

static final String ATTR_VERSION = "V";

// Every update atomically increments version and checks the condition
updateExpressionBuilder.append(" ADD #version :version_increment");

return new UpdateAccountSpec(accountTableName,
    Map.of(KEY_ACCOUNT_UUID, AttributeValues.fromUUID(account.getUuid())),
    attrNames, attrValues,
    updateExpressionBuilder.toString(),
    "attribute_exists(#number) AND #version = :version");  // conditional write

Graph Database Schema Evolution

Dgraph

When deploying a new GraphQL schema, Dgraph updates the schema in memory immediately but does not alter existing data — index rebuilds run asynchronously in the background.

Trigger: POST /admin with an updateGQLSchema mutation, or dgraph live --schema. The change propagates to all cluster nodes via Raft. If a predicate’s tokenizer changed, a background goroutine iterates all existing postings in Badger and writes new index entries. (source)

Example — schema mutation with conditional async index rebuild (source):

rebuild := posting.IndexRebuild{
    Attr: su.Predicate, StartTs: startTs,
    OldSchema: &old, CurrentSchema: su,
}

// Write new schema to memory immediately (queries see it now)
schema.State().Set(su.Predicate, rebuild.GetQuerySchema())

if rebuild.NeedIndexRebuild() {
    go buildIndexes(su, rebuild, closer) // async background reindex
} else {
    updateSchema(su, rebuild.StartTs)    // write to Badger, done
}
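The same "swap the schema now, reindex later" pattern can be sketched in a few lines of Python. This is a toy model rather than Dgraph's implementation: `PredicateStore`, its tokenizer functions, and the use of a plain thread are all assumptions made for illustration.

```python
import threading


class PredicateStore:
    """Toy predicate with values, a tokenizer (the schema), and an index."""

    def __init__(self, values, tokenizer):
        self.values = list(values)
        self.tokenizer = tokenizer          # schema visible to queries
        self.index = self._build(tokenizer)

    def _build(self, tokenizer):
        index = {}
        for pos, value in enumerate(self.values):
            for token in tokenizer(value):
                index.setdefault(token, set()).add(pos)
        return index

    def update_schema(self, tokenizer):
        """Swap the schema now; rebuild the index in the background if needed."""
        needs_rebuild = tokenizer is not self.tokenizer
        self.tokenizer = tokenizer          # queries see the new schema at once
        if not needs_rebuild:
            return None

        def rebuild():
            # Existing values are untouched; only index entries are rewritten
            self.index = self._build(tokenizer)

        worker = threading.Thread(target=rebuild)
        worker.start()
        return worker
```

Until the background worker finishes, the index still reflects the old tokenizer, which is the window the asynchronous rebuild trades for a fast schema change.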

Openclaw is Spam, Like Any Other Automated Email

Jeremy — Mon, 23 Feb 2026 01:23:08 +0000

Open Source communities are trying to quickly adapt to the present rapid advances in technology. I would like to propose some clarity around something that should be common sense.

Automated emails are spam. They always have been. Openclaw (and whatever new thing surfaces this summer) is no different.

Policies saying automated emails/messages are banned – including anything AI generated – are not only common-sense policies, they aren’t even a change from how we’ve always worked. This includes automated comments on GitHub issues, automated PRs, automated patch submissions, and even any kind of automated review. Copilot automated reviews, Snyk, etc – are ok if-and-only-if they’re configured by the owners of the repo/project. Common sense.

Enforcement of these policies – more than ever – depends on trust and relationships. I do think, for example, that non-native English speakers should be allowed to use AI to help them check their English. Used responsibly, AI tools can help a lot with language learning! Your grammar checker is probably based on some kind of LLM anyway. But I’m saying that a human always presses the “send” button on the message, and this human is responsible for the words they sent. If moderators suspect automated messages, every open source project should have a policy they can cite for blocking/banning the account.

Tomas Vondra’s article “the AI inversion” is the latest of many good and thought-provoking pieces I’ve read – it’s well worth the read – although he’s getting at deeper problems than what I’m writing about here – and he has very good reasons to have a much deeper level of concern for the impact of AI tooling on open source communities. These are interesting times and we don’t have all the answers yet.

A few more things I’ve recently read, which I think are good:

CloudNativePG AI policy – https://github.com/cloudnative-pg/governance/blob/main/AI_POLICY.md
Linux Foundation AI Policy – https://www.linuxfoundation.org/legal/generative-ai
Bryan Cantrill, Oxide RFD – https://rfd.shared.oxide.computer/rfd/0576
Russ Cox on GoLang and AI – https://groups.google.com/g/golang-dev/c/4Li4Ovd_ehE/m/8L9s_jq4BAAJ?pli=1
Jordan Tigani about AI @ MotherDuck – [long painful URL for LinkedIn post]

I’ve also been writing bits and pieces of partial thoughts over the past week or two – my short blog post about the Scott Shambaugh situation (And thank you to Kim Bruning for the thoughtful email exchanges about this blog! Please continue to keep this old guy on his toes, reasoning through things, and challenging his thinking!)

There have been a bunch of LinkedIn messages too; capturing them here:

Mischa van den Burg wrote a LinkedIn post about whether ChatGPT in interviews is a red flag
- Brad Nicholson said “As someone that knows how to find that sort of info command line and has done so many, many times – I’d go to chat first, google second and the man pages last because the first two get me what I need faster than reading a man page.”
  
- Replying to Brad:
  
  “I do the same thing, but we also understand this is in descending order of hallucination likelihood
  
  one of my favorite ways to use agents is to write me a script that demonstrates a behavior they claim… by the time the script is working, the claim is often significantly revised – and at present i still usually have to prevent them from making the test script work by moving the goalposts”
  
Replying to Phil Eaton’s post about Russ Cox’s perspective on golang project approach (policy?) for AI:
- Russ Cox’s message is here
- i said “yes – the section here is a good excerpt” (referring to Phil’s excellent choice of what to screenshot)
  
Replying to Kelsey Hightower’s post “Generative AI is a slop generation machine by default. You have to put in a lot of work to get something of quality from it.”
- It’s the same work I did before, just shifted left. I’m iterating on low-level detailed design spec and autogenerating code, rather than iterating on the code and trying to keep design docs in sync. I think of it as writing more of my code in detailed prose, flowcharts, sequence diagrams, and pseudocode – rather than writing it directly in the programming language and manually keeping the design docs in sync. But it’s the same work, minus time spent on syntax (which was never where the value was).
  
- Replying to Adam Jacob’s comment: I think it remains true that “you get out what you put in”
  
Replying to Jordan Tigani’s post about MotherDuck AI policy:
- Tricky topic. I built a deeply detailed design for overhauling how auth works on a core platform…
  * 291 prompts across 15 sessions over 3 days, comprising ~1,226 lines of prompt text
  * final design document is 1,904 lines of markdown — a ratio of roughly 2 lines of human-written prompt for every 3 lines of design document output
  
  a review of the full transcript showed a number of interesting characteristics of my prompts, including:
  * Persistent effort to simplify the tool’s initial proposals
  * Directly contributing critical domain knowledge
  * Frequent insistence on precise terminology
  
  Overall I’m satisfied and I think it’s a good doc that would have taken me 10x longer otherwise (especially research portions) – but I acknowledge mixed feelings.
  
  one thing that’s clear: i obviously got the AI game backwards. i thought it was scored like golf, where a lower ratio of input-to-output is a better score
  
My own LinkedIn post: “We need to re-think OSS contribution attribution in light of AI. More than ever, it’s important for committers to give credit on where the ideas are coming from. A committer can copy/paste someone else’s ideas into their own prompts, and they need to give appropriate credit.”

Mentioning the Oxide RFD in my reply to Daniel Gustafsson:
- my thought is around crediting someone who participates meaningfully in the discussion, even if they didn’t author the final patch. a wall of text email that nobody entirely reads is not a meaningful contribution – but there are lots of ways AI can be part of a well-written email. it’s hard to find the objective line though about what this means.
  
  email moderation is going to get harder. trust and relationships were always important, and now even more so. i think using AI for research or to assist with writing is a net positive – as long as the final written product is concise and well-communicated and understood by the author. AI is the tool, but it’s finally still a human relationship. oxide’s RFD is good https://rfd.shared.oxide.computer/rfd/0576 – responsibility, rigor and empathy remain fundamental. old-school email lists might have a small advantage here. and i hope we can stay open to new people who seem interested to join and contribute
  
Replying to Adam Jacob’s post: “If you’re thinking to yourself “this 10x increase in capability to create software doesn’t matter, because writing software was never the bottleneck”, you’re drawing the wrong conclusions from a true statement. … [skipping middle section, but go read the whole thing bc its good] … We will rebuild everything around this capability. Everything.”
- what people miss: it doesn’t need to be 10x more code, it can be same code 10x faster (which often is very little code — but it would have taken much longer to get it right)
  
  but Adam why are you telling everyone? i’m having so much fun right now, and once everyone figures it out then we’ll be back to the usual drill…
  
I wrote a LinkedIn post about how I think moderation will get harder, then clarified a bit in reply to Andreas Scherbaum by pointing to Tomas Vondra’s blog because that’s much better than what I said.

The Scott Shambaugh Situation Clarifies How Dumb We Are Acting

Jeremy — Fri, 13 Feb 2026 19:05:59 +0000

Edit: related blog published Feb 22 – Openclaw is Spam, Like Any Other Automated Email

My personal blog here is dedicated to tech geek material, mostly about databases like postgres. I don’t get political, but at the moment I’m so irritated that I’m making the extraordinary exception to veer into the territory of flame-war opinionating…

This relates to Postgres because Scott is a volunteer maintainer on an open source project called matplotlib and the topic is something that we are all navigating in the open source space. Last night at the Seattle Postgres User Group meetup Claire Giordano gave a presentation about how the postgres community works and this was one of the first topics that came up in the Q&A at the end! Like every open source project, Postgres is trying to figure out how to deal with the rapid change of the industry as new, powerful, useful AI tools enable us to do things we couldn’t do before (which is great). Just two weeks ago, the CloudNativePG project released an AI Policy which builds on work from the Linux Foundation and discussion around the Ghostty policy. We’re in the middle of figuring this out and we’re working hard.

Just now, I saw this headline on the front page of the Wall Street Journal:

I personally find this to be outright alarming. And it’s the most clear expression that I’ve seen of deeply wrong, deeply concerning language we’ve all been observing. Many of us in tech communities are complicit in this, and now even press outlets like the WSJ are joining us in complicity.

Corrected headline: Software Engineer Responsible for Bullying, Due to Irresponsible Use of AI, Has Not Yet Apologized

This article uses language I hear people use all the time in the tech community: Several hours later, the bot apologized to Shambaugh for being “inappropriate and personal.”

This language basically removes accountability and responsibility from the human, who configured an AI agent with the ability to publish content that looks like a blog with zero editorial control – and I haven’t looked deeply but it seems like there may not be clear attribution of who the human is, that’s responsible for this content.

We all need to collectively take a breath and stop repeating this nonsense. A human created this, manages this, and is responsible for this.

It’s one thing when I hear this dumb language on LinkedIn, but I’m alarmed to see it on the front page of a major media outlet like the Journal.

Our contributions to dialogue in the tech industry – on LinkedIn, at meetups, with coworkers, at conferences, on other social media, etc – these all make small contributions to our culture. Poor American culture sometimes seems to be in a weird cycle of taking a very long time to acknowledge very common-sense things, because vested interests (often with much financial motivation) want to push a certain narrative and everyone knows it’s bunk but nobody says so. Personally I think this applies to a wide array of issues, not just tech.

Folks, please speak up about stuff that’s stupid obvious. Bullying of open source maintainers should be alarming to us, and whoever the person is that’s responsible for this needs to step up and take responsibility. Personally.

And we all need to dial back this over-the-top anthropomorphizing of useful electronic gadgets that we’re building and selling.

Postgres client_connection_check_interval

Jeremy — Thu, 05 Feb 2026 04:54:40 +0000

Saw this post on LinkedIn yesterday:

I also somehow missed this setting for years. And it’s crazy timing, because it’s right after I published a blog about seeing the exact problem this solves. In my blog post I mentioned “unexpected behaviors (bugs?) in… Postgres itself.” Turns out Postgres already has the fix; it’s just disabled by default.

It was a one-line change to add the setting to my test suite and verify the impact. As a reminder, here’s the original problematic behavior which I just now reproduced again:

At the T=20sec mark, TPS drops from 700 to around 30. At T=26sec the total connections hit 100 (same as max_connections) and then TPS drops to almost zero. This total system outage continues until T=72sec when the system recovers after the blocking session has been killed by the transaction_timeout setting.

So what happens if we set client_connection_check_interval to 15 seconds? Quick addition to docker-compose.yml and we find out!

Fascinating! The brown line and the red line are the important ones. As before, the TPS drops at T=20sec and zeros out after we hit max_connections. But at T=35sec we start to see the total connection count slowly decrease! This continues until T=42sec when the PgBouncer connections are finally released – and at this point we repeat the whole cycle a second time, as the number of total connections climbs back up to the max.

So we can see that the 15 second client_connection_check_interval setting is working exactly as expected (if a little slowly) – at the 15 second mark Postgres begins to clean up the dead connections.

What if we do a lower setting like 2 seconds?

This looks even better! The total connections climbs to around 30-ish and holds stable there. And more importantly, the TPS never crashes out all the way to zero and the system is able to continue with a small workload until the blocking session is killed.

There is definitely some connection churn happening here (expected due to golang context timeouts) and with Postgres taking 2 seconds to clear them out, equilibrium is apparently around 30. A higher attempted TPS would bring this value higher.
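A rough way to reason about that equilibrium is a Little's-law style estimate: lingering dead connections are roughly the rate at which clients abandon connections multiplied by how long each one lingers. The sketch below assumes each abandoned connection lingers for about one check interval and back-derives the abandon rate from the 2-second run; both are simplifying assumptions, not measurements.

```python
MAX_CONNECTIONS = 100  # same limit as in the test described above


def estimated_lingering(abandon_rate_per_sec: float, check_interval_sec: float) -> float:
    """Little's-law style estimate of lingering connections, capped at the limit."""
    return min(MAX_CONNECTIONS, abandon_rate_per_sec * check_interval_sec)


# Rate implied by the 2-second run: ~30 connections / 2 s = ~15 per second
IMPLIED_RATE = 30 / 2
```

Under these assumptions a 15-second interval would need far more than 100 backends, which is consistent with that run hitting max_connections, while shorter intervals keep the count well below the limit.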

Let’s try one more time with an even lower setting of 500ms:

The TPS seems around the same and this time the connection count seems to stay very low.

Finally, let’s look at the networking stack from the OS perspective, at the number of sockets in CLOSE-WAIT state:

This is where the impact of client_connection_check_interval becomes very clear. Postgres is working exactly as expected and cleaning up dead connections based on the delay that’s specified in this parameter.

I find myself agreeing with Marat on LinkedIn, and I feel like there’s a strong case for giving this parameter a default value.

And now please excuse me while I go update my original blog post.

How Blocking-Lock Brownouts Can Escalate from Row-Level to Complete System Outages

Jeremy — Tue, 20 Jan 2026 04:23:48 +0000

This article is a shortened version. For the full writeup, go to https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion

This test suite demonstrates a failure mode when application bugs which poison connection pools collide with PgBouncers that are missing peer config and positioned behind a load balancer. PgBouncer’s peering feature (added with v1.19 in 2023) should be configured if multiple PgBouncers are being used with a load balancer – this feature prevents the escalation demonstrated here.

The failures described here are based on real-world experiences. While uncommon, this failure mode has been seen multiple times in the field.

Along the way, we discover unexpected behaviors (bugs?) in Go’s database/sql (or sqlx) connection pooler with the pgx client and in Postgres itself.

Sample output: https://github.com/ardentperf/pg-idle-test/actions/workflows/test.yml

The Problem in Brief

Go’s database/sql allows connection pools to become poisoned by returning connections with open transactions for re-use. Transactions opened with db.BeginTx() will be cleaned up, but – for example – conn.ExecContext(..., "BEGIN") will not be cleaned up. PR #2481 adds some cleanup logic in pgx for database/sql connection pools; I tested the PR with this test suite. The PR relies on the transaction status indicator in the ReadyForQuery message which Postgres sends back to the client as part of its network protocol.
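For reference, the protocol message that carries the transaction status indicator is ReadyForQuery: a one-byte tag `Z`, a four-byte length of 5, and one status byte (`I` idle, `T` in a transaction block, `E` in a failed transaction block). The Python sketch below parses that message; `safe_to_reuse` is a hypothetical helper showing how a pool could use the status, not the logic from the pgx PR.

```python
import struct

# Transaction status indicators carried by ReadyForQuery
IDLE, IN_TRANSACTION, FAILED_TRANSACTION = b"I", b"T", b"E"


def parse_ready_for_query(message: bytes) -> bytes:
    """Return the status byte from a raw ReadyForQuery message.

    Layout: 'Z' (1 byte), int32 length = 5 (big-endian), status (1 byte).
    """
    tag, length, status = struct.unpack("!cic", message)
    if tag != b"Z" or length != 5:
        raise ValueError("not a ReadyForQuery message")
    if status not in (IDLE, IN_TRANSACTION, FAILED_TRANSACTION):
        raise ValueError(f"unknown transaction status {status!r}")
    return status


def safe_to_reuse(message: bytes) -> bool:
    """Hypothetical pool check: only idle connections go back in the pool."""
    return parse_ready_for_query(message) == IDLE
```

A connection that ran a bare `BEGIN` reports `T` after every statement, so a check like this would catch it at the moment it is handed back to the pool.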

A poisoned connection pool can cause an application brownout since other sessions updating the same row wait indefinitely for the blocking transaction to commit or rollback its own update. On a high-activity or critical table, this can quickly lead to significant pile-ups of connections waiting to update the same locked row. With Go this means context deadline timeouts and retries and connection thrashing by all of the threads and processes that are trying to update the row. Backoff logic is often lacking in these code paths. When there is a currently running SQL (hung – waiting for a lock), pgx first tries to send a cancel request and then will proceed to a hard socket close.

If PgBouncer’s peering feature is not enabled, then cancel requests load-balanced across multiple PgBouncers will fail because the cancel key only exists on the PgBouncer that created the original connection. The peering feature solves the cancel routing problem by allowing PgBouncers to forward cancel requests to the correct peer that holds the cancel key. This feature should be enabled – the test suite demonstrates what happens when it is not.
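The routing problem can be simulated in a few lines of Python. This is a toy model with assumed names (`Bouncer`, `failed_cancel_ratio`) and a load balancer that picks an instance uniformly at random for every new TCP connection, including the separate connection a cancel request arrives on.

```python
import random


class Bouncer:
    def __init__(self):
        self.cancel_keys = set()  # keys of connections this instance created
        self.peers = []           # stays empty in "nopeers" mode

    def handle_cancel(self, key):
        if key in self.cancel_keys:
            return True
        # Peering: forward the request to the peer that owns the key
        return any(key in peer.cancel_keys for peer in self.peers)


def failed_cancel_ratio(count, peering, requests=1000, seed=1):
    """Fraction of cancel requests that no instance could honor."""
    rng = random.Random(seed)
    bouncers = [Bouncer() for _ in range(count)]
    if peering:
        for b in bouncers:
            b.peers = [p for p in bouncers if p is not b]
    failures = 0
    for key in range(requests):
        rng.choice(bouncers).cancel_keys.add(key)         # connection lands somewhere
        if not rng.choice(bouncers).handle_cancel(key):   # cancel is balanced separately
            failures += 1
    return failures / requests
```

With one instance or with peering the failure ratio is zero; with two independent instances roughly half of the cancels land on the instance that has never seen the key.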

Postgres immediately cleans up connections when it receives a cancel request. However, by default Postgres does not clean up connections when their TCP sockets are hard closed, if the connection is waiting for a lock. As a result, Postgres connection usage climbs while PgBouncer continually opens new connections that block on the same row. The app’s poisoned connection pool quickly leads to complete connection exhaustion in the Postgres server.

Edit Feb 5: Postgres setting client_connection_check_interval enables dead connection cleanup.

Existing connections will continue to work, as long as they don’t try to update the row which is locked. But the row-level brownout now becomes a database-level brownout – or perhaps a complete system outage (once the Go database/sql connection pool is exhausted) – because postgres rejects all new connection attempts from the application.

Result: Failed cancels → client closes socket → backends keep running → CLOSE_WAIT accumulates → Postgres hits max_connections → system outage

Architecture

The test uses Docker Compose to create this infrastructure with configurable number of PgBouncer instances.

The Test Scenarios

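test_poisoned_connpool_exhaustion.sh accepts three parameters: the number of PgBouncers, the failure mode, and the pool mode. Each is described in the sections below.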

In this test suite:

The failure is injected 20 seconds after the test starts.
Sessions that sit idle in a transaction are aborted and rolled back after 20 seconds (idle_in_transaction_session_timeout).
Postgres is configured to abort and rollback any and all transactions if they are not completed within 40 seconds. Note that the transaction_timeout setting (for total transaction time) should be used cautiously, and is available in Postgres v17 and newer.

PgBouncer Count: 1 vs 2 (nopeers mode)

Config	Cancel Behavior	Outcome
1 PgBouncer	All cancels route to same instance	Cancels succeed, no connection exhaustion
2 PgBouncers	~50% cancels route to wrong instance	Cancels fail, connection exhaustion

Failure Mode: Sleep vs Poison

Mode	What Happens	Outcome	Timeout
sleep	Transaction with row lock is held for 40 seconds without returning to pool	Normal blocking scenario where lock holder is idle (not sending queries)	Idle timeout fires after 20s, terminates session & releases locks
poison	Transaction with row lock is returned to pool while still open	Bug where connections with open transactions are reused	Idle timeout never fires (connection is actively used). Transaction timeout fires after 40s, terminates session and releases locks

Pool Mode: nopeers vs peers (2 PgBouncers)

Mode	PgBouncer Config	Cancel Behavior
nopeers	Independent PgBouncers (no peer awareness)	Cancel requests may route to wrong PgBouncer via load balancer
peers	PgBouncer peers enabled (cancel key sharing)	Cancel requests are forwarded to correct peer

Summary

PgBouncers	Failure Mode	Pool Mode	Expected Outcome
2	poison	nopeers	Database-level Brownout or System Outage – TPS crashes to ~4, server connections max out at 95, TCP sockets accumulate in CLOSE_WAIT state, cl_waiting spikes
1	poison	nopeers	Row-level Brownout – TPS drops with no recovery (~11), server connections stay healthy at ~11, no server connection exhaustion
2	poison	peers	Row-level Brownout – TPS drops with no recovery (~15), cl_waiting stays at 0, peers forward cancels correctly
2	sleep	nopeers	Database-level Brownout or System Outage – Server connection spike to 96, full recovery after lock released and some extra time, system outage vs brownout depends on how quickly the idle timeout releases lock
2	sleep	peers	Row-level Brownout – No connection spike, full recovery after lock released, no risk of system outage

Test Results

Transactions Per Second

TPS is the best indicator of actual application impact. It’s important to notice that PgBouncer peering does not prevent application impact from either poisoned connection pools or sleeping sessions. The section below titled “Detection and Prevention” has ideas which address the actual root cause and truly prevent application impact.

After the lock is acquired at t=20, TPS drops from ~700 to near zero in all cases as workers block on the locked row held by the open transaction.

Sleep mode (orange/green lines): Around t=40, Postgres’s idle_in_transaction_session_timeout (20s) fires and kills the blocking session. TPS recovers to ~600-700.

Poison mode (red/purple/blue lines): The lock-holding connection is never idle—it’s constantly being picked up by workers attempting queries—so the idle timeout never fires. TPS remains near zero until Postgres’s transaction_timeout (40s) fires at t=60, finally terminating the long-running transaction and releasing the lock.
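The two recovery points follow directly from the configured timeouts. A small Python sketch of that arithmetic, using the values stated for this test suite (the function name `lock_release_time` is just for illustration):

```python
INJECT_AT = 20            # seconds after test start
IDLE_IN_TX_TIMEOUT = 20   # idle_in_transaction_session_timeout
TRANSACTION_TIMEOUT = 40  # transaction_timeout


def lock_release_time(mode: str) -> int:
    """Second at which Postgres terminates the lock holder."""
    if mode == "sleep":
        # Holder sits idle in its transaction, so the idle timeout fires first
        return INJECT_AT + min(IDLE_IN_TX_TIMEOUT, TRANSACTION_TIMEOUT)
    if mode == "poison":
        # Holder keeps running statements, so only the total limit applies
        return INJECT_AT + TRANSACTION_TIMEOUT
    raise ValueError(mode)
```

This gives t=40 for sleep mode and t=60 for poison mode, matching the points where the lock is released in the graph.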

TCP CLOSE-WAIT Accumulation

2 PgBouncers (nopeers) (red/orange lines): CLOSE_WAIT connections accumulate rapidly because:

Cancel request goes to wrong PgBouncer → fails
Client gives up and closes socket
Server backend is still blocked on lock, hasn’t read the TCP close
Connection enters CLOSE_WAIT state on Postgres

In poison mode (red), CLOSE_WAIT remains at ~95 until transaction_timeout fires at t=60. In sleep mode (orange), CLOSE_WAIT clears around t=40 when idle_in_transaction_session_timeout fires.

1 PgBouncer and peers modes (purple/blue/green lines): Minimal or zero CLOSE_WAIT because cancel requests succeed—either routing to the single PgBouncer or being forwarded to the correct peer.

Connection Pool Wait Time vs PgBouncer Client Wait

Go’s database/sql pool tracks how long goroutines wait to acquire a connection (db.Stats().WaitDuration). PgBouncer tracks cl_waiting—clients waiting for a server connection. These metrics measure wait time at different layers of the stack.

This graph shows 2 PgBouncers in poison mode (nopeers)—the worst-case scenario:

Total Connections (brown) climb rapidly after poison injection at t=20 as failed cancels leave backends in CLOSE_WAIT
TPS (green) crashes to near zero and stays there until transaction_timeout fires at t=60
oldest_xact_age (purple) climbs steadily from 0 to 40 seconds
Once Postgres hits max_connections - superuser_reserved_connections (95), new connections are refused
PgBouncer #1 cl_waiting (red) then spikes as clients queue up waiting for available connections
AvgWait (blue) increases as workers wait for the non-blocked connections to become available

Note the gap between when transaction_timeout fires (t=60, visible as oldest_xact_age dropping to 0) and when TPS fully recovers. TPS recovery correlates with cl_waiting dropping back to zero—PgBouncer needs time to clear the queue of waiting clients and re-establish healthy connection flow. This recovery gap only occurs in nopeers mode; the TPS comparison graph shows that peers mode recovers immediately when the lock is released because connections never exhaust and cl_waiting stays at zero.

Why is AvgWait (blue) so low despite the system being in distress? The poisoned connection (holding the lock) continues executing transactions without blocking—it already holds the lock, so its queries succeed immediately. This one connection cycling rapidly through the pool with sub-millisecond wait times heavily skews the average lower, masking the fact that other connections are blocked.
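A quick numeric illustration of that skew, in Python. The wait times and counts below are made-up values chosen only to show the effect; they are not measurements from the test suite.

```python
from statistics import mean

# Hypothetical pool-acquisition waits, in seconds
fast_waits = [0.0002] * 5000   # many quick acquisitions of the poisoned connection
slow_waits = [5.0] * 20        # a few workers that waited a long time

all_waits = fast_waits + slow_waits
avg_wait = mean(all_waits)                       # roughly 0.02 seconds
worst_wait = max(all_waits)                      # 5.0 seconds
share_slow = len(slow_waits) / len(all_waits)    # well under 1% of samples
```

The average lands around 20 milliseconds even though some workers waited 5 seconds, which is why a maximum or high percentile is a better distress signal than the mean here.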

The cl_waiting metric is collected as cnpg_pgbouncer_pools_cl_waiting from CloudNativePG. See CNPG PgBouncer metrics.

Detection and Prevention

Monitoring and Alerting:

Alert on:

Number of backends waiting on locks over some threshold
cnpg_backends_total showing established connections at a high percentage of max_connections
cnpg_backends_max_tx_duration_seconds showing transactions open for longer than some threshold (nb. long-running queries are often legitimate)

-- Count backends waiting on locks
SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
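The three alert conditions above could be combined into a single check along these lines. This is a sketch in Python with hypothetical threshold defaults and a hypothetical `backends_waiting_on_locks` key (fed by the query above); only the two `cnpg_` metric names come from the text.

```python
def evaluate_alerts(metrics, max_connections,
                    lock_wait_limit=10, conn_ratio_limit=0.8, tx_age_limit=300):
    """Return the names of alert conditions that are currently firing."""
    alerts = []
    if metrics["backends_waiting_on_locks"] > lock_wait_limit:
        alerts.append("lock_waiters")
    if metrics["cnpg_backends_total"] / max_connections >= conn_ratio_limit:
        alerts.append("connection_saturation")
    if metrics["cnpg_backends_max_tx_duration_seconds"] > tx_age_limit:
        alerts.append("long_transaction")
    return alerts
```

Thresholds need tuning per workload; as noted above, long-running transactions are often legitimate, so that limit in particular should be set with care.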

Prevention Options:

Options to prevent the root cause (connection pool poisoning):

Find and fix connection leaks in the application – ensure all transactions are properly committed or rolled back
Use OptionResetSession callback – automatically discard leaked connections (see the full writeup)
Fix at the driver level – PR #2481 adds automatic detection in pgx

Options to prevent the escalation from row-level brownout to system outage:

Enable PgBouncer peering – if using multiple PgBouncers behind a load balancer, configure the peer_id and [peers] section so cancel requests are forwarded to the correct instance (see PgBouncer documentation). This prevents connection exhaustion but does not prevent the TPS drop from lock contention.
Use session affinity (sticky sessions) in the load balancer based on client IP – ensures cancel requests route to the same PgBouncer as the original connection (see the HAProxy Session Affinity example in the full writeup)

Options to limit the duration/impact:

Set appropriate timeout defaults – configure system-wide timeouts to automatically terminate problematic sessions:
- idle_in_transaction_session_timeout – terminates sessions idle in a transaction (e.g., 5min)
- transaction_timeout (Postgres 17+) – use caution; limits total transaction duration regardless of activity (e.g., 30min)

Postgres:

Edit Feb 5: Postgres setting client_connection_check_interval enables dead connection cleanup.

Results Summary, Understanding the Layers Leading to the System Outage, Unique Problems, and more - available in the full writeup at https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion

Postgres Booth at PASS Data Community Summit

Jeremy — Sun, 30 Nov 2025 23:40:49 +0000

PASS Data Community Summit 2025 wrapped up last week. This conference originated 25 years ago with the independent, user-led, not-for-profit “Professional Association for SQL Server (PASS)” and the annual summit in Seattle continues to attract thousands of database professionals each year. After the pandemic it was reorganized and broadened as a “Data Community” event, including a Postgres track.

Starting in 2023, volunteers from the Seattle Postgres User Group have staffed a postgres community booth on the exhibition floor. We provide information about Postgres User Groups around the world and do our best to answer all kinds of questions people have about Postgres. The booth consistently gets lots of traffic and questions.

The United States PostgreSQL Association has generously supplied one of their booth kits each year, which has a banner/background and some booth materials like stickers and a map with many user groups and a “welcome to postgres” handout and postgres major version handouts. We supplement with extra pins and stickers and printouts like the happiness hints I’ve put together, a list of common extensions that Rox made, and a list of Postgres events that Lloyd made. Every year, we also bring leftover Halloween candy that we want to get rid of and we put it in a big bowl on the table.

One of the top questions people ask is how and where they can learn more about Postgres. Next year I might just print out the Links section from my blog, which has a bunch of useful free resources. Another idea I have is for Redgate and EnterpriseDB – I think both of these companies have paid training but also give free access to a few introductory classes – it would be nice if they made a small card with a link to their free training. I think we could have a stack of these cards at our user groups and at the PASS booth. The company can promote paid training, but the free content can benefit anyone even if they aren’t interested in the paid training. I might also reach out to other companies who have paid training and see if they’d be willing to open up a bit of pre-recorded introductory content for free. (Data Egret? Creston Jamison?) Come to think of it, a list of weekly newsletters and podcasts might also be a great thing to print on a handout or a card – Postgres Weekly, postgres.fm, Talking Postgres, Scaling Postgres, etc.

The disk in the picture below is not from our booth; it’s an original SQL Server installation disk, and the crew over at Fortified apparently found a whole box of them on eBay and were handing them out at their booth. As a result, I overheard someone explaining to another conference attendee what a “floppy disk” is and why the bottom opens. (In the background, on our booth table, you can see that I “fixed” the DocumentDB sticker…)

This year, I took my home office white board and drove it down to the convention center along with a bunch of magnets. Rick Lowe’s wife Becka picked up two wall-mount metal mesh file organizers and four S-hooks, which we hung on the white board and filled with handouts that Ben Chobot printed on his home printer. Thank you! This worked really well and you can see it in the picture below. It freed up space on the table for other things like pins and stickers, the raffle, and a very cool elephant that Lloyd brought.

As always: a huge shout-out to our local volunteers! From the left in the picture below: Lloyd Albin, me, Ben Chobot, Deon Gill, Rick Lowe, and… Pavlo Golub who is not technically local but joined us for our volunteer dinner/hangout! Harry Pierson missed our volunteer dinner but he’s on the right side in the booth picture above.

We raffled off a signed copy of Ryan Booz and Grant Fritchey’s new book: Introduction to PostgreSQL for the data professional. Congratulations to our winner – Tomi from Croatia!

Most of my time was at the booth. I had one speaking session on Friday, and spoke about CloudNativePG Quorum Failover. I originally intended to just expand the talk from KubeCon the week before. But I ended up heavily re-writing after realizing that out of 242 sessions at PASS there were only 6 that even mentioned Kubernetes. I ended up spending the first half of the talk with a simple introduction to containers and Kubernetes – a couple slides and then a terminal window with docker and kind to demonstrate the basics.

Finally, it was great fun to catch up with some old Oracle friends like Kellyn Gorman, Gustavo René Antúnez, Shane Borden and Gleb Otochkin. And of course it was great to see Lukas Fittl and Ryan Booz and Grant Fritchey. These are all solid, amazing people and if you ever see them at a conference then don’t hesitate to introduce yourself and strike up a conversation!

I enjoy traveling for conferences but I’m still in a season of limited travel for family reasons (and probably will be for a while) – so I look forward to any time Postgres people visit Seattle. Helping organize the Postgres booth for PASS is a bit of work, but it’s worthwhile for the chance to connect. I look forward to seeing the Postgres track grow at PASS Data Community Summit!

KubeCon 2025: Bookmarks on Memory and Postgres

Jeremy — Sun, 16 Nov 2025 22:55:31 +0000

Just got home from KubeCon.

One of my big goals for the trip was to make some progress in a few areas of postgres and kubernetes – primarily around allowing more flexible use of the linux page cache and avoiding OOM kills with less hardware overprovisioning. When I look at Postgres on Kubernetes, I think there are idle resources (both memory and CPU) on the table with the current Postgres deployment models that generally use guaranteed QoS.

Ultimately this is about cost savings. I think we can still run more databases on less hardware without compromising the availability and reliability of our database services.

The trip was a success, because I came home with lots of reading material and homework!

Putting a few bookmarks here, mostly for myself to come back to later:

key place for discussion is sig-node
documentation on node-pressure eviction https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
- eviction signal thresholds can be customized
- it looks like priority classes give a lot of control over the order in which pods are evicted
documentation on priority classes https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
cgroups v2 memory controller documentation https://docs.kernel.org/admin-guide/cgroup-v2.html#memory
long running github issue about pod evictions due to kubernetes (incorrectly?) interpreting active page cache as working memory that won’t be reclaimed https://github.com/kubernetes/kubernetes/issues/43916
new feature MemoryQOS – still alpha (feature gate off-by-default)
- KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos/
  - currently stalled – related message from Linux Kernel Mailing Lists https://lkml.org/lkml/2023/6/1/1300
  - “Future: memory.high can be used to implement kill policies in for userspace OOMs, together with Pressure Stall Information (PSI). When the workloads are in stuck after their memory usage levels reach memory.high, high PSI can be used by userspace OOM policy to kill such workload(s).”
- Nov 2021 blog https://kubernetes.io/blog/2021/11/26/qos-memory-resources/
- May 2023 blog https://kubernetes.io/blog/2023/05/05/qos-memory-resources/
- Brief mention in docs https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#memory-qos-with-cgroup-v2
metrics added to CAdvisor for both active and inactive page cache https://github.com/google/cadvisor/pull/3445
metric added for PSI https://kubernetes.io/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/
homework – taking a closer look at anonymous memory and page cache metrics (both active and inactive) for real postgres databases on kubernetes
homework – set up tests that emulate the diagram below and demonstrate the eviction behavior that i think will happen
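
As a starting point for the first homework item, here is a minimal sketch that splits a cgroup v2 memory.stat file into anonymous memory and active/inactive page cache. It also estimates a working set as usage minus inactive_file, which is how I read the node-pressure eviction documentation; treat that formula as an assumption. The function name cgroup_mem_breakdown and its two arguments are made up for this example.

```shell
#!/bin/sh
# Sketch only. Arguments (hypothetical names for this example):
#   $1 = path to a cgroup v2 memory.stat file
#   $2 = total usage in bytes (for example, the contents of memory.current)
# Prints anonymous memory, active and inactive page cache, and an estimated
# working set. ASSUMPTION: working set = usage - inactive_file, so active
# page cache is counted as if it could not be reclaimed.
cgroup_mem_breakdown() {
  awk -v usage="$2" '
    $1 == "anon"          { anon = $2 }
    $1 == "active_file"   { active = $2 }
    $1 == "inactive_file" { inactive = $2 }
    END {
      # %.0f avoids integer clamping in some awk implementations
      printf "anon=%.0f active_file=%.0f inactive_file=%.0f working_set=%.0f\n",
             anon, active, inactive, usage - inactive
    }' "$1"
}
```

Inside a container this could be called as cgroup_mem_breakdown /sys/fs/cgroup/memory.stat "$(cat /sys/fs/cgroup/memory.current)", assuming the cgroup v2 filesystem is mounted at that path. A large active_file value next to a small anon value would be the case where page cache is pushing a pod toward eviction.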

I still have a lot of catching up to do. I sketched out the diagram below, but please take this with a large grain of salt – this aspect of kubernetes is complex and linux memory management is complex:

I tried to summarize some thoughts in a comment on the long-running github issue, but this might be wrong – it’s just what I’ve managed to piece together so far.

My “user story” is that (1) I’d like a higher limit and more memory over-commit for page cache specifically – letting linux use available/unused memory as needed for page cache and (2) I’d like a lower request to get scheduling closer to actual anonymous memory needs. I’m running Postgres. In the current state, I have to simultaneously set an artificially low limit on per-pod page cache (to avoid eviction) and an artificially high request on per-pod anonymous memory (to avoid OOM kills by getting a protective oom_score_adj). I’d like individual pods to be able to burst anonymous memory usage (e.g. an unexpected SQL query that hogs memory), if we can steal from the page cache of other pods beyond their request – avoiding OOM. The linux kernel can do this; I think it should be possible with the right cgroup settings?

It seems like the new Memory QOS feature might be assigning a static calculated value to memory.high – but for page cache usage, I wonder if we actually want kubernetes to dynamically adjust memory.high eventually as low as request in an attempt to reclaim node-level resources – before evicting end-user pods – when the memory.available eviction signal has exceeded the threshold?

Anyway it’s also worth pointing out that the postgres problems are likely accentuated by higher concentrations of postgres on nodes; if databases are spread across large multi-tenant clusters that likely mitigates things a bit.

Edit 11/29: Alexey Demidov replied on the github issue and pointed out the problem; the linux kernel throttles CPU of processes when we use memory.high so this probably makes my idea above ineffective.

Explaining IPC:SyncRep – Postgres Sync Replication is Not Actually Sync Replication

Jeremy — Mon, 27 Oct 2025 23:12:27 +0000

Postgres database-level “synchronous replication” does not actually mean the replication is synchronous. It’s a bit of a lie really. The replication is actually – always – asynchronous. What it actually means is “when the client issues a COMMIT then pause until we know the transaction is replicated.” In fact the primary writer database doesn’t need to wait for the replicas to catch up UNTIL the client issues a COMMIT …and even then it’s only a single individual connection which waits. This has many interesting properties.

One benefit is throughput and performance. It means that much of the database workload is actually asynchronous – which tends to work pretty well. The replication stream operates in parallel to the primary workload.

But an interesting drawback is that you can get into situations where the primary can speed ahead of the replica quite a bit before that COMMIT statement hits and then the specific client who issued the COMMIT will need to sit and wait for a while. It also means that bulk operations like pg_repack or VACUUM FULL or REFRESH MATERIALIZED VIEW or COPY do not have anything to throttle them. They will generate WAL basically as fast as it can be written to the local disk. In the meantime, everybody else on the system will see their COMMIT operations start to exhibit dramatic hangs and will see apparent sudden performance drops – while they wait for their commit record to eventually get replicated by a lagging replication stream. It can be non-obvious that this performance degradation is completely unrelated to the queries that appear to be slowing down. This is the infamous IPC:SyncRep wait event.

Another drawback: as the replication stream begins to lag, the amount of disk needed for WAL storage balloons. This makes it challenging to predict the required size of a dedicated volume for WAL. A system might seem to have lots of headroom, and then a pg_repack on a large table might fill the WAL volume without warning.
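
To put a number on how far a replica has fallen behind, LSNs can be converted to byte offsets and subtracted. Inside the database, pg_wal_lsn_diff() does this directly. The shell helper below is a minimal sketch for LSN values that have already been captured as text. The function names lsn_to_bytes and wal_lag_bytes are made up for this example, and it assumes the usual hex "high/low" LSN text format.

```shell
#!/bin/sh
# Sketch only: convert a textual LSN such as 16/B374D848 into a byte offset.
# ASSUMPTION: the part before the slash is the high 32 bits in hex and the
# part after the slash is the low 32 bits in hex.
lsn_to_bytes() {
  hi=${1%%/*}   # text before the slash
  lo=${1##*/}   # text after the slash
  echo $(( (0x$hi << 32) + 0x$lo ))
}

# Bytes of WAL between a newer LSN ($1) and an older LSN ($2),
# for example the primary's current LSN and a replica's replay LSN.
wal_lag_bytes() {
  echo $(( $(lsn_to_bytes "$1") - $(lsn_to_bytes "$2") ))
}
```

For example, wal_lag_bytes 1/0 0/FFFFFF00 prints 256. Tracking this number over time for each replica gives a rough idea of how much extra WAL the primary's volume may need to hold while the replica catches up.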

This is a bit different from storage-level synchronous replication. With storage-level replication, each IO operation performing a write to the disk needs to be replicated. Postgres has a single WAL stream – so if any connection issues a COMMIT then postgres will immediately fsync the entire WAL stream up to that point – including all of the WAL for the bulk operation. In this way, the fsync works a little bit like the IPC:SyncRep wait – however I have a sense that fsync somehow introduces more backpressure into the system as a whole and likely provides at least a small amount of healthy throttling for large bulk operations.

When your workload consists ONLY of small short transactions, Postgres database-level replication can work really well and there’s back-pressure that keeps the database system in equilibrium. The replica won’t fall far behind, because each individual transaction pauses at COMMIT. The problem is when you start injecting those big bulk operations with no back-pressure to throttle them.

This is also the reason why autovacuum_vacuum_cost_delay of zero can cause chaos and is a bad idea; it unleashes a vacuum running at full speed and generates massive & bursty amounts of WAL for large busy tables, as fast as it can write to the disk.

If you’re seeing the IPC:SyncRep wait event then one of the first things you should do is analyze your WAL activity. Something along these lines might be useful, if you’re debugging in real time (or add something similar to your monitoring system):

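# one-time setup: pg_walinspect provides pg_get_wal_stats()
psql --csv -Xtc "create extension if not exists pg_walinspect"
# seed the file with a starting timestamp and LSN
psql --csv -Xtc "select now(),pg_current_wal_lsn()"  >>wal-data.csv

while true; do
  # take the LSN from the last data row; data rows start with a 4-digit year,
  # which also skips the CSV header lines that psql writes on each iteration
  NEXTWAL=$(grep -E '^[0-9]{4}-' wal-data.csv|tail -1|cut -d, -f2)
  psql --csv -c "SELECT now(),pg_current_wal_lsn(),*
          FROM pg_get_wal_stats('$NEXTWAL', pg_current_wal_lsn())" >>wal-data.csv
  echo "$(date) - $NEXTWAL"
  sleep 1
done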
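
Once wal-data.csv has some data in it, a quick summary by resource manager can show what kind of activity is producing the WAL. This is a rough sketch rather than a polished tool. The function name summarize_wal_csv is made up, and it assumes that with now() and the LSN in front, the resource manager lands in column 3 and combined_size in column 10 of the pg_get_wal_stats output.

```shell
#!/bin/sh
# Sketch only: total WAL bytes per resource manager from a CSV file shaped
# like wal-data.csv, largest first.
# ASSUMPTIONS: column 3 = resource_manager/record_type, column 10 = combined_size,
# and no field contains an embedded comma.
summarize_wal_csv() {
  awk -F, '
    # keep only data rows: they start with a 4-digit year and have all columns
    $1 ~ /^[0-9][0-9][0-9][0-9]-/ && NF >= 10 { bytes[$3] += $10 }
    END { for (rm in bytes) printf "%.0f %s\n", bytes[rm], rm }
  ' "$1" | sort -rn
}
```

Running summarize_wal_csv wal-data.csv prints one line per resource manager with its total bytes. If the top entries line up with a VACUUM FULL, pg_repack or COPY that is running, that points at a bulk operation rather than at the queries that appear slow.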

One potential idea for fixing this would be to add code into postgres vacuum and refresh materialized view and repack and copy which checks the value of the synchronous_commit parameter and performs periodic pauses according to how it’s set. This is a bit like the idea of doing “batch commits” during large bulk data loads, but we don’t need a real commit – we just need to periodically wait for the remote LSN to catch up, according to the value of synchronous_commit. This would provide a bit more healthy back-pressure to throttle those bulk operations, and might protect the rest of the system from such dramatic negative impact.

It might also be good to come up with some monitoring queries which can make it clear when a single connection is flooding the WAL stream with one bulk operation, versus an aggregate total across many write-heavy connections.