Linux – Ardent Performance Computing

Openclaw is Spam, Like Any Other Automated Email

Jeremy — Mon, 23 Feb 2026 01:23:08 +0000

Open Source communities are trying to quickly adapt to the present rapid advances in technology. I would like to propose some clarity around something that should be common sense.

Automated emails are spam. They always have been. Openclaw (and whatever new thing surfaces this summer) is no different.

Policies saying automated emails/messages are banned – including anything AI generated – are not only common-sense policies, they aren’t even a change from how we’ve always worked. This includes automated comments on github issues, automated PRs, automated patch submissions, and even any kind of automated review. Copilot automated reviews, snyk, etc – are ok if-and-only-if they’re configured by the owners of the repo/project. Common sense.

Enforcement of these policies – more than ever – depends on trust and relationships. I do think, for example, that non-native English speakers should be allowed to use AI to help them check their English. Used responsibly, AI tools can help a lot with language learning! Your grammar checker is probably based on some kind of LLM anyway. But I’m saying that a human always presses the “send” button on the message, and this human is responsible for the words they sent. If moderators suspect automated messages, every open source project should have a policy they can cite for blocking/banning the account.

Tomas Vondra’s article “the AI inversion” is the latest of many good and thought-provoking pieces I’ve read, and it’s well worth the read. He’s getting at deeper problems than what I’m writing about here, and he has very good reasons to have a much deeper level of concern for the impact of AI tooling on open source communities. These are interesting times and we don’t have all the answers yet.

A few more things I’ve recently read, which I think are good:

CloudNativePG AI policy – https://github.com/cloudnative-pg/governance/blob/main/AI_POLICY.md
Linux Foundation AI Policy – https://www.linuxfoundation.org/legal/generative-ai
Bryan Cantrill, Oxide RFD – https://rfd.shared.oxide.computer/rfd/0576
Russ Cox on GoLang and AI – https://groups.google.com/g/golang-dev/c/4Li4Ovd_ehE/m/8L9s_jq4BAAJ?pli=1
Jordan Tigani about AI @ MotherDuck – [long painful URL for LinkedIn post]

I’ve also been writing bits and pieces of partial thoughts over the past week or two – my short blog post about the Scott Shambaugh situation (And thank you to Kim Bruning for the thoughtful email exchanges about this blog! Please continue to keep this old guy on his toes, reasoning through things, and challenging his thinking!)

There have been a bunch of LinkedIn messages too; capturing them here:

Mischa van den Burg wrote a LinkedIn post about whether ChatGPT in interviews is a red flag
- Brad Nicholson said “As someone that knows how to find that sort of info command line and has done so many, many times – I’d go to chat first, google second and the man pages last because the first two get me what I need faster than reading a man page.”
  
- Replying to Brad:
  
  “I do the same thing, but we also understand this is in descending order of hallucination likelihood
  
  one of my favorite ways to use agents is to write me a script that demonstrates a behavior they claim… by the time the script is working, the claim is often significantly revised – and at present i still usually have to prevent them from making the test script work by moving the goalposts”
  
Replying to Phil Eaton’s post about Russ Cox’s perspective on golang project approach (policy?) for AI:
- Russ Cox’s message is here
- i said “yes – the section here is a good excerpt” (referring to Phil’s excellent choice of what to screenshot)
  
Replying to Kelsey Hightower’s post “Generative AI is a slop generation machine by default. You have to put in a lot of work to get something of quality from it.”
- It’s the same work I did before, just shifted left. I’m iterating on low-level detailed design spec and autogenerating code, rather than iterating on the code and trying to keep design docs in sync. I think of it as writing more of my code in detailed prose, flowcharts, sequence diagrams, and pseudocode – rather than writing it directly in the programming language and manually keeping the design docs in sync. But it’s the same work, minus time spent on syntax (which was never where the value was).
  
- Replying to Adam Jacob’s comment: I think it remains true that “you get out what you put in”
  
Replying to Jordan Tigani’s post about MotherDuck AI policy:
- Tricky topic. I built a deeply detailed design for overhauling how auth works on a core platform…
  * 291 prompts across 15 sessions over 3 days, comprising ~1,226 lines of prompt text
  * final design document is 1,904 lines of markdown — a ratio of roughly 2 lines of human-written prompt for every 3 lines of design document output
  
  a review of the full transcript showed a number of interesting characteristics of my prompts, including:
  * Persistent effort to simplify the tool’s initial proposals
  * Directly contributing critical domain knowledge
  * Frequent insistence on precise terminology
  
  Overall I’m satisfied and I think it’s a good doc that would have taken me 10x longer otherwise (especially research portions) – but I acknowledge mixed feelings.
  
  one thing that’s clear: i obviously got the AI game backwards. i thought it was scored like golf, where a lower ratio of input-to-output is a better score
  
My own LinkedIn post: “We need to re-think OSS contribution attribution in light of AI. More than ever, it’s important for committers to give credit on where the ideas are coming from. A committer can copy/paste someone else’s ideas into their own prompts, and they need to give appropriate credit.”

Mentioning the Oxide RFD in my reply to Daniel Gustafsson:
- my thought is around crediting someone who participates meaningfully in the discussion, even if they didn’t author the final patch. a wall of text email that nobody entirely reads is not a meaningful contribution – but there are lots of ways AI can be part of a well-written email. it’s hard to find the objective line though about what this means.
  
  email moderation is going to get harder. trust and relationships were always important, and now even more so. i think using AI for research or to assist with writing is a net positive – as long as the final written product is concise and well-communicated and understood by the author. AI is the tool, but it’s finally still a human relationship. oxide’s RFD is good https://rfd.shared.oxide.computer/rfd/0576 – responsibility, rigor and empathy remain fundamental. old-school email lists might have a small advantage here. and i hope we can stay open to new people who seem interested to join and contribute
  
Replying to Adam Jacob’s post: “If you’re thinking to yourself “this 10x increase in capability to create software doesn’t matter, because writing software was never the bottleneck”, you’re drawing the wrong conclusions from a true statement. … [skipping middle section, but go read the whole thing bc its good] … We will rebuild everything around this capability. Everything.”
- what people miss: it doesn’t need to be 10x more code, it can be same code 10x faster (which often is very little code — but it would have taken much longer to get it right)
  
  but Adam why are you telling everyone? i’m having so much fun right now, and once everyone figures it out then we’ll be back to the usual drill…
  
I wrote a LinkedIn post about how I think moderation will get harder, then clarified a bit in reply to Andreas Scherbaum by pointing to Tomas Vondra’s blog because that’s much better than what I said.

The Scott Shambaugh Situation Clarifies How Dumb We Are Acting

Jeremy — Fri, 13 Feb 2026 19:05:59 +0000

Edit: related blog published Feb 22 – Openclaw is Spam, Like Any Other Automated Email

My personal blog here is dedicated to tech geek material, mostly about databases like postgres. I don’t get political, but at the moment I’m so irritated that I’m making the extraordinary exception to veer into the territory of flame-war opinionating…

This relates to Postgres because Scott is a volunteer maintainer on an open source project called matplotlib and the topic is something that we are all navigating in the open source space. Last night at the Seattle Postgres User Group meetup Claire Giordano gave a presentation about how the postgres community works and this was one of the first topics that came up in the Q&A at the end! Like every open source project, Postgres is trying to figure out how to deal with the rapid change of the industry as new, powerful, useful AI tools enable us to do things we couldn’t do before (which is great). Just two weeks ago, the CloudNativePG project released an AI Policy which builds on work from the Linux Foundation and discussion around the Ghostty policy. We’re in the middle of figuring this out and we’re working hard.

Just now, I saw this headline on the front page of the Wall Street Journal:

I personally find this to be outright alarming. And it’s the clearest expression that I’ve seen of the deeply wrong, deeply concerning language we’ve all been observing. Many of us in tech communities are complicit in this, and now even press outlets like the WSJ are joining us in complicity.

Corrected headline: Software Engineer Responsible for Bullying, Due to Irresponsible Use of AI, Has Not Yet Apologized

This article uses language I hear people use all the time in the tech community: Several hours later, the bot apologized to Shambaugh for being “inappropriate and personal.”

This language basically removes accountability and responsibility from the human who configured an AI agent with the ability to publish content that looks like a blog with zero editorial control. I haven’t looked deeply, but it seems like there may not be clear attribution of which human is responsible for this content.

We all need to collectively take a breath and stop repeating this nonsense. A human created this, manages this, and is responsible for this.

It’s one thing when I hear this dumb language on LinkedIn, but I’m alarmed to see it on the front page of a major media outlet like the Journal.

Our contributions to dialogue in the tech industry – on LinkedIn, at meetups, with coworkers, at conferences, on other social media, etc – these all make small contributions to our culture. Poor American culture seems in a weird cycle sometimes of taking a very long time to acknowledge very common-sense things, because vested interests (often with much financial motivation) want to push a certain narrative and everyone knows it’s bunk but nobody says so. Personally I think this applies to a wide array of issues, not just tech.

Folks, please speak up about stuff that’s stupid obvious. Bullying of open source maintainers should be alarming to us, and whoever the person is that’s responsible for this needs to step up and take responsibility. Personally.

And we all need to dial back this over-the-top anthropomorphizing of useful electronic gadgets that we’re building and selling.

Postgres client_connection_check_interval

Jeremy — Thu, 05 Feb 2026 04:54:40 +0000

Saw this post on LinkedIn yesterday:

I also somehow missed this setting for years. And it’s crazy timing, because it’s right after I published a blog about seeing the exact problem this solves. In my blog post I mentioned “unexpected behaviors (bugs?) in… Postgres itself.” Turns out Postgres already has the fix; it’s just disabled by default.

It was a one-line change to add the setting to my test suite and verify the impact. As a reminder, here’s the original problematic behavior which I just now reproduced again:

At the T=20sec mark, TPS drops from 700 to around 30. At T=26sec the total connections hit 100 (same as max_connections) and then TPS drops to almost zero. This total system outage continues until T=72sec when the system recovers after the blocking session has been killed by the transaction_timeout setting.

So what happens if we set client_connection_check_interval to 15 seconds? Quick addition to docker-compose.yml and we find out!

Fascinating! The brown line and the red line are the important ones. As before, the TPS drops at T=20sec and zeros out after we hit max_connections. But at T=35sec we start to see the total connection count slowly decrease! This continues until T=42sec when the PgBouncer connections are finally released – and at this point we repeat the whole cycle a second time, as the number of total connections climbs back up to the max.

So we can see that the 15 second client_connection_check_interval setting is working exactly as expected (if a little slowly) – at the 15 second mark Postgres begins to clean up the dead connections.

What if we do a lower setting like 2 seconds?

This looks even better! The total connections climbs to around 30-ish and holds stable there. And more importantly, the TPS never crashes out all the way to zero and the system is able to continue with a small workload until the blocking session is killed.

There is definitely some connection churn happening here (expected due to golang context timeouts) and with Postgres taking 2 seconds to clear them out, equilibrium is apparently around 30. A higher attempted TPS would bring this value higher.
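The equilibrium can be approximated with Little's law: dead connections still held by Postgres ≈ the rate at which clients abandon connections × how long each one lingers before Postgres notices. This is only a rough sketch. It assumes every abandoned connection lingers for the full check interval (a worst case), and the churn rate of 15 abandoned connections per second is a hypothetical value picked because it lines up with the ~30 seen in the graph, not something measured by the test suite.

```python
MAX_CONNECTIONS = 100  # same as max_connections in the test suite


def lingering_connections(abandoned_per_sec: float, check_interval_sec: float) -> float:
    """Little's law estimate: items in system = arrival rate * time in system."""
    return abandoned_per_sec * check_interval_sec


ASSUMED_CHURN = 15.0  # hypothetical abandoned connections per second

for interval in (15.0, 2.0, 0.5):
    estimate = lingering_connections(ASSUMED_CHURN, interval)
    capped = min(estimate, MAX_CONNECTIONS)  # Postgres refuses connections past the limit
    print(f"interval={interval:>4}s  estimate={estimate:6.1f}  capped={capped:6.1f}")
```

Under these assumptions a 15 second interval still runs into max_connections, while shorter intervals settle well below it. Doubling the churn rate doubles the estimate, which is the sense in which a higher attempted TPS would raise the equilibrium.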

Let’s try one more time with an even lower setting of 500ms:

The TPS seems around the same and this time the connection count seems to stay very low.

Finally, let’s take a look at the networking stack from the OS perspective at the number of sockets in CLOSE-WAIT state:

This is where the impact of client_connection_check_interval becomes very clear. Postgres is working exactly as expected and cleaning up dead connections based on the delay that’s specified in this parameter.

I find myself agreeing with Marat on LinkedIn, and I feel like there’s a strong case for giving this parameter a default value.

And now please excuse me while I go update my original blog post.

How Blocking-Lock Brownouts Can Escalate from Row-Level to Complete System Outages

Jeremy — Tue, 20 Jan 2026 04:23:48 +0000

This article is a shortened version. For the full writeup, go to https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion

This test suite demonstrates a failure mode when application bugs which poison connection pools collide with PgBouncers that are missing peer config and positioned behind a load balancer. PgBouncer’s peering feature (added with v1.19 in 2023) should be configured if multiple PgBouncers are being used with a load balancer – this feature prevents the escalation demonstrated here.

The failures described here are based on real-world experiences. While uncommon, this failure mode has been seen multiple times in the field.

Along the way, we discover unexpected behaviors (bugs?) in Go’s database/sql (or sqlx) connection pooler with the pgx client and in Postgres itself.

Sample output: https://github.com/ardentperf/pg-idle-test/actions/workflows/test.yml

The Problem in Brief

Go’s database/sql allows connection pools to become poisoned by returning connections with open transactions for re-use. Transactions opened with db.BeginTx() will be cleaned up, but – for example – conn.ExecContext(..., "BEGIN") will not be cleaned up. PR #2481 adds some cleanup logic in pgx for database/sql connection pools; I tested the PR with this test suite. The PR relies on the TxStatus indicator in the ReadyForQuery message which Postgres sends back to the client as part of its network protocol.
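To make the transaction status indicator concrete: in the Postgres wire protocol, ReadyForQuery is a six-byte message consisting of the type byte 'Z', a four-byte length of 5, and one status byte ('I' idle, 'T' in a transaction block, 'E' in a failed transaction block). The sketch below decodes hand-built messages. It is written in Python for brevity, the safe_to_return_to_pool helper is hypothetical, and it is not how the pgx PR is implemented.

```python
import struct

# Status byte carried by ReadyForQuery, per the Postgres wire protocol.
TX_STATUS = {b"I": "idle", b"T": "in transaction", b"E": "failed transaction"}


def parse_ready_for_query(message: bytes) -> str:
    """Return the transaction status from one raw ReadyForQuery message."""
    msg_type, length = struct.unpack("!cI", message[:5])
    if msg_type != b"Z" or length != 5:
        raise ValueError("not a ReadyForQuery message")
    return TX_STATUS[message[5:6]]


def safe_to_return_to_pool(message: bytes) -> bool:
    """Hypothetical pool check: only an idle connection is safe to hand out again."""
    return parse_ready_for_query(message) == "idle"


# Hand-built examples: what the server would send after COMMIT vs after a raw BEGIN.
after_commit = b"Z" + struct.pack("!I", 5) + b"I"
after_raw_begin = b"Z" + struct.pack("!I", 5) + b"T"

print(safe_to_return_to_pool(after_commit))
print(safe_to_return_to_pool(after_raw_begin))
```

A connection whose last status was 'T' or 'E' still has a transaction open, so returning it to the pool as-is is exactly the poisoning described above.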

A poisoned connection pool can cause an application brownout since other sessions updating the same row wait indefinitely for the blocking transaction to commit or rollback its own update. On a high-activity or critical table, this can quickly lead to significant pile-ups of connections waiting to update the same locked row. With Go this means context deadline timeouts and retries and connection thrashing by all of the threads and processes that are trying to update the row. Backoff logic is often lacking in these code paths. When there is a currently running SQL statement (hung – waiting for a lock), pgx first tries to send a cancel request and then will proceed to a hard socket close.
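As an illustration of the backoff logic that is often missing, here is a minimal sketch of capped exponential backoff with full jitter. It is written in Python for brevity (the same idea carries over to Go), and the function name, base delay, and cap are all hypothetical values rather than anything taken from the test suite.

```python
import random


def backoff_delays(attempts: int, base: float = 0.05, cap: float = 5.0, rng=None):
    """Yield one sleep duration (seconds) per retry: capped exponential backoff with full jitter."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # 0.05, 0.1, 0.2, ... up to the cap
        yield rng.uniform(0.0, ceiling)            # full jitter spreads retries apart


# Seeded only so repeated runs of this sketch print the same delays.
delays = list(backoff_delays(8, rng=random.Random(42)))
print([round(d, 3) for d in delays])
```

Sleeping for each yielded delay before the next retry keeps a crowd of blocked workers from reconnecting in lockstep, which reduces connection thrashing but does nothing about the underlying lock.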

If PgBouncer’s peering feature is not enabled, then cancel requests load-balanced across multiple PgBouncers will fail because the cancel key only exists on the PgBouncer that created the original connection. The peering feature solves the cancel routing problem by allowing PgBouncers to forward cancel requests to the correct peer that holds the cancel key. This feature should be enabled – the test suite demonstrates what happens when it is not.

Postgres immediately cleans up connections when it receives a cancel request. However, by default Postgres does not clean up connections when their TCP sockets are hard closed, if the connection is waiting for a lock. As a result, Postgres connection usage climbs while PgBouncer continually opens new connections that block on the same row. The app’s poisoned connection pool quickly leads to complete connection exhaustion in the Postgres server.

Edit Feb 5: Postgres setting client_connection_check_interval enables dead connection cleanup.

Existing connections will continue to work, as long as they don’t try to update the row which is locked. But the row-level brownout now becomes a database-level brownout – or perhaps a complete system outage (once the Go database/sql connection pool is exhausted) – because postgres rejects all new connection attempts from the application.

Result: Failed cancels → client closes socket → backends keep running → CLOSE_WAIT accumulates → Postgres hits max_connections → system outage

Architecture

The test uses Docker Compose to create this infrastructure with a configurable number of PgBouncer instances.

The Test Scenarios

test_poisoned_connpool_exhaustion.sh accepts three parameters, each covered in the sections below: the number of PgBouncers, the failure mode, and the pool mode.

In this test suite:

- The failure is injected 20 seconds after the test starts.
- Sessions that sit idle in a transaction are terminated and rolled back after 20 seconds (idle_in_transaction_session_timeout).
- Postgres is configured to abort and roll back any transaction that does not complete within 40 seconds (transaction_timeout). Note that the transaction_timeout setting (for total transaction time) should be used cautiously, and is available in Postgres v17 and newer.

PgBouncer Count: 1 vs 2 (nopeers mode)

| Config | Cancel Behavior | Outcome |
|---|---|---|
| 1 PgBouncer | All cancels route to same instance | Cancels succeed, no connection exhaustion |
| 2 PgBouncers | ~50% cancels route to wrong instance | Cancels fail, connection exhaustion |
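The ~50% figure follows from simple probability. If the load balancer picks a PgBouncer uniformly at random for every new TCP connection (an assumption, since real load balancers may use other algorithms or affinity), a cancel request reaches the instance holding the cancel key with probability 1/N. A small sketch with a seeded simulation as a sanity check:

```python
import random


def misroute_probability(pgbouncers: int) -> float:
    """Chance a cancel request lands on a PgBouncer that does not hold the cancel key."""
    return 1.0 - 1.0 / pgbouncers


def simulate_misroutes(pgbouncers: int, cancels: int, seed: int = 1) -> float:
    """Fraction of simulated cancels routed to the wrong instance."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(cancels):
        owner = rng.randrange(pgbouncers)   # instance that opened the original connection
        target = rng.randrange(pgbouncers)  # instance the load balancer picks for the cancel
        wrong += owner != target
    return wrong / cancels


for n in (1, 2, 4):
    print(n, misroute_probability(n), round(simulate_misroutes(n, 10_000), 3))
```

With this model the problem gets worse as more PgBouncers are added without peering: four instances would misroute roughly three out of four cancels.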

Failure Mode: Sleep vs Poison

| Mode | What Happens | Outcome | Timeout |
|---|---|---|---|
| sleep | Transaction with row lock is held for 40 seconds without returning to pool | Normal blocking scenario where lock holder is idle (not sending queries) | Idle timeout fires after 20s, terminates session & releases locks |
| poison | Transaction with row lock is returned to pool while still open | Bug where connections with open transactions are reused | Idle timeout never fires (connection is actively used). Transaction timeout fires after 40s, terminates session and releases locks |

Pool Mode: nopeers vs peers (2 PgBouncers)

| Mode | PgBouncer Config | Cancel Behavior |
|---|---|---|
| nopeers | Independent PgBouncers (no peer awareness) | Cancel requests may route to wrong PgBouncer via load balancer |
| peers | PgBouncer peers enabled (cancel key sharing) | Cancel requests are forwarded to correct peer |

Summary

| PgBouncers | Failure Mode | Pool Mode | Expected Outcome |
|---|---|---|---|
| 2 | poison | nopeers | Database-level Brownout or System Outage – TPS crashes to ~4, server connections max out at 95, TCP sockets accumulate in CLOSE_WAIT state, cl_waiting spikes |
| 1 | poison | nopeers | Row-level Brownout – TPS drops with no recovery (~11), server connections stay healthy at ~11, no server connection exhaustion |
| 2 | poison | peers | Row-level Brownout – TPS drops with no recovery (~15), cl_waiting stays at 0, peers forward cancels correctly |
| 2 | sleep | nopeers | Database-level Brownout or System Outage – Server connection spike to 96, full recovery after lock released and some extra time, system outage vs brownout depends on how quickly the idle timeout releases lock |
| 2 | sleep | peers | Row-level Brownout – No connection spike, full recovery after lock released, no risk of system outage |

Test Results

Transactions Per Second

TPS is the best indicator of actual application impact. It’s important to notice that PgBouncer peering does not prevent application impact from either poisoned connection pools or sleeping sessions. The section below titled “Detection and Prevention” has ideas which address the actual root cause and truly prevent application impact.

After the lock is acquired at t=20, TPS drops from ~700 to near zero in all cases as workers block on the locked row held by the open transaction.

Sleep mode (orange/green lines): Around t=40, Postgres’s idle_in_transaction_session_timeout (20s) fires and kills the blocking session. TPS recovers to ~600-700.

Poison mode (red/purple/blue lines): The lock-holding connection is never idle—it’s constantly being picked up by workers attempting queries—so the idle timeout never fires. TPS remains near zero until Postgres’s transaction_timeout (40s) fires at t=60, finally terminating the long-running transaction and releasing the lock.
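The two recovery times can be derived from the timeout settings. A minimal sketch, assuming the blocking transaction begins at the moment the failure is injected; the function name and defaults are illustrative and simply mirror the values used in this test suite:

```python
def lock_release_time(mode: str, injected_at: int = 20,
                      idle_timeout: int = 20, transaction_timeout: int = 40) -> int:
    """Seconds into the test when Postgres terminates the lock-holding session."""
    if mode == "sleep":
        # The session sits idle in its transaction, so whichever limit is shorter fires.
        return injected_at + min(idle_timeout, transaction_timeout)
    if mode == "poison":
        # The session keeps running queries, so the idle timer keeps resetting
        # and only the total transaction limit applies.
        return injected_at + transaction_timeout
    raise ValueError(f"unknown failure mode: {mode}")


for mode in ("sleep", "poison"):
    print(mode, lock_release_time(mode))
```

This gives the moment the lock is released, not the moment TPS fully recovers, which can lag behind when connections have been exhausted.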

TCP CLOSE-WAIT Accumulation

2 PgBouncers (nopeers) (red/orange lines): CLOSE_WAIT connections accumulate rapidly because:

- Cancel request goes to wrong PgBouncer → fails
- Client gives up and closes socket
- Server backend is still blocked on lock, hasn’t read the TCP close
- Connection enters CLOSE_WAIT state on Postgres

In poison mode (red), CLOSE_WAIT remains at ~95 until transaction_timeout fires at t=60. In sleep mode (orange), CLOSE_WAIT clears around t=40 when idle_in_transaction_session_timeout fires.

1 PgBouncer and peers modes (purple/blue/green lines): Minimal or zero CLOSE_WAIT because cancel requests succeed—either routing to the single PgBouncer or being forwarded to the correct peer.

Connection Pool Wait Time vs PgBouncer Client Wait

Go’s database/sql pool tracks how long goroutines wait to acquire a connection (db.Stats().WaitDuration). PgBouncer tracks cl_waiting—clients waiting for a server connection. These metrics measure wait time at different layers of the stack.

This graph shows 2 PgBouncers in poison mode (nopeers)—the worst-case scenario:

- Total Connections (brown) climb rapidly after poison injection at t=20 as failed cancels leave backends in CLOSE_WAIT
- TPS (green) crashes to near zero and stays there until transaction_timeout fires at t=60
- oldest_xact_age (purple) climbs steadily from 0 to 40 seconds
- Once Postgres hits max_connections - superuser_reserved_connections (95), new connections are refused
- PgBouncer #1 cl_waiting (red) then spikes as clients queue up waiting for available connections
- AvgWait (blue) increases as workers wait for the non-blocked connections to become available

Note the gap between when transaction_timeout fires (t=60, visible as oldest_xact_age dropping to 0) and when TPS fully recovers. TPS recovery correlates with cl_waiting dropping back to zero—PgBouncer needs time to clear the queue of waiting clients and re-establish healthy connection flow. This recovery gap only occurs in nopeers mode; the TPS comparison graph shows that peers mode recovers immediately when the lock is released because connections never exhaust and cl_waiting stays at zero.

Why is AvgWait (blue) so low despite the system being in distress? The poisoned connection (holding the lock) continues executing transactions without blocking—it already holds the lock, so its queries succeed immediately. This one connection cycling rapidly through the pool with sub-millisecond wait times heavily skews the average lower, masking the fact that other connections are blocked.
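A small numeric illustration of that skew, using made-up wait samples (the counts and durations are hypothetical, not taken from the test output). One worker acquiring connections thousands of times with sub-millisecond waits dominates the sample count, so the mean looks healthy while the blocked workers are waiting for seconds:

```python
from statistics import mean, median

# Hypothetical wait samples in milliseconds, one entry per connection acquisition.
poisoned_worker = [0.5] * 2000   # holds the lock, cycles through the pool without blocking
blocked_workers = [5000.0] * 20  # each waits until its context deadline expires

all_waits = poisoned_worker + blocked_workers

print(f"average wait: {mean(all_waits):.1f} ms")
print(f"median wait:  {median(all_waits):.1f} ms")
print(f"slowest wait: {max(all_waits):.1f} ms")
```

An average taken this way hides the blocked workers almost entirely, which is why a maximum or a high percentile is a more useful signal here than the mean.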

The cl_waiting metric is collected as cnpg_pgbouncer_pools_cl_waiting from CloudNativePG. See CNPG PgBouncer metrics.

Detection and Prevention

Monitoring and Alerting:

Alert on:

- Number of backends waiting on locks over some threshold
- cnpg_backends_total showing established connections at a high percentage of max_connections
- cnpg_backends_max_tx_duration_seconds showing transactions open for longer than some threshold (nb. long-running queries are often legitimate)

-- Count backends waiting on locks
SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';
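For the connection-count alert, one way to express "a high percentage of max_connections" is as a fraction of the non-reserved slots. This is a sketch only: the reserved value of 5 is inferred from the 100 and 95 figures in this writeup, and the 80% threshold and function names are hypothetical.

```python
def connection_usage(backends: int, max_connections: int = 100, reserved: int = 5) -> float:
    """Fraction of non-reserved connection slots currently in use."""
    usable = max_connections - reserved  # slots available to ordinary clients
    return backends / usable


def should_alert(backends: int, threshold: float = 0.8) -> bool:
    """Hypothetical alert rule: fire once usage reaches the threshold."""
    return connection_usage(backends) >= threshold


for backends in (11, 80, 95):
    print(backends, f"{connection_usage(backends):.0%}", should_alert(backends))
```

The input would come from a metric such as cnpg_backends_total; measuring against usable slots rather than raw max_connections means 100% corresponds to the point where new client connections start being refused.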

Prevention Options:

Options to prevent the root cause (connection pool poisoning):

- Find and fix connection leaks in the application – ensure all transactions are properly committed or rolled back
- Use OptionResetSession callback – automatically discard leaked connections (see below)
- Fix at the driver level – PR #2481 adds automatic detection in pgx

Options to prevent the escalation from row-level brownout to system outage:

- Enable PgBouncer peering – if using multiple PgBouncers behind a load balancer, configure the peer_id and [peers] section so cancel requests are forwarded to the correct instance (see PgBouncer documentation). This prevents connection exhaustion but does not prevent the TPS drop from lock contention.
- Use session affinity (sticky sessions) in the load balancer based on client IP – ensures cancel requests route to the same PgBouncer as the original connection (see HAProxy Session Affinity example below)

Options to limit the duration/impact:

Set appropriate timeout defaults – configure system-wide timeouts to automatically terminate problematic sessions:
- idle_in_transaction_session_timeout – terminates sessions idle in a transaction (e.g., 5min)
- transaction_timeout (Postgres 17+) – use caution; limits total transaction duration regardless of activity (e.g., 30min)

Postgres:

Edit Feb 5: Postgres setting client_connection_check_interval enables dead connection cleanup.

Results Summary, Understanding the Layers Leading to the System Outage, Unique Problems, and more - available in the full writeup at https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion

KubeCon 2025: Bookmarks on Memory and Postgres

Jeremy — Sun, 16 Nov 2025 22:55:31 +0000

Just got home from KubeCon.

One of my big goals for the trip was to make some progress in a few areas of postgres and kubernetes – primarily around allowing more flexible use of the linux page cache and avoiding OOM kills with less hardware overprovisioning. When I look at Postgres on Kubernetes, I think there are idle resources (both memory and CPU) on the table with the current Postgres deployment models that generally use guaranteed QoS.

Ultimately this is about cost savings. I think we can still run more databases on less hardware without compromising the availability and reliability of our database services.

The trip was a success, because I came home with lots of reading material and homework!

Putting a few bookmarks here, mostly for myself to come back to later:

- key place for discussion is sig-node
- documentation on node-pressure eviction https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
  - eviction signal thresholds can be customized
  - it looks like priority classes give a lot of control over the order in which pods are evicted
- documentation on priority classes https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/
- cgroups v2 memory controller documentation https://docs.kernel.org/admin-guide/cgroup-v2.html#memory
- long running github issue about pod evictions due to kubernetes (incorrectly?) interpreting active page cache as working memory that won’t be reclaimed https://github.com/kubernetes/kubernetes/issues/43916
- new feature MemoryQOS – still alpha (feature gate off-by-default)
  - KEP https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos/
    - currently stalled – related message from Linux Kernel Mailing Lists https://lkml.org/lkml/2023/6/1/1300
    - “Future: memory.high can be used to implement kill policies in for userspace OOMs, together with Pressure Stall Information (PSI). When the workloads are in stuck after their memory usage levels reach memory.high, high PSI can be used by userspace OOM policy to kill such workload(s).”
  - Nov 2021 blog https://kubernetes.io/blog/2021/11/26/qos-memory-resources/
  - May 2023 blog https://kubernetes.io/blog/2023/05/05/qos-memory-resources/
  - Brief mention in docs https://kubernetes.io/docs/concepts/workloads/pods/pod-qos/#memory-qos-with-cgroup-v2
- metrics added to CAdvisor for both active and inactive page cache https://github.com/google/cadvisor/pull/3445
- metric added for PSI https://kubernetes.io/blog/2025/09/04/kubernetes-v1-34-introducing-psi-metrics-beta/
- homework – taking a closer look at anonymous memory and page cache metrics (both active and inactive) for real postgres databases on kubernetes
- homework – set up tests that emulate the diagram below and demonstrate the eviction behavior that i think will happen
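As a starting point for the first homework item, here is a rough sketch of pulling anonymous memory and page cache numbers out of a cgroup v2 memory.stat file. The field names (anon, file, active_file, inactive_file) come from the cgroup v2 memory controller documentation; the sample values are hypothetical. The working set calculation (usage minus inactive_file) reflects my understanding of how cAdvisor derives the number that eviction decisions are based on, so treat it as an assumption to verify.

```python
# Hypothetical memory.stat excerpt; values are in bytes.
SAMPLE_MEMORY_STAT = """\
anon 2147483648
file 6442450944
active_file 4294967296
inactive_file 2147483648
"""

GIB = 1024 ** 3


def parse_memory_stat(text: str) -> dict[str, int]:
    """Parse the 'key value' lines of a cgroup v2 memory.stat file."""
    stats = {}
    for line in text.splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats


def working_set_bytes(usage: int, stats: dict[str, int]) -> int:
    """Usage minus inactive page cache; active page cache still counts as working set."""
    return max(0, usage - stats["inactive_file"])


stats = parse_memory_stat(SAMPLE_MEMORY_STAT)
usage = stats["anon"] + stats["file"]  # simplification: ignores kernel and other memory

print(f"anonymous:   {stats['anon'] / GIB:.1f} GiB")
print(f"page cache:  {stats['file'] / GIB:.1f} GiB")
print(f"working set: {working_set_bytes(usage, stats) / GIB:.1f} GiB")
```

In this made-up example only 2 GiB is anonymous memory, yet the working set comes out at 6 GiB because the active page cache is counted, which is the behavior the long-running github issue is about. Inside a container the real file is typically at /sys/fs/cgroup/memory.stat.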

I still have a lot of catching up to do. I sketched out the diagram below, but please take this with a large grain of salt – this aspect of kubernetes is complex and linux memory management is complex:

I tried to summarize some thoughts in a comment on the long-running github issue, but this might be wrong – it’s just what I’ve managed to piece together so far.

My “user story” is that (1) I’d like higher limit and more memory over-commit for page cache specifically – letting linux use available/unused memory as needed for page cache and (2) I’d like lower request to get scheduling closer to actual anonymous memory needs. I’m running Postgres. In the current state, I have to simultaneously set an artificially low limit on per-pod page cache (to avoid eviction) and artificially high request on per-pod anonymous memory (to avoid OOM by getting oom_score_adj). I’d like individual pods able to burst anonymous memory usage (eg. an unexpected SQL query that hogs memory), if we can steal from page cache of other pods beyond their request – avoiding OOM. The linux kernel can do this; I think it should be possible with the right cgroup settings?

It seems like the new Memory QOS feature might be assigning a static calculated value to memory.high – but for page cache usage, I wonder if we actually want kubernetes to dynamically adjust memory.high eventually as low as request in an attempt to reclaim node-level resources – before evicting end-user pods – when the memory.available eviction signal has exceeded the threshold?

Anyway it’s also worth pointing out that the postgres problems are likely accentuated by higher concentrations of postgres on nodes; if databases are spread across large multi-tenant clusters that likely mitigates things a bit.

Edit 11/29: Alexey Demidov replied on the github issue and pointed out the problem; the linux kernel throttles CPU of processes when we use memory.high so this probably makes my idea above ineffective.

Data Safety on a Budget

Jeremy — Sun, 05 Oct 2025 05:39:31 +0000

Many experienced DBAs joke that you can boil down the entire job to a single rule of thumb: Don’t lose your data. It’s simple, memorable, and absolutely true – albeit a little oversimplified.

Mark Porter’s Cultural Hint “The Onion of our Requirements” conveys the same idea with a lot more accuracy:

We need to always make sure we prioritize our requirements correctly. In order, we think about Security, Durability, Correctness, Availability, Scalability via Scale-out, Operability, Features, Performance via Scaleup, and Efficiency. What this means is that for each item on the left side, it is more important than the items on the right side.

But this does not tell the whole story. If we’re honest, there is one critical principle of equal importance to everything on this list: Don’t lose all your money.

Every adult who’s managed their own finances knows we don’t have infinite money. Yes we want to keep the data safe. We also want to be smart about spending our money.

Relational databases are one of the most powerful and versatile places to store your data – and they are also one of the most expensive places to store your data. Just look at the per-GB pricing of block storage with provisioned IOPS and low latency, then compare with the pricing of object storage. No contest. Any time a SQL database is beginning to approach the TB range, we definitely should be looking at the largest tables and asking whether significant portions of that data can be moved to cheaper storage – for example parquet files on S3. (Or F3 files?)

Of course, sometimes we need fast powerful SQL and joins and transactions. So relational databases also should run as efficiently as possible. This has direct implications around how we keep the data safe.

From personal photos to enterprise databases, the core of all data safety is copies of the data. Logs and row-store/column-store files (and indexes) are data copies in different formats. You could almost parse the entire database industry through a lens that compares how each technology is just a unique way to replicate data between different formats and places. The revered and time-honored “3-2-1 Backup Rule” is all about copies of the data. From an information theory standpoint, it can be argued that even RAID5 parity, checksums, CRCs, and hashes are a shadow or fingerprint “copy” of the original – even though they aren’t literal full copies of the data.

One of my favorite cultural hints from Mark is: Don’t Let Entropy Win.

In the absence of people making things better, they will get worse. It’s just a fact.

This isn’t Mark’s point, but I think it’s a related concept: at every business that’s successful enough to grow large, there is a natural gravitation toward forming silos of technology. I think of this as a kind of entropy that we need to actively counteract in every large business. Let’s look at an example where an enterprise business team building a public API needs a 600GB write-intensive database. Suppose we can buy enterprise grade high-endurance NVMe SSDs (handling write-intensive database workloads) for $1000 each. How much will the storage cost to “keep the data safe” for this public API?

The business team provisions three environments: one for production and two more for development and testing.
For business continuity in case of regional problems, the database team creates primary and replica CloudNativePG clusters, so that we are able to run from either of our two regions.
To maintain high availability, the database team configures CloudNativePG with three instances within each region and they configure preferred anti-affinity so that kubernetes will attempt to schedule the three instances in different buildings or availability zones.
Persistent storage is provided by the storage team who configures ceph volumes backed by two mirror copies.
Object storage for backups uses two mirror copies.
Servers are built by the infrastructure team who configure RAID 1 (mirroring).

In the worst case, we can easily end up spending $96,000 on disks alone – for a database that can fit on a single $1000 enterprise drive! Now that is some crazy storage amplification.
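One way to arrive at that worst-case figure is to multiply out every layer of redundancy in the list above. The sketch below rests on an assumption that is not spelled out in the list: each environment keeps one backup set per region, and a backup set fits on a single drive just like the database does.

```python
# Worst-case drive count when every team adds its own redundancy layer.
DRIVE_COST_USD = 1000

environments = 3   # production, development, testing
regions = 2        # primary and replica CloudNativePG clusters
instances = 3      # CNPG instances per region
ceph_copies = 2    # ceph mirror copies per volume
raid1_copies = 2   # RAID 1 mirroring on each server

# Assumption: one backup set per environment per region, one drive each.
backup_sets = environments * regions
object_store_copies = 2

database_drives = environments * regions * instances * ceph_copies * raid1_copies
backup_drives = backup_sets * object_store_copies * raid1_copies
total_cost = (database_drives + backup_drives) * DRIVE_COST_USD

# Physical copies of one database inside one data center.
copies_per_data_center = instances * ceph_copies * raid1_copies

print(database_drives, backup_drives, total_cost, copies_per_data_center)
```

Under these assumptions the database volumes account for 72 drives and the backups for another 24. A different backup layout would change the backup term but not the 72 drives behind the databases.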

In order to take a smarter approach, let’s work backwards from the problems we’re solving. When we say “keep the data safe” – what are some specific situations we want to protect the data from?

Unavailability during maintenance & deployments at all levels of the stack
Operational mistakes
Software bugs at all levels of the stack, from business app to firmware
Hardware failures of disks
Hardware failures of servers/compute which can make good disks temporarily inaccessible
External threats from direct attacks, malware, social engineering, supply chain attacks, etc
Insider threats arising from situations like personal grievances or personal financial pressures
Natural disasters (and perhaps political disasters…)

Armed with a list, we can now ask ourselves: what is an economical solution that addresses everything here? There isn’t one right answer but we probably don’t need 12 physical copies of each database per data center. A few ideas:

Three CNPG instances that use local SSD storage directly (no hardware RAID), for a total of three copies in the primary data center.
Two or three CNPG instances that use either ceph block storage or local SSD with hardware RAID (but not both) for a total of four or six copies in the primary data center.
A single CNPG instance in the second data center, with the capability to dynamically add instances on switchovers/failovers.
Slower, less expensive disks for development databases.
No CNPG instance for immediate switchover/failover of development databases in second data center.
Testing tier that matches production config but can be provisioned on demand from backups for load testing, and deprovisioned when unused for some period of time. Development tier also provisioned on demand and deprovisioned when unused for some period of time.

There are many ways to keep data safe on a reasonable budget – these are just a few ideas.

Graviton2 versus Graviton4

Jeremy — Mon, 04 Aug 2025 05:23:11 +0000

Just a short post, because I thought this was pretty remarkable. Below, I have screenshots showing the CPU utilization of two AWS instances in us-west-2 which are running an identical workload.

They are running the CloudNativePG playground, which is a production-like learning and testing environment (all running virtually inside the single ec2 instance, which can be easily started and stopped or terminated and recreated). The standard CNPG playground setup consists of two Kubernetes clusters named kind-k8s-eu and kind-k8s-us. Each Kubernetes cluster contains a CloudNativePG cluster with HA between three postgres replicas running across three nodes/servers locally, and then there is cross-cluster replication from the EU cluster (primary) to the US cluster (standby).

What jumped out at me was the huge difference in CPU utilization! The Graviton2 instance runs maybe 40% utilization while the Graviton4 instance runs around 10% utilization.

I just now checked the AWS on-demand pricing page, and m6g.xlarge is $0.154/hr while m8g.xlarge is $0.17952/hour. That is a 16.6% increase in price, and for this particular workload it could be as much as a 300% increase in performance. At a fleet level, this should translate into significant cost saving opportunities for anyone who adopted Graviton2, if they are able to scale down overall instance counts based on the better performance of newer generation chips.
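The arithmetic behind those percentages can be sketched as follows. It assumes CPU utilization is a fair proxy for capacity on this workload (a simplification), and the utilization figures are the rough values read off the screenshots rather than measurements.

```python
m6g_xlarge_usd_per_hr = 0.154     # Graviton2 on-demand price quoted above
m8g_xlarge_usd_per_hr = 0.17952   # Graviton4 on-demand price quoted above

graviton2_cpu_util = 0.40         # approximate, from the screenshots
graviton4_cpu_util = 0.10         # approximate, from the screenshots

price_ratio = m8g_xlarge_usd_per_hr / m6g_xlarge_usd_per_hr
price_increase_pct = (price_ratio - 1) * 100

# Same workload at a quarter of the utilization: roughly 4x the work per instance.
throughput_ratio = graviton2_cpu_util / graviton4_cpu_util
performance_increase_pct = (throughput_ratio - 1) * 100

# Cost per unit of work on Graviton4 relative to Graviton2.
relative_cost_per_work = price_ratio / throughput_ratio

print(f"price increase: {price_increase_pct:.1f}%")
print(f"performance increase: {performance_increase_pct:.0f}%")
print(f"relative cost per unit of work: {relative_cost_per_work:.2f}")
```

If the assumption holds, the same work costs a bit under a third as much on the newer generation, which is where fleet-level savings would come from.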

Honestly, 40% utilization is technically fine for most of my own Kubernetes and Postgres experiments… but the 16.6% price increase is just low enough that I’ll probably start using the m8g instances anyway.

Things like this also underscore why it’s so hard to compare processors… how can we compare across different families, when we see differences like this between generations WITHIN a family?! Besides that, the total number of different processor choices we have today is overwhelming, taking into account all the different providers in the market. It’s a tough job.

Collation Torture Test versus Debian

Jeremy — Fri, 23 May 2025 05:37:21 +0000

Collation torture test results are finally finished and uploaded for Debian.

https://github.com/ardentperf/glibc-unicode-sorting

The test did not pick up any changes in en_US sort order for either Bullseye or Bookworm.

Buster has glibc 2.28 so it shows lots of changes – as expected.

The postgres wiki had claimed that Jessie(8) to Stretch(9) upgrades were safe. This is false if the database contains non-english characters from many scripts (even with en_US locale). I just now tweaked the wording on that wiki page. I don’t think this is new info; I think it’s the same change that showed up in the Ubuntu tables under glibc 2.21 (Ubuntu 15.04)

FYI – the changelist for Stretch(9) does contain some pure ascii words like “3b3” but when you drill down to the diff, you see that it’s only moving a few lines relative to other strings with non-english characters:

@@ -13768521,42 +13768215,40 @@ $$.33
༬B༬
3B༬
3B-༬
-3b3
3B༣
3B-༣
+3B٣
+3B-٣
+3b3
3B3

In the process of adding Debian support to the scripts, I also fixed a few bugs. I’d been running the scripts from a Mac but now I’m running them from an Ubuntu laptop and there were a few minor syntax things that needed updating for running on Linux – even though, ironically, when I first started building these scripts it was on another Linux before I switched to Mac. I also added a file size sanity check, to catch if the sorted string-list file was only partly downloaded from the remote machine running some old OS (… realizing this MAY have wasted about an hour of my evening yesterday …)

The code that sorts the file on the remote instance is pretty efficient. It does the sort in two stages and the first stage is heavily parallelized to utilize whatever CPU is available. Lately I’ve mostly used c6i.4xlarge instances and I typically only need to run them for 15-20 minutes to get the data and then I terminate them. The diffs and table generation run locally. On my poor old laptop, the diff for buster ran at 100% cpu and 10°C hotter than the idle core for 20+ minutes and the final table generation took 38 minutes.

Note that the regular/default linux diff executable uses the classic Myers algorithm. This works if the changeset is very small but unfortunately a large changeset produced by this particular test suite will break the algorithm. When it was used to compare sorted lists from glibc 2.27 and glibc 2.28, the diff command ran at 100% CPU for 4 days before I gave up waiting. This was true with default flags and also with the “--speed-large-files” flag. An easy way to get access to a different algorithm was to leverage git. Its “histogram” algorithm completed the glibc 2.27 and 2.28 comparison in about 26 minutes. (Did you know git has multiple diff algorithms you can choose from?) There are a few other interesting tidbits tucked away in the comments of this github repo too… like careful coding in run.sh to support not only the latest unix operating systems but also stuff as old as bash v3.2.25 and perl v5.8.8.
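For anyone who wants to try the histogram algorithm on their own files, here is a minimal sketch of driving git from Python. The function names and file paths are hypothetical, and the scripts in the repo may invoke git differently; the only pieces taken from git itself are the `--no-index` and `--histogram` options of `git diff`.

```python
import subprocess


def histogram_diff_command(old_path, new_path):
    """Build a git command that diffs two plain files with the histogram algorithm."""
    # --no-index lets git compare two files that are not inside a repository.
    return ["git", "diff", "--no-index", "--histogram", old_path, new_path]


def run_histogram_diff(old_path, new_path, out_path):
    """Write the raw diff to out_path and return True if the files differ."""
    with open(out_path, "wb") as out:
        result = subprocess.run(histogram_diff_command(old_path, new_path), stdout=out)
    # With --no-index, git exits 0 for identical files and 1 when they differ.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"git diff failed with exit code {result.returncode}")
    return result.returncode == 1
```

Calling `run_histogram_diff("sorted-old.txt", "sorted-new.txt", "changes.diff")` would leave the raw diff in `changes.diff`; those three file names are placeholders.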

Reviewing the data, the en_US comparisons are accurate and very straightforward. It all starts with the same simple text file – 25 million strings – and it does unix “sort” of that file on the target OS before downloading it. Then we locally do a “diff” between the files. The raw output of the “diff” command is directly uploaded to github so that anyone can see the base data and check that summaries are accurate.

The script also downloads the OS locale data files from /usr/share/i18n/locales and does a recursive diff directly against the downloaded directories. The raw results of this diff are also uploaded to github. I was reviewing summary info in the README.md tables today and there might be a bug in the code that generates the summary? I’m not sure – but the underlying data and raw diff output is straightforward and available.

For anyone who’s interested in learning more about the background of this test suite, you can watch the recording of a presentation that Jeff Davis and I gave at 2024 PGConf.dev titled “Collations from A to Z” (putting words in order without losing your mind or your data).

And for those of you running Postgres on Docker… remember to pin the major version of your base image operating system!

Now if someone will please ping me a year from now to make me feel guilty about still not yet updating the tables that have ICU and RHEL collation changes

Challenges of Postgres Containers

Jeremy — Tue, 31 Dec 2024 10:52:17 +0000

Many enterprise workloads are being migrated from commercial databases like Oracle and SQL Server to Postgres, which brings anxiety and challenges for mature operational teams. Learning a new database like Postgres sounds intimidating. In practice, most of the concepts directly transfer from databases like SQL Server and Oracle. Transactions, SQL syntax, explain plans, connection management, redo (aka transaction/write-ahead logging), backup and recovery – all have direct parallels. The two biggest differences in Postgres are: (1) vacuum and (2) the whole “open source” and decentralized development paradigm… once you learn those, the rest is gravy. Get a commercial support contract if you need to, try out some training; there are several companies offering these. Re-kindle the curiosity that got us into databases originally, take your time learning day-by-day, connect with other Postgres people online where you can ask questions, and you’ll be fine!

Nonetheless: the anxiety is compounded when you’re learning two new things: both Postgres and containers. I pivoted to Postgres in 2017, and I’m learning containers now. (I know I’m 10 years late getting off the sidelines and into the containers game, but I was doing lots of other interesting things!)

Postgres was already one of the most-pulled images on Docker Hub back in 2019 (10M+) and unsurprisingly it continues to be among the most-pulled images today (1B+). Local development and testing with Postgres has never been easier. For many developers, docker run -e POSTGRES_PASSWORD=mysecret postgres has replaced installers and package managers and desktop GUIs in their local dev & test workflows.

With the widespread adoption of kubernetes, the maturing of its support for stateful workloads, and the growing availability of Postgres operators – containers are increasingly being used throughout the full lifecycle of the database. They aren’t just for dev & test: they’re for production too.

Containers will dominate the future of Postgres, if only because I bear the scars of managing 15-year-old servers where the package manager database never matched reality and there were 20 different copies of python and 30 different copies of java installed under various root and user directories.

But what exactly is a container? What is inside that thing? In fact, a lot more than I first thought. Six months ago I was convinced there’s no possible way glibc was in that container. You can’t just take a glibc from 2024 and run it on a kernel from 2016. Right?

$ docker run --interactive --tty debian:bookworm-slim

root@27234bdf966e:/# dpkg -l libc6
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version        Architecture Description
+++-==============-==============-============-=================================
ii  libc6:amd64    2.36-9+deb12u9 amd64        GNU C Library: Shared libraries

root@27234bdf966e:/# dpkg -l|grep "  lib"|wc -l
51

root@27234bdf966e:/# dpkg -l|grep -v "  lib"|wc -l
42

root@27234bdf966e:/# exit

Basically, there can be a whole operating system inside that container! (Minus the kernel.) In practice there is a range of “base operating systems” from hyper-slim alpine (a la busybox) to containers that run a full copy of systemd and provide a full operating system experience. Docker’s official Debian-based Postgres containers use the “slim” debian OS container as a base (88 packages and 74MB) and are customized with additional packages from PGDG and the Debian universe (total 146 packages and 434MB).

The glibc-to-kernel cross-compatibility is magic to me. It’s not by chance. Libraries like glibc are pretty tightly coupled to the kernel, and it’s an intentional effort by both linux kernel maintainers and glibc maintainers to maintain cross-compatibility. Like the intentional effort by Postgres maintainers to maintain ABI compatibility across Postgres “minor release” bugfix versions.

Combined with a good kubernetes operator, Postgres containers are production ready today.

But containers have a few rough edges. It’s important to know about them if you’re going to move toward production operations with Postgres containers.

Security and Isolation

Containers are secure enough for the vast majority of companies and use cases. The underlying technology is well maintained, new vulnerabilities are addressed promptly, fixes are made available quickly, and designs are thoroughly reviewed. Kernel isolation capabilities have been tested by world-class pen testers and red teams.

However there is a meaningful difference between kernel isolation and hardware VM-based isolation. The firecracker paper presented at Usenix 2020 is the best writeup that I’ve seen on the topic so far.

Firecracker: Lightweight virtualization for serverless applications. Amazon Science (2020).

Fundamentally it’s about attack surface area within the boundary between unprivileged and privileged execution. At the end of the day, a general-purpose operating system kernel’s syscall interface is composed of hundreds of critical functions with complex implementations. Virtual Machine Monitors (VMMs) and processor instruction sets are comparatively simpler with better-understood abstractions. Virtualization is not immune from attack – recent incidents like Meltdown and Spectre and other side-channel/speculative-execution attacks have proven the point – but reducing attack surface area is fundamental in very-high-security environments.

The vast majority of companies should not be disabling SMT on their processors or avoiding containers. There is sometimes a trade-off between security and cost/performance. Hyperscalers and SAAS companies have use cases where they have to opt for virtualization even when it’s less cost-effective.

Most readers here can deploy with traditional containers. Just understand the reasoning and be cognizant of the choice.

Host/Node Operating System Compatibility

You can’t just take a glibc from 2024 and run it on a kernel from 2016. Right?

The answer is actually a little more nuanced than you’d expect.

The terminology used by Scott McCarty on the Red Hat blog around 2019-2020 is portability, compatibility and supportability. Scott’s a product manager who is particularly concerned about commercial contracts between Red Hat and its customers. The term supportability is an explicit reference to “scope of what Red Hat fully commits to debug and fix, as part of what you are paying for”.

But I think the terms are helpful even for people who just want to run their business on containers and are not Red Hat customers.

The standardized file format of containers makes them highly portable across systems and software. OCI-compliant containers can be copied and understood by tooling anywhere, but that doesn’t mean they can run anywhere.

Compatibility is about where containers will run. Naturally, if you compile for ARM then it’s not going to run on x86. I don’t fully understand yet how compatibility works across operating systems (linux, mac and windows)… I think there has been some clever engineering in recent years to create more compatibility here than what used to exist.

Things seem to get more interesting across different versions of the linux kernel.

Internet forums are full of people pointing out that the Linux kernel APIs are decades old and change rarely so your containers will probably run fine on any Linux. But you also don’t need to go far to find examples of things that break, like centos:6 bash crashing on Ubuntu 18.04 or useradd failing when the host is upgraded to RHEL 7 (and continuing to work fine on RHEL 6).

Even if you don’t have any intentions of becoming a Red Hat customer, I think it’s informative to read their official container support policy and their official container compatibility matrix. In particular: their “workload-specific” guidelines for container compatibility:

Run as an unprivileged container (ie. don’t pass the --privileged flag)
Do not interact directly with kernel-version-specific data structures (ioctl, /proc, /sys, routing, iptables, nftables, eBPF, etc) or kernel-version-specific modules (KVM, OVS, SystemTap, etc.)

That’s good advice. It should mostly keep you out of trouble. Don’t forget it’s not just about your code, but also about your dependencies – even debian packages & binaries you pull into your container.

I think Postgres and most Postgres extensions should be fairly safe. They may not always strictly follow the rules above, but I think if any problems are found in core postgres (or a widely used extension) they’re likely to be taken seriously. The Postgres community generally tends to value portability & compatibility.

Red Hat generally recommends building containers with the same base OS major version as the host where they run. My own opinion is to stay “close”. Stick with major distro versions released within a few years of each other. Just my opinion – but I would probably look at distro release date over linux kernel version, given how aggressively kernels are sometimes patched by distros.

Container Versioning and Change Management

Lies. Containers don’t actually have versions. They have tags.

In Postgres, and for that matter any major linux distribution, if I ask for a specific version today – and then I ask for the same version next week – I will get the same bits. In fact, the official Debian Policy Manual section 3.2.2 codifies what I thought was common sense:

The part of the version number after the epoch must not be reused for a version of the package with different contents once the package has been accepted into the archive, even if the version of the package previously using that part of the version number is no longer present in any archive suites.

Containers don’t work like this at all. Practically every example on the internet makes it look like you can ask for a specific version of Postgres with docker run postgres:17.2 – but it turns out that 17.2 is just an arbitrary tag and not really a version number.

The docs are clear that it’s just a tag, but it’s all very confusing to newcomers – and there are dangers lurking here with Postgres.

The biggest danger is around the now-infamous glibc collation problems.

As early as 2017, a user of containerized Postgres 9.5 switched from tag 9.5 to tag 9.5-alpine and their data seemed to disappear. I suspect this was likely related to collation.

Debian v10/Buster was released in 2019 (with the big scary glibc change), and the docker community hit the brakes on updating their images due to the known problems. Finally in 2021 they caved in and added a bunch of complexity to their build scripts, in order to start building for two major debian versions at once. And thus was born the tags 10-stretch and 10-buster. The community instituted a policy of supporting only the two most recent major versions of debian (stable and oldstable). The “default” tags where no OS is specified (eg 17 or 17.2) change which major OS they are pointing to. This has resulted in a steady stream of problem reports, every time a new debian major is released.

Debian v11/Bullseye was released Aug 14, 2021. On Nov 10 a GitHub issue was opened from a user seeing incorrect sort order in Russian. Debian v12/Bookworm was released on June 10, 2023. On June 15 a GitHub issue was opened by a user getting collation version mismatch warnings, and the torture test scan indicates this jump (2.31 to 2.36) likely includes changes in the Oriya and Kurdish languages (in 2.32). I haven’t yet checked if ICU has changes in bookworm.

The takeaway is summarized well in the GitHub Issue:

It is possible to completely avoid surprise changes when deploying containers: image digests can be used instead of image tags. But I think in most cases, using the tags as described above is the best solution. The tag postgres:15-bookworm is locked onto the Debian and Postgres stable releases, so you’ll automatically get security and critical updates by using a tag like this. Just make sure to include the operating system part!

And remember that you can’t just switch the tag to a new Operating System version unless you want to risk corruption. If you want to be 100% safe then you need to logically pg_dump-and-load, or use logical replication to move your data to the new operating system container image, or set your default provider to the new pg17 builtin C collation and use linguistic collation at a query or table level when needed and rebuild /all/ dependent objects on OS changes.
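The “include the operating system part” advice is easy to check mechanically, for example in a deployment lint step. The sketch below is a hypothetical helper, not part of any Docker or Kubernetes tooling: it accepts an image reference if it is pinned by digest or if its tag names a Debian release. The codename list is an assumption to adapt for your own base images, and Alpine-style tags are deliberately not handled.

```python
# Hypothetical allow-list of Debian codenames; extend it for your own base images.
DEBIAN_CODENAMES = ("stretch", "buster", "bullseye", "bookworm", "trixie")


def image_pins_os(image_ref):
    """Return True if the reference is a digest or a tag that names a Debian release."""
    if "@sha256:" in image_ref:
        return True  # a digest identifies exact image contents
    # Only look at the last path component so a registry port is not mistaken for a tag.
    last_component = image_ref.rsplit("/", 1)[-1]
    _name, separator, tag = last_component.partition(":")
    if not separator:
        return False  # no tag at all means the floating "latest" tag
    return any(tag == name or tag.endswith("-" + name) for name in DEBIAN_CODENAMES)


for ref in ("postgres:15-bookworm", "postgres:17.2", "postgres"):
    print(ref, image_pins_os(ref))
```

A check like this only catches floating tags. It does not make an OS change safe; moving from one codename to another still needs the migration steps described above.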

Memory Management

https://github.com/kubernetes/kubernetes/issues/43916 has now been open for 7 years and has 141 comments. Still going strong this month.

A few folks referenced https://github.com/linchpiner/cgroup-memory-manager as a workaround, but I’m not sure whether I’d use this with Postgres… at present I think the safest option with postgres remains the request==limit configuration.
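Part of why request==limit is the conservative choice is the QoS class that Kubernetes derives from those settings. Below is a minimal sketch of the documented classification rules. The dict layout is a made-up stand-in for a pod spec, quantities are compared as raw strings (so "1Gi" and "1024Mi" would not match), and the kubelet's OOM score adjustments are not modelled.

```python
def qos_class(containers):
    """Classify a pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


postgres_pod = [{"requests": {"cpu": "2", "memory": "8Gi"},
                 "limits": {"cpu": "2", "memory": "8Gi"}}]
print(qos_class(postgres_pod))
```

Lowering only the memory request in `postgres_pod` would move the pod to Burstable, which is the trade-off being weighed in this section.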

An engineer from Bucharest named Mihai Albert wrote a very interesting blog post a few years ago that digs into detail on the behavior. I think his blog might be based on cgroups v1. I hadn’t seen it before, but it’s referenced from that GitHub issue. https://mihai-albert.com/2022/02/13/out-of-memory-oom-in-kubernetes-part-4-pod-evictions-oom-scenarios-and-flows-leading-to-them

I took my first swing at consolidating, organizing and writing what I learned back in September: Kubernetes Requests and Limits for Postgres … still a lot I don’t know or haven’t dug into yet.

Overall, I can’t quite tell whether there are any actual kubernetes code improvements on the horizon yet. We might still be in the “digging and discussing” stage.

Summary

Combined with a good kubernetes operator, Postgres containers are production ready today. I’m not sure whether I’d go learn kubernetes just to run Postgres but the reality is that kubernetes is already in use at many companies for application workloads. If kubernetes is already being deployed, then learning and leveraging it for Postgres makes sense.

2024 has been an exciting year for me and I’m very happy for the opportunity to begin really digging into Postgres containers. My four main concerns are outlined here, and they aren’t dampening my enthusiasm.

TLDR (I’m relatively new to containers, so hopefully I’m getting these right):

don’t be intimidated to learn postgres if you’re a DBA for another database – it’s easier than you think
set your database default collation to the new v17 builtin C collation, and use ICU at a table or query level in cases where you need linguistic collation. rebuild only those objects on OS changes.
include the OS in your container “version” tags when deploying postgres containers
deploy on a host/node with an OS major version released within a few years of your base container image (if the same OS major version, then all the better)
for now, stick with request==limit for kubernetes memory allocations.

Postgres containers solve problems that have haunted us for decades, and they are here to stay.

Kubernetes Requests and Limits for Postgres

Jeremy — Mon, 23 Sep 2024 01:02:39 +0000

As Joe Drumgoole said a few days ago: so many Postgres providers. Aiven, AWS, Azure, Crunchy, DigitalOcean, EDB, GCP, Heroku, Neon, Nile, Oracle, Supabase, Tembo, Timescale, Xata, Yugabyte… I’m sure there’s more I missed. And that’s not even the providers using Postgres underneath services they offer with a different focus than Postgres compatibility. (I noticed Qian Li’s upcoming PGConf NYC talk in 2 weeks… I have questions about DBOS!)

Kubernetes. I have a theory that more people are using kubernetes to run Postgres than we realize – even people on that list above. Neon’s architecture docs describe their sprinkling of k8s stardust (but not quite vanilla k8s; Neon did a little extra engineering here). There are hints around the internet suggesting some others on that list also found out about kubernetes.

And of course there are the Postgres operators. Crunchy and Zalando were ~~first~~ out of the gate in 2017. But not far behind, we had ongres and percona and kubegres and cloudnativepg.

Edit Nov 2: The first out of the gate was stolon in 2015. I missed it when I originally published this article.

We are database people. We are not actually a priesthood (the act is only for fun), but we are different. We are not like application people who can spin a compute container anywhere and everywhere without a care in the world. We have state. We are the arch enemies of the storage people. When the ceph team says they have finished their fio performance testing, we laugh and kick off the database benchmark and watch them panic as their storage crumbles under the immense beating of our IOPS and their caches utterly fail to predict our read/write patterns. (I jest. We don’t really surprise each other that much anymore, except for occasional harmless office pranks.)

But we all have at least one thing in common: none of us want to pay for a bunch of servers to sit around and do nothing, unless it’s really necessary. Since the dawn of time. From mainframes to PowerVM to VMware and now to kubernetes. We’re hooked on consolidating better and saving more money and kubernetes is the best drug yet.

In kubernetes, you manage consolidation with two things: requests and limits.

The Production Kubernetes book says this is a common question in the field.

Because we’re hooked, I’ve noticed that in kubernetes circles there are proponents of no limits. There are counterarguments too. But for modern distributed microservice applications there is an argument. Especially if pods can be rescheduled (ie. shut down on one node and started on a different node) in a way that’s non-disruptive as traffic is seamlessly rerouted.

Vanilla open source Postgres is not multi-master. Only one server is allowed to make changes to data. There is the open source BDR project which does active-active with logical replication (and requires careful setup, good understanding of tradeoffs, and much more hands-on operations) and there’s some interesting commercial development happening in this space – but today there is nothing that approaches the seamless experience we’d like for shutting down a Postgres node without some connections getting errors and/or rolled back transactions. On vanilla open source Postgres you can’t avoid at least a few seconds of unavailability during the failover. You can build your application to attempt reconnections and hide the errors; but that’s not trivial and it’s only useful for greenfield apps where you’re willing to do the extra dev work.

For this reason, I get a sense that most postgres+kubernetes people are definitely not in the “no limits” camp. I’m still asking around, but anecdotally I’ve heard the other extreme – the “request==limit” mentality – and leaving idle resources on the table is just the price we pay to avoid periodic unavailability and errors from a kubernetes scheduler decision to failover your production database (say, during peak business hours, because someone kicked off end-of-month reporting in a different production pod on the same node).

Can we do better? Can we do some oversubscription and find a way to reduce the risks?

Let’s just start with dev machines, where we can care a little bit less about a few seconds of unavailability, as long as k8s rescheduling isn’t too frequent.

CPU is fairly easy to reason about. But it’s harder to figure out how memory pressure will work, and how to size the buffer cache if we want to experiment with “request < limit”.

We need to solve memory because that will be the limiting factor for consolidation. If we set buffer cache and request size on the smaller side, and then put a bunch of developer DBs on a single machine hoping for burst, is the linux kernel very good about sensing demand and increasing memory for a pod to give it more memory for page cache when some developer is actively using their database? And will it release that memory under reduced load to avoid triggering rescheduling events?

There are two things that can cause a pod to be terminated for memory reasons: (1) the kubernetes scheduler and (2) the linux kernel OOM killer. Both are important. Based on my initial digging around, I think these notes are accurate:

k8s control plane scheduler uses “request” to schedule pods. it will put lots of pods on a node if requests are low.
linux kernel uses k8s “limit” setting to set cgroups limits.
on an interval of 10+ seconds, k8s control plane scheduler also has the ability to evict & reschedule based on a memory metric. from checking references (1) (2), looks like it’s using cgroups v2 memory.current which “includes page cache, in-kernel data structures such as inodes, and network buffers” and then it’s subtracting the size of the inactive page list. It assumes memory from inactive_file is reclaimable under pressure.
node memory is also managed by linux kernel OOM killer. once a bunch of pods are running, if they start using a lot of memory before the k8s scheduler takes action, then OOM killer can kick in.
important to also note that request==limit puts a pod in a different “QoS class” ensuring that other pods should be evicted first by the k8s scheduler. ~~i don’t think this influences OOM killer which could still be triggered by a rapid workload spike.~~ kubelet also adjusts OOM killer behavior based on QoS classes.
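The memory metric described in these notes (memory.current minus the inactive file list) can be sketched in a few lines. This is a rough illustration under assumptions: the function names are made up for this example, and the sample numbers are invented rather than measured.

```python
def parse_memory_stat(text):
    """Parse cgroup v2 memory.stat content ("key value" per line) into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(" ")
        if value:
            stats[key] = int(value)
    return stats


def working_set_bytes(memory_current, memory_stat_text):
    """memory.current minus inactive_file, floored at zero."""
    inactive_file = parse_memory_stat(memory_stat_text).get("inactive_file", 0)
    return max(0, memory_current - inactive_file)


# Invented sample: 2 GiB charged to the cgroup, 1 GiB of it inactive page cache.
SAMPLE_STAT = """anon 536870912
file 1610612736
inactive_file 1073741824
active_file 536870912
"""
print(working_set_bytes(2 * 1024**3, SAMPLE_STAT) // 1024**2, "MiB")
```

On a cgroups v2 system the inputs would typically come from the `memory.current` and `memory.stat` files under `/sys/fs/cgroup` for the cgroup in question. Note that the 512 MiB of active_file in the sample still counts toward the working set.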

Linux kernel memory management is tightly integrated with hardware capabilities. Modern processor memory management units (MMUs) automatically set bits in the page table entry (PTE) when pages are accessed or dirtied – and linux leverages these hardware features in its page eviction algorithm. If you’re not familiar with linux kernel memory management, you can start with the kernel docs at https://www.kernel.org/doc/gorman/html/understand/understand013.html (and if you’re new, LLMs are great for explaining a sentence or a word on pages like this)

Per that page: “The LRU in Linux consists of two lists called the active_list and inactive_list. The objective is for the active_list to contain the working set of all processes and the inactive_list to contain reclaim candidates.”

For a little more depth, this is good https://biriukov.dev/docs/page-cache/4-page-cache-eviction-and-page-reclaim/

That page describes how each cgroup has two sets of active/inactive lists. One active/inactive set for anonymous memory and a separate active/inactive set for the page cache. (It might be even a little more complicated; there might be processor-local lists?)
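A toy model can make the active/inactive idea more concrete. The sketch below is a deliberate simplification and not how the kernel is implemented: it assumes a page lands on the inactive list on first touch, is promoted to the active list on a second touch, and is reclaimed from the old end of the inactive list. The class name and capacity are invented for illustration.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy model: first touch -> inactive list, second touch -> active list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = OrderedDict()    # oldest entry first
        self.inactive = OrderedDict()  # oldest entry first

    def touch(self, page):
        if page in self.active:
            self.active.move_to_end(page)
        elif page in self.inactive:
            del self.inactive[page]
            self.active[page] = True   # promoted on second access
        else:
            self.inactive[page] = True
        while len(self.active) + len(self.inactive) > self.capacity:
            self._reclaim_one()

    def _reclaim_one(self):
        if not self.inactive:
            # Nothing cheap to drop: demote the oldest active page first.
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True
        self.inactive.popitem(last=False)  # evict the oldest inactive page


cache = TwoListCache(capacity=4)
for page in ["a", "a", "b", "b", "c", "d", "e", "f", "g"]:
    cache.touch(page)
print(list(cache.active), list(cache.inactive))  # ['a', 'b'] ['f', 'g']
```

In this model the twice-touched pages `a` and `b` survive a scan over `c` through `g`, which only churns the inactive list. That is the desirable side of the design, but it also hints at how a workload that re-reads the same blocks could pin a lot of memory on the active list.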

My hope is kindled. If the page cache “working set” is factored into cgroup memory management and k8s scheduler decisions, then there’s a chance that postgres doing a lot of file reads would be correctly interpreted by the linux kernel and k8s, allowing a cgroup to increase its memory usage (and page cache) up to the memory limit, and allowing the cgroup to release some of that memory if the system is idle.

However, based on my non-expert understanding of how the active/inactive lists work, I worry that Postgres might bias toward keeping pages on the active list and using more memory even under smaller loads. After all – deciding what the phrase “working set” means in database-land is full of nuance, and the linux memory management algorithm is relatively naive. So this might work for almost totally idle dev containers, but if people are using their databases then it might trigger k8s rescheduling more than we’d like – rather than just letting some dev pods do a little more IO. Or the opposite problem – not increasing memory available to the page cache even under heavy IO. I don’t know. And either way, I don’t think linux exposes any tuning knobs for this?

The journey continues. I still have a lot to learn and figure out here.

Edit 7:30pm: I’d read Joe Conway’s 2021 article about this topic before, but last time I read it I was mostly interested in the cgroup bits and not the kubernetes bits. There’s also a 2022 follow-up article. Honestly I completely forgot it was mainly about kubernetes before publishing this, and I was just reminded. That’s a bit embarrassing. Re-reading it now :)

Edit 9:00pm:

Joe’s 2021 article highlighted that OOM problems were frequently being seen a few years back. His article is also eye-opening about some kubernetes quirks I hadn’t stumbled on yet. I was a bit familiar with cgroups v1 and v2, linux namespaces, overlaying filesystems, and some other underlying concepts – but I started directly learning kubernetes somewhat recently.

A few follow-up things I wonder after reading Joe’s article:

I’m aware there are folks in the Postgres community who think overcommit should be disabled, but I lean toward disagreeing with this approach (and I might argue whether there’s consensus on the idea)
- HOWEVER – “Kubernetes actively sets vm.overcommit_memory=1” worries me. I agree with Joe that this promiscuous overcommit doesn’t seem right.
“an OOM kill can happen even when the host node does not have any memory pressure. When the memory usage of a cgroup (pod) exceeds its memory limit, the OOM killer will reap one or more processes in the cgroup.”
- I think that linux will still try to evict candidate pages (like dropping clean pagecache pages) before invoking OOM killer? Isn’t the issue just that it will NOT error out a malloc() call with “out of memory”? Erroring out malloc() is absolutely the behavior we want (and I think we can still get it with vm.overcommit_memory=0 under heavy pressure). That causes a single query to fail rather than crashing the whole database.
Running without swap seems like a not-great-idea to me too – and I didn’t realize back in 2021 that was the only option for kubernetes. It sounds like they added swap support but now I need to go figure out current state and whether this is enabled by default.
It’s interesting to me that back in 2021 Joe recommended choosing between “request==limit” (with 2x over-provisioning of pod memory) or “no limit” (with 2x over-provisioning of node memory). The amount of suggested memory over-provisioning makes me
He also didn’t directly address how to set the buffer_cache for “request<>limit” cases. Off the top of my head, I think something like 50% of “request” might be a reasonable starting point.
I think Joe’s blog is based on cgroups v1 and I will need to review the changes in cgroups v2 to see how much it changes the picture specifically on memory management.
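The "50% of request" starting point is simple arithmetic, sketched below. This assumes the buffer cache setting in question is Postgres `shared_buffers`; the function name, the dictionary keys and the example sizes are hypothetical, and the 50% fraction is only the off-the-cuff guess from the text, not a tested recommendation.

```python
def dev_pod_sizing(request_mib, limit_mib, buffer_fraction=0.5):
    """Split a pod's memory request between shared_buffers and everything else."""
    if limit_mib < request_mib:
        raise ValueError("limit must be >= request")
    shared_buffers_mib = int(request_mib * buffer_fraction)
    return {
        "shared_buffers": f"{shared_buffers_mib}MB",
        # Left inside the request for backends, work_mem, and page cache.
        "left_in_request_mib": request_mib - shared_buffers_mib,
        # Memory the pod may burst into, if the node has it to spare.
        "burst_headroom_mib": limit_mib - request_mib,
    }


print(dev_pod_sizing(2048, 4096))
```

For a 2 GiB request with a 4 GiB limit this gives `shared_buffers = 1024MB`, another 1 GiB inside the request, and 2 GiB of headroom that is only available when the node is not under pressure.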

This also led me to re-read the kubernetes “Node-pressure Eviction” doc page more thoroughly (it was also linked above). A few key notes:

Pod selection for kubelet eviction – Pod Priority is a major factor in eviction, so maybe we can use this to make sure burstable database pods are evicted after burstable non-database pods.
Node out of memory behavior – explains exactly how QoS class changes OOM behavior. Joe’s 2021 article also gives more in-depth explanation about this.
Schedulable resources and eviction policies – a way to tell kubernetes scheduler to get involved sooner, which may help reduce OOM risks
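The way QoS class feeds into OOM killer behavior can be sketched as follows, based on the `oom_score_adj` values given in the "Node out of memory behavior" section of that doc page. The QoS classification here is simplified to a single container, and the function names are invented for this example.

```python
def qos_class(requests, limits):
    """Simplified single-container QoS classification (cpu and memory only)."""
    if not requests and not limits:
        return "BestEffort"
    for resource in ("cpu", "memory"):
        limit = limits.get(resource)
        # An omitted request defaults to the limit.
        request = requests.get(resource, limit)
        if limit is None or request != limit:
            return "Burstable"
    return "Guaranteed"


def oom_score_adj(qos, memory_request_bytes=0, node_memory_bytes=1):
    """oom_score_adj per QoS class; a higher value is killed sooner."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    adj = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, adj), 999)


GIB = 1024 ** 3
print(qos_class({"memory": "2Gi"}, {"memory": "4Gi"}))  # Burstable
print(oom_score_adj("Burstable", 4 * GIB, 16 * GIB))    # 750
```

One consequence for burstable database pods: the smaller the memory request relative to the node, the higher the score, so setting requests low in order to pack more dev databases onto a node also makes each of them a more likely OOM victim.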

But the most significant thing is the very last section of the page.

Known issues: active_file memory is not considered as available memory

The active_file statistic is the cgroup equivalent of Active(File) in meminfo, which is entirely page cache. This agrees with what I said above: “Postgres might bias toward keeping pages on the active list” and heavy I/O could aggressively trigger kubernetes rescheduling events. This doc suggests “request==limit”. We could still really use that kernel enhancement Joe mentioned – so that when someone runs a query and explicitly tells it to do a big in-memory sort, Postgres can error the query (via malloc error) instead of crashing the node.
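To see why this known issue matters, here is a small sketch of the `memory.available` eviction signal as that doc page describes it: capacity minus working set, where working set is usage minus inactive_file. It assumes the default hard eviction threshold of 100Mi; the node size and usage numbers are invented.

```python
MIB = 1024 ** 2


def memory_available(capacity, usage, inactive_file):
    """Capacity minus working set; only inactive_file is treated as reclaimable."""
    working_set = max(0, usage - inactive_file)
    return capacity - working_set


def under_memory_pressure(capacity, usage, inactive_file, threshold=100 * MIB):
    """True when memory.available drops below the eviction threshold."""
    return memory_available(capacity, usage, inactive_file) < threshold


# Invented node: 8192 MiB total, 8150 MiB in use, 6000 MiB of that is page cache.
capacity, usage, page_cache = 8192 * MIB, 8150 * MIB, 6000 * MIB

# Same page cache, different list placement:
print(under_memory_pressure(capacity, usage, inactive_file=page_cache))  # False
print(under_memory_pressure(capacity, usage, inactive_file=0))           # True
```

In both cases the kernel could drop clean page cache, but once those pages sit on the active list the signal reports only 42 MiB available and eviction can start.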

The journey continues. I still have a lot to learn and figure out here. So far this was mostly reading and reviewing past work… next should be some data collection.
