Almost everyone can define read-your-writes consistency. Far fewer can tell you how to actually build it on a master/replica cluster without melting the master. This post is about the how — dissected straight out of a real Atlassian engineering write-up on how Bitbucket scaled PostgreSQL.
We'll do three things, in order. First, pin down precisely what read-your-writes consistency is (and what it is not — it is not strong consistency, and the difference matters). Second, understand the one prerequisite that makes the solution obvious: how PostgreSQL replication actually works. Third, walk through Bitbucket's implementation line by line — the LSN trick, the Redis store, the middleware, the exact SQL, and what it cost them.
The setup — a very ordinary architecture under a very heavy load
Bitbucket is a code-hosting and review tool — think GitHub. We don't care about its internals; we care about its database tier, which looks like almost every read-heavy service on earth:
- One master PostgreSQL instance takes all writes (and the critical reads that must be perfectly fresh).
- A pool of replicas takes reads that can tolerate slight staleness.
- Two Django services — the public REST API and the Bitbucket web UI — each handle millions of requests per hour.
- A connection pool sits in front of Postgres, because Postgres forks a whole OS process per connection (not a thread), which is far too expensive to do per request.
So the replicas exist precisely to absorb that read load. The problem is that the moment you route reads to replicas, you risk violating the one guarantee a code-review tool cannot break: showing a user stale data immediately after they changed something.
What read-your-writes actually means
Let's be precise, because two different guarantees get conflated constantly.
Strong consistency: once your write is committed, any reader — you or anyone else — sees the new value. No one ever observes the old one again.
Read-your-writes consistency: a deliberately relaxed version. The only promise is to you: if you wrote it, and then you read it, you get your fresh value. If some other user reads the same key a moment later and briefly sees the old value (or no value yet), that's acceptable.
For Bitbucket, read-your-writes is exactly the right target. If you edit an issue or push a change and immediately reload, you must see your edit. A teammate seeing it 50 ms later is fine. That gap is the entire budget the solution gets to play with.
Why a master/replica split breaks it
On a single node, read-your-writes is automatic: writes and reads hit the same data, so a read always sees the latest write. There is nothing to solve.
Add replicas and the trouble appears. Imagine all reads and writes for one key — say a single Jira/Bitbucket issue ID:
The write lands on the master at t=0. The user's follow-up read is routed to a replica at t=8ms. But replication is fast, not instantaneous — the replica doesn't actually have the write until t=30ms. In that window the read returns stale data, and read-your-writes is violated. That window is replication lag, and it is the enemy.
From this we can distill the rule we actually need to enforce:
The one prerequisite: how PostgreSQL replication works
The solution is counterintuitive until you understand replication, and then it's almost obvious. Here is the whole mechanism.
When a write hits the master, two things happen: the master applies the change to its own copy of the data, and it appends a record describing that change to its Write-Ahead Log (WAL) — an append-only file. Each WAL record is stamped with a Log Sequence Number (LSN): a monotonically increasing position in the log. Entry after entry, the LSN only ever goes up.
Replication is pull-based: each replica continuously streams the WAL from the master and replays those records, in order, onto its own copy of the data. Because replay is strictly in LSN order, a replica is always caught up to some LSN and no further — it has applied everything up to, say, LSN 102, and nothing after.
And there's the whole insight, sitting in plain sight. Because LSNs are monotonic, a single comparison answers "does this replica have my write?":
102 < 129 — that replica definitely does not have your write. A replica reporting 129 or higher definitely does. No guessing, no timestamps, no clocks — just two monotonically increasing integers.Bitbucket's implementation
Now the solution writes itself. Two halves: capture the LSN on write, consult it on read.
On write — remember where the user landed
When a user commits a write, the change is applied on the master and a WAL entry is created. A Django middleware then asks Postgres for the LSN of that write and stores it, keyed by the user:
The store is just Redis holding { atlassian_user_id → last_write_lsn }. That's the entire write-side state: for each user, the LSN of their most recent write. (Newer writes only push the LSN up, since it's monotonic, so a simple overwrite is correct.)
On read — pick a replica that has caught up
At the start of every request, after authentication (so the user ID is known), the middleware decides which connection this request's reads will use:
- Fetch the user's last-write LSN from Redis. (No entry? The user hasn't written recently — any replica is fine.)
- Ask each replica for its current replay LSN.
- Keep the replicas whose replay LSN
≥the user's LSN — these provably contain the user's write. - Pick one of those caught-up replicas (at random, to spread load) as this request's read replica. The master connection is always available as the fallback.
The request then proceeds normally: its read queries go to the chosen caught-up replica, its writes go to the master. The user is guaranteed to see their own writes — without those reads ever touching the master.
The actual SQL
Two pieces of information drive the whole thing, and Postgres exposes both directly.
On the master — "what's the current LSN?" This is the position you store after a user's write:
-- Current WAL write position on the primary (PostgreSQL 10+)
SELECT pg_current_wal_lsn();
-- pg_current_wal_lsn
-- --------------------
-- 0/16B3748 <- an LSN, stored per-user in Redis
On each replica — "how far have you replayed?" You don't want a raw integer subtraction on LSNs (they're a structured 64-bit value, not a plain counter), so Postgres ships a helper, pg_wal_lsn_diff(), to compute the byte distance between two LSNs safely:
-- Run on a replica: is it caught up to the user's write LSN?
-- Positive result => replica's replay position is at or beyond the user's LSN.
SELECT pg_wal_lsn_diff(
pg_last_wal_replay_lsn(), -- how far this replica has replayed
:user_lsn -- the LSN we stored for this user
) >= 0 AS caught_up;
-- caught_up
-- -----------
-- t <- safe to route this user's reads here
That boolean — caught_up — is the entire routing primitive. Run it against each replica, keep the ones that return true, pick one.
pg_current_xlog_location(), pg_last_xlog_replay_location(), and pg_xlog_location_diff(). PostgreSQL 10 renamed everything from xlog → wal (to stop people confusing the write-ahead log with regular logs). Same functions, newer spelling — use the wal versions on any modern Postgres.Sketching the middleware logic around those queries:
# Django middleware — runs at the start of each request, post-auth.
def choose_read_connection(user_id):
user_lsn = redis.get(f"lsn:{user_id}") # 1 Redis read
if user_lsn is None:
return random.choice(replicas) # no recent write: any replica
caught_up = []
for r in replicas: # 1 cheap query per replica
if r.query("SELECT pg_wal_lsn_diff("
"pg_last_wal_replay_lsn(), %s) >= 0", [user_lsn]):
caught_up.append(r)
return random.choice(caught_up) if caught_up else master # fall back to master
How expensive is this, really?
It looks heavy — an extra Redis call and a query against every replica on each request. The surprise is how cheap it actually is.
Per request you pay: one Redis read (sub-millisecond) to fetch the user's LSN, plus one tiny query per replica to read its replay position. Those replica checks are trivial (a single function call, no table access) and can be fired in parallel. With a handful of replicas at ~1–2 ms each, Bitbucket measured the total added latency at roughly 10 ms per request — an overhead Atlassian found entirely tolerable.
And the payoff was large. By routing the vast majority of reads to caught-up replicas — with only the genuinely un-replicatable reads and all writes going to the master — Bitbucket cut requests to the primary by about 50%. The master got dramatically more headroom: more stable, more performant, and a candidate to scale down if they chose, all while the replicas finally earned their keep.
The lesson
Read-your-writes consistency isn't a property you turn on — it's a routing decision you make on every request, and the database already knows the answer.
The trick that makes it cheap is the WAL's log sequence number: a single monotonically increasing integer that lets you ask any replica "are you caught up to my last write?" and get a definitive yes or no. Store each user's write LSN in Redis, compare against each replica's replay LSN, route the reads to a replica that has caught up, and you get the guarantee that matters — you always see your writes — while keeping the firehose of reads off your master. Ten milliseconds of overhead, half the load gone. That's a trade most systems should be glad to make.