How Bitbucket Reads Your Own Writes Without Hammering the Master — LSN Tracking, Replica Lag, and a Redis Map

TL;DR

On a single database, reading what you just wrote is free. Split writes to a master and reads to replicas, and replication lag breaks it: your read can hit a replica that hasn't received your write yet. Sending all reads to the master fixes consistency but throws away the whole point of having replicas. Atlassian's Bitbucket team solved it with a beautifully cheap trick: when a user writes, capture the master's WAL LSN (a monotonically increasing position number) and stash {user → lsn} in Redis. On the next read, a Django middleware asks each replica "what LSN have you replayed?" and routes the request only to a replica that has caught up past the user's write — falling back to master otherwise. Cost: ~10 ms of overhead per request. Payoff: ~50% fewer reads on the master. This is read-your-writes consistency, implemented for real.

Tags: Consistency · Read-Your-Writes · PostgreSQL · Replication · WAL · LSN · Scaling · Reading time: ~18 min · Category: System Design → Architecture Decisions

Almost everyone can define read-your-writes consistency. Far fewer can tell you how to actually build it on a master/replica cluster without melting the master. This post is about the how — dissected straight out of a real Atlassian engineering write-up on how Bitbucket scaled PostgreSQL.

We'll do three things, in order. First, pin down precisely what read-your-writes consistency is (and what it is not — it is not strong consistency, and the difference matters). Second, understand the one prerequisite that makes the solution obvious: how PostgreSQL replication actually works. Third, walk through Bitbucket's implementation line by line — the LSN trick, the Redis store, the middleware, the exact SQL, and what it cost them.

The source. This is a dissection of Atlassian's own blog post about scaling the Bitbucket database. Everything here mirrors the architecture they describe; the goal is to make the mechanism concrete enough that you could build it yourself.

The setup — a very ordinary architecture under a very heavy load

Bitbucket is a code-hosting and review tool — think GitHub. We don't care about its internals; we care about its database tier, which looks like almost every read-heavy service on earth:

One master PostgreSQL instance takes all writes (and the critical reads that must be perfectly fresh).
A pool of replicas takes reads that can tolerate slight staleness.
Two Django services — the public REST API and the Bitbucket web UI — each handle millions of requests per hour.
A connection pool sits in front of Postgres, because Postgres forks a whole OS process per connection (not a thread), which is far too expensive to do per request.

The load multiplier. A single user request fans out into 10+ database queries. Multiply by millions of requests per hour and the database tier — especially the master — becomes the scalability bottleneck for the whole product. Every read you can safely keep off the master is real money.

So the replicas exist precisely to absorb that read load. The problem is that the moment you route reads to replicas, you risk violating the one guarantee a code-review tool cannot break: showing a user stale data immediately after they changed something.

What read-your-writes actually means

Let's be precise, because two different guarantees get conflated constantly.

Strong consistency: once your write is committed, any reader — you or anyone else — sees the new value. No one ever observes the old one again.

Read-your-writes consistency: a deliberately relaxed version. The only promise is to you: if you wrote it, and then you read it, you get your fresh value. If some other user reads the same key a moment later and briefly sees the old value (or no value yet), that's acceptable.

Strong consistency is a promise to everyone. Read-your-writes is a promise only to the person who made the change — and that smaller promise is dramatically cheaper to keep.

For Bitbucket, read-your-writes is exactly the right target. If you edit an issue or push a change and immediately reload, you must see your edit. A teammate seeing it 50 ms later is fine. That gap is the entire budget the solution gets to play with.

Why a master/replica split breaks it

On a single node, read-your-writes is automatic: writes and reads hit the same data, so a read always sees the latest write. There is nothing to solve.

Add replicas and the trouble appears. Imagine all reads and writes for one key — say a single Jira/Bitbucket issue ID:

The write lands on the master at t=0. The user's follow-up read is routed to a replica at t=8ms. But replication is fast, not instantaneous — the replica doesn't actually have the write until t=30ms. In that window the read returns stale data, and read-your-writes is violated. That window is replication lag, and it is the enemy.

From this we can distill the rule we actually need to enforce:

The rule. Reads before a user's write may freely go to any replica. But every read after that user's write must go to a node that already has the write — either the master (always current) or a replica that has caught up to it. Bitbucket chose the second option, because the master was already the bottleneck.

The one prerequisite: how PostgreSQL replication works

The solution is counterintuitive until you understand replication, and then it's almost obvious. Here is the whole mechanism.

When a write hits the master, two things happen: the master applies the change to its own copy of the data, and it appends a record describing that change to its Write-Ahead Log (WAL) — an append-only file. Each WAL record is stamped with a Log Sequence Number (LSN): a monotonically increasing position in the log. Entry after entry, the LSN only ever goes up.

Replication is pull-based: each replica continuously streams the WAL from the master and replays those records, in order, onto its own copy of the data. Because replay is strictly in LSN order, a replica is always caught up to some LSN and no further — it has applied everything up to, say, LSN 102, and nothing after.

And there's the whole insight, sitting in plain sight. Because LSNs are monotonic, a single comparison answers "does this replica have my write?":

The comparison that solves everything. If your write produced LSN 129, and a replica reports it has replayed up to LSN 102, then 102 < 129 — that replica definitely does not have your write. A replica reporting 129 or higher definitely does. No guessing, no timestamps, no clocks — just two monotonically increasing integers.

Bitbucket's implementation

Now the solution writes itself. Two halves: capture the LSN on write, consult it on read.

On write — remember where the user landed

When a user commits a write, the change is applied on the master and a WAL entry is created. A Django middleware then asks Postgres for the LSN of that write and stores it, keyed by the user:

The store is just Redis holding { atlassian_user_id → last_write_lsn }. That's the entire write-side state: for each user, the LSN of their most recent write. (Newer writes only push the LSN up, since it's monotonic, so a simple overwrite is correct.)

On read — pick a replica that has caught up

At the start of every request, after authentication (so the user ID is known), the middleware decides which connection this request's reads will use:

Fetch the user's last-write LSN from Redis. (No entry? The user hasn't written recently — any replica is fine.)
Ask each replica for its current replay LSN.
Keep the replicas whose replay LSN ≥ the user's LSN — these provably contain the user's write.
Pick one of those caught-up replicas (at random, to spread load) as this request's read replica. The master connection is always available as the fallback.

The request then proceeds normally: its read queries go to the chosen caught-up replica, its writes go to the master. The user is guaranteed to see their own writes — without those reads ever touching the master.

The actual SQL

Two pieces of information drive the whole thing, and Postgres exposes both directly.

On the master — "what's the current LSN?" This is the position you store after a user's write:

-- Current WAL write position on the primary (PostgreSQL 10+)
SELECT pg_current_wal_lsn();
--      pg_current_wal_lsn
--      --------------------
--      0/16B3748        <- an LSN, stored per-user in Redis

On each replica — "how far have you replayed?" You don't want a raw integer subtraction on LSNs (they're a structured 64-bit value, not a plain counter), so Postgres ships a helper, pg_wal_lsn_diff(), to compute the byte distance between two LSNs safely:

-- Run on a replica: is it caught up to the user's write LSN?
-- Positive result => replica's replay position is at or beyond the user's LSN.
SELECT pg_wal_lsn_diff(
         pg_last_wal_replay_lsn(),   -- how far this replica has replayed
         :user_lsn                   -- the LSN we stored for this user
       ) >= 0 AS caught_up;
--      caught_up
--      -----------
--      t          <- safe to route this user's reads here

That boolean — caught_up — is the entire routing primitive. Run it against each replica, keep the ones that return true, pick one.

A naming footnote. If you read the original Atlassian post you'll see the older function names: pg_current_xlog_location(), pg_last_xlog_replay_location(), and pg_xlog_location_diff(). PostgreSQL 10 renamed everything from xlog → wal (to stop people confusing the write-ahead log with regular logs). Same functions, newer spelling — use the wal versions on any modern Postgres.

Sketching the middleware logic around those queries:

# Django middleware — runs at the start of each request, post-auth.
def choose_read_connection(user_id):
    user_lsn = redis.get(f"lsn:{user_id}")        # 1 Redis read
    if user_lsn is None:
        return random.choice(replicas)            # no recent write: any replica

    caught_up = []
    for r in replicas:                            # 1 cheap query per replica
        if r.query("SELECT pg_wal_lsn_diff("
                   "pg_last_wal_replay_lsn(), %s) >= 0", [user_lsn]):
            caught_up.append(r)

    return random.choice(caught_up) if caught_up else master  # fall back to master

How expensive is this, really?

It looks heavy — an extra Redis call and a query against every replica on each request. The surprise is how cheap it actually is.

Per request you pay: one Redis read (sub-millisecond) to fetch the user's LSN, plus one tiny query per replica to read its replay position. Those replica checks are trivial (a single function call, no table access) and can be fired in parallel. With a handful of replicas at ~1–2 ms each, Bitbucket measured the total added latency at roughly 10 ms per request — an overhead Atlassian found entirely tolerable.

And the payoff was large. By routing the vast majority of reads to caught-up replicas — with only the genuinely un-replicatable reads and all writes going to the master — Bitbucket cut requests to the primary by about 50%. The master got dramatically more headroom: more stable, more performant, and a candidate to scale down if they chose, all while the replicas finally earned their keep.

Why this is the elegant version. Plenty of read-your-writes schemes exist — sticky sessions, "read from master for N seconds after a write," version tokens passed to the client. Most either pin too much traffic to the master or need cooperation from the client. The LSN approach needs neither: it asks the database the one question that actually matters — "have you replayed past my write?" — and lets a cheap monotonic comparison make the call.

The lesson

Read-your-writes consistency isn't a property you turn on — it's a routing decision you make on every request, and the database already knows the answer.

The trick that makes it cheap is the WAL's log sequence number: a single monotonically increasing integer that lets you ask any replica "are you caught up to my last write?" and get a definitive yes or no. Store each user's write LSN in Redis, compare against each replica's replay LSN, route the reads to a replica that has caught up, and you get the guarantee that matters — you always see your writes — while keeping the firehose of reads off your master. Ten milliseconds of overhead, half the load gone. That's a trade most systems should be glad to make.