Consistency Models: Why Your Database Lies to You (And When That’s Fine)
How Amazon's haunted shopping cart bug taught us that 'eventually consistent' isn't broken—it's a feature
The Day Amazon’s Shopping Cart Became Haunted
In 2007, Amazon engineers discovered something disturbing: deleted items were coming back from the dead.
A customer would remove an item from their cart, continue shopping, then check out—only to find the “deleted” item had reappeared and been charged to their card. Support tickets flooded in. The bug wasn’t random—it happened during network partitions between Dynamo replicas.
Here’s what was happening: when a customer deleted an item, the write went to Replica A. Replica B, temporarily partitioned, never got the message. When the network healed, Replica B’s version—which still contained the item—merged with Replica A’s version. No clear “winner,” so the item resurrected.
Amazon’s response? They didn’t fix the bug. They redesigned the entire system to embrace it.
The result was Dynamo, the database that pioneered eventual consistency at scale, introduced vector clocks for conflict resolution, and changed how we think about distributed data.
This is where most engineers’ understanding stops: “CAP says you can’t have consistency and availability during partitions.” True. But that binary misses the entire spectrum of how consistent your system actually needs to be.
The Consistency Spectrum Nobody Explains
CAP doesn’t say your data is either “perfectly consistent” or “complete chaos.” Between those extremes lies a gradient as wide as the difference between a bank transfer and a Twitter like.
The full spectrum:
Strong ←────────────────────────────────────────────────→ Weak
     │             │            │           │          │
Linearizable   Sequential    Causal      Session   Eventual

High latency                                    Low latency
Expensive                                       Cheap
Simple reasoning                                Complex bugs
Every distributed database picks a point on this line. Your job as an engineer: know which point your database picked, understand the trade-offs, and never promise guarantees your system doesn't actually provide.
The Four Models That Matter
1. Linearizability (Strongest)
Definition: Every operation appears to execute atomically at some point between its start and completion. All clients see operations in the same real-time order.
Think of it as: A single-threaded system, no matter how distributed it actually is.
Code behavior:
// Client A writes
await db.write('x', 5); // completes at time T1
// Client B reads immediately after
const val = await db.read('x'); // guaranteed to see 5
console.log(val); // always prints 5, never 0 or stale value
Real systems: Google Spanner (via TrueTime, backed by GPS and atomic clocks), etcd, ZooKeeper
Cost: High latency. Every write needs global coordination. Cross-region writes can take 100ms+.
When to use: Financial transactions, inventory management, anything where “approximately right” means “catastrophically wrong.”
2. Causal Consistency (The Pragmatic Middle)
Definition: If operation A causally affects operation B (e.g., B reads A’s write), all nodes see them in that order. Independent operations can appear in any order.
Think of it as: Preserving the story’s plot, but letting unrelated subplots unfold in parallel.
Code behavior:
// Thread 1: Post a tweet
await db.write('tweet:123', 'Hello world');
await db.write('tweet:123:likes', 0); // causally dependent
// Thread 2: Read elsewhere
const tweet = await db.read('tweet:123'); // may be stale
const likes = await db.read('tweet:123:likes');
// Guarantee: if likes exists, tweet must also be visible
// No guarantee: how fresh either value is
Real systems: MongoDB with causally consistent client sessions, Azure Cosmos DB at session consistency (which preserves causal order within a session); full cluster-wide causal consistency mostly lives in research systems like COPS.
Cost: Medium latency. Requires tracking causal dependencies (vector clocks, version vectors) but no global locks.
When to use: Social networks, collaborative apps, anywhere causality matters but exact timing doesn’t.
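To make "tracking causal dependencies" concrete, here's a minimal sketch of how a vector clock tells "A happened before B" apart from "A and B are concurrent." This is illustrative JavaScript, not any particular database's implementation:

// Each version carries a counter per node, e.g. { A: 3, B: 1 }.
// Clock a "happened before" clock b if every entry of a is <= the
// matching entry of b, and at least one is strictly smaller.
function happenedBefore(a, b) {
  const nodes = new Set([...Object.keys(a), ...Object.keys(b)]);
  let strictlyLess = false;
  for (const n of nodes) {
    const av = a[n] || 0, bv = b[n] || 0;
    if (av > bv) return false;     // a saw something b never saw
    if (av < bv) strictlyLess = true;
  }
  return strictlyLess;
}

function compare(a, b) {
  if (happenedBefore(a, b)) return 'a -> b'; // causal order: apply b after a
  if (happenedBefore(b, a)) return 'b -> a';
  return 'concurrent'; // neither saw the other: a genuine conflict
}

compare({ A: 1 }, { A: 1, B: 1 }); // 'a -> b'
compare({ A: 2 }, { A: 1, B: 1 }); // 'concurrent': needs conflict resolution

The 'concurrent' case is exactly the haunted-cart situation from the intro: two versions with no causal winner, which the system must merge rather than silently overwrite.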
3. Sequential Consistency
Definition: All clients agree on a single total order of operations, and that order respects each client's program order. What it doesn't promise is that the agreed order matches real time.
Think of it as: Everyone reads the same history book, but the book's chronology may lag behind when events actually happened.
Code behavior:
// Client A
await db.write('x', 1);
await db.write('x', 2);
// Every reader sees x=1 before x=2: program order is preserved.
// All readers also agree on one interleaving of all clients' writes.
// What sequential consistency does NOT promise: that the agreed order
// tracks real time. A read issued after A's second write completes
// may still return x=1.
Real systems: Some distributed SQL databases, older MongoDB replica sets
Cost: Medium-low latency. Cheaper than linearizability but still requires some coordination.
When to use: Analytics systems, read-heavy workloads where per-user consistency matters but cross-user doesn’t.
4. Eventual Consistency (Weakest, Fastest)
Definition: If no new updates occur, eventually all replicas converge to the same value. Before then: anything goes.
Think of it as: “Trust me, it’ll make sense... eventually.”
Code behavior:
// Write to DynamoDB (AWS SDK v2 DocumentClient)
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

await docClient.put({
  TableName: 'Users',
  Item: { userId: '123', name: 'Alice' }
}).promise();

// Read immediately, possibly from a different replica
const user = await docClient.get({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // eventual consistency (the default)
}).promise();

console.log(user.Item?.name);
// Might print: undefined (write hasn't propagated to this replica)
// Might print: "Bob" (a previous value from a lagging replica)
// Eventually prints: "Alice"
Real systems: DynamoDB (default), Cassandra at consistency level ONE, most CDNs
Cost: Lowest latency. Writes return immediately, reads hit local cache.
When to use: View counts, likes, analytics dashboards, CDN content—anywhere staleness is annoying but not damaging.
The Hybrid Models You’ll Actually Use
Real systems rarely pick one model globally. Instead, they offer tunable consistency per operation.
DynamoDB: Eventual by Default, Strong on Demand
// Fast but potentially stale (docClient from the earlier example)
const staleUser = await docClient.get({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: false // default: eventually consistent
}).promise();

// Slower but guaranteed fresh
const freshUser = await docClient.get({
  TableName: 'Users',
  Key: { userId: '123' },
  ConsistentRead: true // strongly consistent read served by the leader
}).promise();
For details, see the AWS documentation on DynamoDB read consistency.
Cassandra: Choose Per Query
const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'app'
});

// Write to a majority of replicas (CP-leaning)
await client.execute(
  'INSERT INTO users (id, name) VALUES (?, ?)',
  [123, 'Alice'],
  { prepare: true, consistency: cassandra.types.consistencies.quorum }
);

// Read from any single replica (AP-leaning)
const result = await client.execute(
  'SELECT name FROM users WHERE id = ?',
  [123],
  { prepare: true, consistency: cassandra.types.consistencies.one }
);
Pattern: Strong writes for critical data, eventual reads for performance.
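One caveat worth making explicit about the pattern above: QUORUM writes paired with ONE reads do not guarantee that a read sees the latest write. For that, the read and write replica sets must overlap, which requires R + W > N. A quick sanity check, assuming a replication factor of 3:

// N = replication factor, W = write consistency, R = read consistency
const N = 3;
const quorum = Math.floor(N / 2) + 1; // 2 when N = 3

// QUORUM write + ONE read: no guaranteed overlap
console.log(1 + quorum > N); // false: a ONE read can land on the stale replica

// QUORUM write + QUORUM read: read and write sets always share a node
console.log(quorum + quorum > N); // true

So the pattern trades read freshness for latency deliberately; if a particular query needs read-your-writes, bump that read up to QUORUM.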
Session Consistency: The Mobile App Sweet Spot
Definition: Within a single user session, you see your own writes and causally-related operations. Across sessions: no guarantees.
Why it matters: Users expect their own actions to be reflected immediately. They don’t care if other users see stale data for a few seconds.
// Mobile app example
async function updateProfile(userId, newBio, session) {
  // Write with the caller's session token
  await db.write(
    `user:${userId}:bio`,
    newBio,
    { sessionToken: session }
  );
  // Immediate read in the same session: guaranteed to see the new bio
  const profile = await db.read(
    `user:${userId}:bio`,
    { sessionToken: session }
  );
  return profile; // always returns newBio
}
// Different user reads same profile - might see old bio
// But that’s fine, they’ll see the update within seconds
Systems that provide this: Azure Cosmos DB (session consistency), MongoDB with read preference primary
The Decision Tree
Choose Linearizable when:
Money is involved (payments, account balances)
Inventory is limited (ticket sales, product stock)
Regulatory compliance requires audit trails
Pattern: Correctness > Speed, always
Choose Causal when:
Social interactions (posts, comments, reactions)
Collaborative editing (Google Docs, Figma)
Chat applications
Pattern: User expectations require logical ordering
Choose Session when:
User profiles and preferences
Shopping carts
Any single-user workflow
Pattern: “I see my changes immediately” matters, cross-user sync doesn’t
Choose Eventual when:
Analytics and metrics
Content recommendations
Search indices
View/like counts
Pattern: Speed matters, approximate is good enough
How Systems Fail (And Recover)
Twitter’s 2014 Timeline Ordering Bug
Problem: Tweets appeared out-of-order during replication lag
Root cause: Causal consistency violated—replies appeared before original tweets
Fix: Added explicit happens-before tracking for tweet threads
Reddit Vote Count Jumps
Problem: Upvote count changes dramatically on page refresh
Root cause: Eventual consistency—counting across replicas with stale reads
Fix: Added “updated X seconds ago” indicators, set user expectations
The Amazon Shopping Cart (2007)
Problem: Deleted items resurrected after partition
Root cause: Eventual consistency with no conflict resolution
Fix: Vector clocks + client-side merge logic in Dynamo
Key lesson: Weak consistency isn’t failure. Hiding it from users is.
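To see why the cart items resurrected, here's a minimal sketch of the merge problem, using hypothetical replica states and a naive "union the versions" merge rather than Dynamo's actual code. The standard remedy is to record deletes explicitly as tombstones, so a merge can distinguish "deleted it" from "never saw it":

// Replica A saw the delete; Replica B was partitioned and did not.
const replicaA = { items: ['book'] };          // 'lamp' was deleted here
const replicaB = { items: ['book', 'lamp'] };  // stale: still has 'lamp'

// Naive merge: union of both versions. The delete is silently lost.
const naive = [...new Set([...replicaA.items, ...replicaB.items])];
console.log(naive); // ['book', 'lamp']: back from the dead

// With a tombstone, the merge knows 'lamp' was deliberately removed.
const tombstones = ['lamp'];
const merged = naive.filter(item => !tombstones.includes(item));
console.log(merged); // ['book']: the delete survives the merge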
Debugging Consistency Issues
Symptom: “My write disappeared”
Check if you’re reading from a different replica than you wrote to
Verify write acknowledgment (did it actually succeed?)
Look for partition healing—merges can “undo” writes
Symptom: “Data time-travels backward”
Reading from replicas with different lag
Check read-after-write guarantees in your client library
Consider session consistency or sticky routing
Symptom: “Conflicts I can’t explain”
Concurrent writes during partition
Check if system uses last-write-wins (LWW) or vector clocks
Look for application-level conflict resolution bugs
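When chasing the symptoms above, a blunt but effective diagnostic is a probe that writes a unique value and then polls until a read returns it, which gives you a rough measure of replication lag. A sketch against the same hypothetical key-value client (db.write / db.read) used in the earlier examples:

// Measure read-after-write lag: write a unique token, then poll a
// (possibly different) replica until the token becomes visible.
async function measureLag(db, key, { timeoutMs = 5000, intervalMs = 50 } = {}) {
  const token = `probe-${Date.now()}-${Math.random()}`;
  const start = Date.now();
  await db.write(key, token);
  while (Date.now() - start < timeoutMs) {
    if (await db.read(key) === token) {
      return Date.now() - start; // observed propagation lag in ms
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  throw new Error(`write not visible after ${timeoutMs}ms; check routing and acks`);
}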
Practical Takeaways
1. Consistency is a spectrum, not a binary
Stop saying “my DB is consistent.” Specify: linearizable? causal? eventual?
2. Most apps need multiple consistency levels
Bank balance: strong. User bio: session. Feed ranking: eventual.
3. Latency and consistency are inversely related
Every strengthening of the guarantee adds coordination, and coordination adds round trips. Pick your battles.
4. Communicate your guarantees to users
“Updated 5 seconds ago” > silent staleness. Honesty scales.
5. Test with partitions, not just load
Use chaos engineering (Chaos Monkey, Gremlin) to simulate real failure modes.
Further Reading
This is Chapter 2 from my book on distributed systems fundamentals. Subscribe for weekly breakdowns of consensus, replication, and the other concepts that actually matter in production.
Quick context: If you’re jumping in here, this builds on CAP Theorem—the fundamental trade-off between consistency and availability during network partitions. Start with CAP first if you haven’t read it yet. This post assumes you understand why systems can’t have perfect consistency AND availability.