Analysis

Cache-line contention makes RwLock five times slower than Mutex

Redstone's author found that on commercial hardware an RwLock made a read-heavy Tensor Cache about 5× slower than a Mutex due to cache-line ping-pong and atomic contention.

Sam Ortega

"Read Locks Are Not Your Friends." That is the blunt finding from the Vercel post about Redstone, a high-performance Tensor Cache in Rust: "On commercial hardware, RwLock was ~5× slower than Mutex for a read-heavy cache workload due to atomic contention and cache-line ping-pong." The author reports that replacing a Mutex with an RwLock actually throttled throughput where reads were tiny and frequent.

The workload is specific: a read-heavy cache that does a HashMap lookup and an Arc::clone inside a lock, just a few nanoseconds of real work per critical section. The author wrote, "I thought a read lock would probably mitigate this (boy, was I wrong :( ). I expected the throughput to go through the roof since multiple threads could finally read simultaneously, the competition was not even close, write locks outperformed read locks by around 5X." The post reproduces the accessors used for get_with_read and get_with_write in a block of Rust code showing inner.read() and inner.write() calls.
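The shape of those accessors can be sketched as follows; the concrete cache type, key and value types, and unwrap-based error handling here are assumptions for illustration, not the post's actual code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical cache shape: the lock guards a HashMap of Arc-wrapped values,
// so the critical section is just a lookup plus an Arc::clone (a few ns).
struct Cache {
    inner: Arc<RwLock<HashMap<String, Arc<Vec<f32>>>>>,
}

impl Cache {
    fn get_with_read(&self, key: &str) -> Option<Arc<Vec<f32>>> {
        // Shared lock: readers don't block each other, but every acquisition
        // still performs an atomic RMW on the same lock-metadata cache line.
        self.inner.read().unwrap().get(key).cloned()
    }

    fn get_with_write(&self, key: &str) -> Option<Arc<Vec<f32>>> {
        // Exclusive lock: fully serialized, but with cheaper per-acquisition
        // bookkeeping than reader counting.
        self.inner.write().unwrap().get(key).cloned()
    }
}
```

With critical sections this short, the lock-acquisition cost dominates the work done under the lock, which is exactly the regime the post describes.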


The post names the mechanism: "The culprit is a phenomenon known as Cache Line Ping-Pong." Low-level coherence traffic under MESI makes the RwLock metadata a hot cache line that bounces between cores; the post warns that "If you see a lot of time spent in `atomic_add`, you have cache contention." Explanations in the discussion add that an RwLock must track reader counts and orchestrate wakeups, while a Mutex can be a single cheap atomic plus an occasional slow path, so the bookkeeping itself can become the bottleneck on multicore commercial CPUs.
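A minimal toy (not from the post) reproduces the pattern that shows up as `atomic_add` in profiles: several threads doing `fetch_add` on one shared counter, so every increment drags the same cache line into the incrementing core in exclusive state:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Each thread hammers the same atomic counter. On x86 the fetch_add compiles
// to a locked RMW, and the counter's cache line ping-pongs between cores --
// the same coherence traffic that a contended lock's metadata generates.
fn hammer(counter: Arc<AtomicUsize>, iters: usize) {
    for _ in 0..iters {
        counter.fetch_add(1, Ordering::Relaxed);
    }
}

fn contended_count(threads: usize, iters: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || hammer(c, iters))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Running this under `perf` on a multicore machine is one way to see the contention hotspot the post describes; the result is always `threads * iters`, but the wall-clock cost scales with the coherence traffic rather than the arithmetic.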

The Vercel author gives explicit remediation. First, beware of short critical sections: "If your work inside a lock takes only a few nanoseconds (like a Hashmap lookup), the overhead of an `RwLock` will almost always outweigh the benefits of concurrency." Second, profile the hardware: "Use tools like `perf` or `cargo-flamegraph`. If you see a lot of time spent in `atomic_add`, you have cache contention." Third, consider sharding: "Consider splitting your one giant cache into multiple buckets so that each holds a specific set of keys and has its own locking mechanisms; this will reduce the number of threads contending for the lock, and it may also increase throughput since the number of truly parallel operations increases."
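The sharding advice can be sketched as follows, assuming a fixed number of Mutex-guarded buckets routed by key hash; the structure and names are illustrative, not taken from the post:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

// N independent buckets, each with its own lock, so contention (and the
// associated cache-line traffic) is split across N separate lock words.
struct ShardedCache<V> {
    shards: Vec<Mutex<HashMap<String, Arc<V>>>>,
}

impl<V> ShardedCache<V> {
    fn new(n: usize) -> Self {
        Self {
            shards: (0..n).map(|_| Mutex::new(HashMap::new())).collect(),
        }
    }

    // Route each key to a bucket by hash.
    fn shard_for(&self, key: &str) -> &Mutex<HashMap<String, Arc<V>>> {
        let mut h = DefaultHasher::new();
        key.hash(&mut h);
        &self.shards[(h.finish() as usize) % self.shards.len()]
    }

    fn insert(&self, key: &str, value: V) {
        self.shard_for(key)
            .lock()
            .unwrap()
            .insert(key.to_string(), Arc::new(value));
    }

    fn get(&self, key: &str) -> Option<Arc<V>> {
        self.shard_for(key).lock().unwrap().get(key).cloned()
    }
}
```

With uniformly distributed keys, two threads only contend when they happen to hit the same bucket, so the expected contention drops roughly by the shard count.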

Community measurements are mixed and workload dependent. A Reddit poster reports "Apparently, `Arc<AtomicBool>` is ~67% faster in this case (average)." Another Reddit fragment reports "Arc<AtomicBool> is ~23% faster than `Arc<Mutex<bool>>` (average)" while a different Reddit poster claims "Even though `RwLock<T>` is slightly slower for isolated `write()` locks, it's ~63% faster in a whole loop cycle and has more stable timings (average)." A forum user measured lock costs directly: "Each locking of the `Arc<Mutex<bool>>` takes ~201 ns while the rest of the loop takes (with a single `Arc<Mutex<CPU>>` lock) ~418 ns. If I remove the CPU lock, it drops to ~90 ns." Users on the rust-lang channels reinforce the hardware story: "It takes as much effort to take `RwLock<T>` for `read` as with a normal `Mutex`, but since readers don't block each other it may be a good thing to do if you do a lot of work under the lock. But acquisition of the lock is still sequential." and "Yes, because your program creates insane contention for that poor single cache line with `RwLock`, everything else doesn't matter much."

Practical routes forward echo the post and related advice: "Short, `await`-free critical sections: prefer `parking_lot::Mutex` (often faster) or even better, avoid shared mutation altogether." "Don’t forget pragmatic tools like `ArcSwap` for low-overhead snapshotting of complex shared data." For async sections, use tokio::Mutex or refactor to avoid awaiting inside locks. The concrete conclusion: when critical sections last only a few nanoseconds and a single hot cache line receives all the traffic, RwLock can be dramatically slower than Mutex on commercial hardware. Profile with perf or cargo-flamegraph, look for atomic_add hotspots, then shard, switch primitives to atomics or parking_lot, or redesign to eliminate the single hot lock.
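`ArcSwap` is a third-party crate; a std-only analogue of its snapshot idea can be sketched like this, where readers briefly lock only to clone an `Arc` to an immutable map and writers publish a fresh map copy-on-write (an illustrative assumption, not the post's code):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Readers never hold the lock while touching the data: they clone an Arc to
// the current immutable snapshot and release the lock immediately. Writers
// build a modified copy and swap the Arc in, so old snapshots stay valid.
struct Snapshot<V> {
    current: Mutex<Arc<HashMap<String, V>>>,
}

impl<V: Clone> Snapshot<V> {
    fn new() -> Self {
        Self {
            current: Mutex::new(Arc::new(HashMap::new())),
        }
    }

    // Cheap read path: one Arc::clone under the lock, then lock-free access.
    fn load(&self) -> Arc<HashMap<String, V>> {
        Arc::clone(&self.current.lock().unwrap())
    }

    // Copy-on-write publish: clone the map, mutate the copy, swap the Arc.
    fn insert(&self, key: &str, value: V) {
        let mut guard = self.current.lock().unwrap();
        let mut next = (**guard).clone();
        next.insert(key.to_string(), value);
        *guard = Arc::new(next);
    }
}
```

This trades write cost (a full map clone per update) for a read path whose critical section is a single Arc refcount bump, which suits the read-heavy regime the post targets; the real `arc_swap` crate avoids even that lock on the read path.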
