Designing a Rate Limiter: Trade-offs, Failure Modes, and Production Reality

Why Rate Limiting Matters

Rate limiting is a first line of defense in any distributed system. Without it, a single buggy script or a DDoS attack can starve legitimate users of resources, degrade database performance, and drive up cloud costs.

Beyond protection, rate limiting is a business requirement. It enforces tiered pricing models (e.g., "Basic users get 100 req/hour, Pro gets 1000") and manages quotas across microservices to prevent cascading failures.

Common Rate Limiting Strategies

There isn't one "correct" algorithm. Each comes with distinct behaviors.

1. Fixed Window

The timeline is divided into fixed intervals (e.g., 1 minute). A counter increments for each request and resets when a new interval begins.

Pros: Simple, memory efficient.
Cons: Vulnerable to spikes at window boundaries (e.g., 100 requests at 10:00:59 and 100 at 10:01:01 are all admitted, which is double the intended rate within about two seconds).
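
The boundary problem is easier to see in code. The following is a minimal single-process sketch, not a production implementation: the class name `FixedWindowLimiter` and the explicit `now` parameter (seconds, passed in so the example is deterministic) are assumptions made for illustration.

```python
class FixedWindowLimiter:
    """Allow at most `limit` requests per fixed window of `window_s` seconds."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None  # index of the window the counter belongs to
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            # A new window has started: reset the counter.
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


# Boundary spike: 100 requests just before the minute mark and 100 just
# after it land in different windows, so every one of them is admitted.
limiter = FixedWindowLimiter(limit=100, window_s=60)
late = sum(limiter.allow(59.0) for _ in range(100))
early = sum(limiter.allow(61.0) for _ in range(100))
admitted = late + early
```

Here `admitted` ends up at 200 even though the configured limit is 100 per minute, which is the spike described above.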

2. Sliding Window Log

Tracks the timestamp of every request. When a new request comes in, we remove timestamps older than the window and count what remains.

Pros: Highly accurate.
Cons: Expensive. Memory grows with the number of requests inside the window, so storing timestamps for millions of requests consumes a lot of memory.
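
A minimal sketch of the log approach, assuming a single process and an explicit `now` argument in seconds. The name `SlidingWindowLog` is illustrative. Note that one timestamp is kept per admitted request, which is where the memory cost comes from.

```python
from collections import deque


class SlidingWindowLog:
    """Allow at most `limit` requests in any trailing `window_s` seconds."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.log = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Evict timestamps that have fallen out of the window (now - window_s, now].
        while self.log and self.log[0] <= now - self.window_s:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True


limiter = SlidingWindowLog(limit=2, window_s=60)
decisions = [limiter.allow(t) for t in (0.0, 59.0, 60.0, 61.0, 119.0)]
```

Unlike the fixed window, there is no boundary to exploit: the request at 61.0 is rejected because two requests (at 59.0 and 60.0) are still inside the trailing 60 seconds.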

3. Token Bucket

A bucket is filled with tokens at a constant rate, up to a fixed capacity. Each request consumes a token. If the bucket is empty, the request is rejected.

Pros: Allows for "bursts" of traffic while maintaining a steady average.
Cons: Slightly more complex to implement correctly in distributed environments.
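
A minimal single-process sketch, assuming lazy refill (tokens are topped up when a request arrives rather than by a background timer). The names `TokenBucket`, `capacity`, and `refill_per_s` are illustrative, and `now` is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Token bucket with lazy refill: burst up to `capacity`, average `refill_per_s`."""

    def __init__(self, capacity, refill_per_s, now=0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)  # start with a full bucket
        self.last = now

    def allow(self, now, cost=1.0):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = max(self.last, now)
        if self.tokens < cost:
            return False
        self.tokens -= cost
        return True


bucket = TokenBucket(capacity=5, refill_per_s=1.0)
burst = sum(bucket.allow(now=0.0) for _ in range(10))  # only 5 of the 10 get through
```

The capacity bounds the burst size and the refill rate bounds the long-run average, so the two can be tuned independently.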

4. Leaky Bucket

Requests enter a queue (bucket) and are processed at a constant rate. If the queue is full, new requests are discarded.

Pros: Smooths out bursty traffic into a stable outflow.
Cons: Can introduce latency if the queue is long.
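
A minimal sketch of the queue-based variant. It assumes some external worker or timer calls `leak()` at a fixed interval, which is what produces the constant outflow. The names `LeakyBucket`, `offer`, and `leak` are illustrative.

```python
from collections import deque


class LeakyBucket:
    """Bounded FIFO queue drained one request at a time by a fixed-rate worker."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, request):
        """Enqueue a request, or reject it when the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        """Release the oldest queued request, or None if the bucket is empty."""
        return self.queue.popleft() if self.queue else None


bucket = LeakyBucket(capacity=3)
accepted = [bucket.offer(f"r{i}") for i in range(5)]  # burst of 5 into a bucket of 3
released = [bucket.leak() for _ in range(4)]          # four ticks of the worker
```

The last queued request waits three ticks before it is processed, which is the latency cost mentioned above.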

Choosing the Right Strategy (Trade-offs)

| Metric        | Token Bucket | Leaky Bucket | Fixed Window     | Sliding Window Log |
| :---          | :---         | :---         | :---             | :---               |
| Burst Support | Yes          | No           | Yes (accidental) | No                 |
| Accuracy      | ⭐⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐⭐⭐⭐ |
| Memory Cost   | 🟢 Low | 🟢 Low | 🟢 Low | 🔴 High |
| Complexity    | 🟡 Medium | 🟡 Medium | 🟢 Low | 🟡 Medium |

For most general-purpose APIs, Token Bucket is a common default because it balances fairness with the reality that traffic is naturally bursty.

Single-Instance Rate Limiting

On a single server, rate limiting is straightforward. You can use an in-memory hash map or Guava's RateLimiter in Java. Since all state is local, there is no network latency, and every request is counted against the same state, so consistency comes for free.

However, modern systems are rarely single-instance. As soon as you add a load balancer and a second server, local rate limiting fails because Server A doesn't know about the traffic on Server B.
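
To illustrate, here is a sketch of an in-memory hash-map limiter and what happens when two copies of it sit behind a round-robin load balancer. `LocalLimiter`, the client ID, and the round-robin dispatch are all assumptions made for the example. Old window keys are never evicted here, which a real implementation would need to handle.

```python
from collections import defaultdict


class LocalLimiter:
    """Per-process fixed-window counters held in an in-memory dict."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.counts = defaultdict(int)  # (client_id, window index) -> count

    def allow(self, client_id, now):
        key = (client_id, int(now // self.window_s))
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True


# Two servers, each enforcing "100 requests per minute" on its own state.
servers = [LocalLimiter(limit=100, window_s=60) for _ in range(2)]

# A round-robin load balancer alternates 300 requests between them.
admitted = sum(servers[i % 2].allow("client-42", now=5.0) for i in range(300))
```

Each server admits its own 100, so the client gets 200 requests through a limit that was meant to be 100. With N servers the effective limit becomes N times the configured one.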

Distributed Rate Limiting

In a distributed system, we need a shared "source of truth" for our counters. Redis is the standard choice here due to its speed and support for atomic operations.

The Naive Approach (Race Conditions)

Read the counter -> Increment it -> Write it back. This fails under concurrency. Two requests reading "9" simultaneously will both write "10", effectively allowing one extra request.

The Solution: Lua Scripts

Redis allows us to run Lua scripts atomically. This ensures that the "check and increment" logic happens as a single, indivisible step.

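-- Simple Fixed Window Lua Script
-- KEYS[1]: counter key
-- ARGV[1]: request limit per window
-- ARGV[2]: window length in seconds (optional, defaults to 60)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2] or "60")
local current = tonumber(redis.call("GET", key) or "0")

if current + 1 > limit then
  return 0 -- Rejected
end

local count = redis.call("INCR", key)
if count == 1 then
  -- First request of the window: set the TTL once.
  -- Calling EXPIRE on every request would keep pushing the window forward.
  redis.call("EXPIRE", key, window)
end
return 1 -- Allowed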
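
The lost update from the naive approach (two requests both reading 9 and writing 10) can be replayed deterministically in plain Python. The sketch below is an in-process analogy only: a dict stands in for the shared store, and a `threading.Lock` stands in for the atomicity Redis gives a script. `AtomicCounter` and `incr_if_below` are illustrative names.

```python
import threading

# Naive read -> increment -> write, with both reads landing before either write.
store = {"counter": 9}
read_a = store["counter"]      # request A reads 9
read_b = store["counter"]      # request B reads 9
store["counter"] = read_a + 1  # A writes 10
store["counter"] = read_b + 1  # B also writes 10: one increment is lost
naive_result = store["counter"]


class AtomicCounter:
    """Check-and-increment as one indivisible step, guarded by a lock."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def incr_if_below(self, limit):
        with self._lock:
            if self._value + 1 > limit:
                return False
            self._value += 1
            return True

    @property
    def value(self):
        return self._value
```

Two requests were served but the counter only moved from 9 to 10. With the check and the increment under one lock, concurrent callers can never push the counter past the limit.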

Failure Modes

Even with Redis, things go wrong.

Redis Down: If the rate limiter fails, do you block all traffic (fail closed) or let everything through (fail open)? In most cases, fail open is safer for business continuity, though risky for backend load.
Clock Skew: If sliding windows are computed from each application server's own clock, drift between machines means two servers can disagree on whether a request falls inside the window.
Hot Keys: A global rate limit (e.g., "10,000 req/s for the whole app") turns one Redis key into a hotspot. In a Redis Cluster a single key always lives on one node, so that node becomes the bottleneck no matter how many nodes the cluster has.
Network Latency: Every API call now pays the penalty of a round-trip to Redis.
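
The fail-open versus fail-closed choice can be made explicit as a policy around the limiter call. This is a sketch under assumptions: `check_with_policy` and `redis_down` are hypothetical names, and the built-in `ConnectionError` and `TimeoutError` stand in for whatever exception types your Redis client actually raises.

```python
def check_with_policy(check, key, fail_open=True):
    """Run `check(key)`; if the limiter backend is unavailable, apply the policy.

    fail_open=True  -> admit the request when the backend cannot be reached
    fail_open=False -> reject the request when the backend cannot be reached
    """
    try:
        return check(key)
    except (ConnectionError, TimeoutError):
        return fail_open


def redis_down(key):
    # Stand-in for a limiter call made while the backend is unreachable.
    raise ConnectionError("rate limiter backend unreachable")


open_decision = check_with_policy(redis_down, "user:1", fail_open=True)
closed_decision = check_with_policy(redis_down, "user:1", fail_open=False)
```

Keeping the policy as a parameter makes it possible to fail open on most routes while failing closed on the few endpoints where overload would be worse than downtime.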

Graceful Degradation Strategies

When the rate limiter is under stress:

Local Caching: Store a short-lived count locally. Sync with Redis asynchronously (batching increments). This sacrifices strict consistency for performance.
Hierarchical Limits: Enforce a strict local limit (e.g., 100 req/s per pod) as a fallback if the distributed Redis check times out.
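
The local caching idea can be sketched as a counter that batches its increments. Everything here is illustrative: `BatchingCounter` is a hypothetical class, and `flush_fn` is an assumed callback that adds a delta to the shared counter and returns the new global total (with Redis this could be an INCRBY call, which returns the updated value). The flush is synchronous to keep the example short, whereas the text above describes doing it asynchronously.

```python
class BatchingCounter:
    """Counts locally and pushes increments to the shared store in batches."""

    def __init__(self, flush_fn, batch_size):
        self.flush_fn = flush_fn    # hypothetical: delta -> new global total
        self.batch_size = batch_size
        self.pending = 0            # increments not yet sent to the shared store
        self.known_global = 0       # global total as of the last flush

    def record(self):
        """Count one request and return this instance's estimate of the global total."""
        self.pending += 1
        if self.pending >= self.batch_size:
            self.known_global = self.flush_fn(self.pending)
            self.pending = 0
        return self.known_global + self.pending


# A dict stands in for the shared store so the example runs without Redis.
shared = {"total": 0, "round_trips": 0}


def flush(delta):
    shared["total"] += delta
    shared["round_trips"] += 1
    return shared["total"]


pod_a = BatchingCounter(flush, batch_size=10)
pod_b = BatchingCounter(flush, batch_size=10)
for _ in range(25):
    estimate_a = pod_a.record()
for _ in range(10):
    estimate_b = pod_b.record()
```

Thirty-five requests cost three round-trips instead of thirty-five, but the pods now disagree: pod A believes the total is 25 and pod B believes it is 30. That gap is the consistency being traded for performance.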

What I Would Do in Production

If I were building this today:

Use a Sidecar: Deploy the rate limiter as a sidecar (like Envoy) rather than embedding logic in the application code. This decouples infrastructure from business logic.
Hybrid Approach: Use Redis for the authoritative count but cache the "REJECT" state locally for a few seconds to save Redis round-trips during a DDoS attack.
Metrics First: You can't limit what you don't measure. I would instrument specifically for "Limit Reached" events to tune the thresholds.
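
The hybrid approach of caching the "REJECT" state could look roughly like this. It is a sketch, not a reference design: `RejectCache` is a hypothetical wrapper, `check` is an assumed callable that performs the authoritative (for example Redis-backed) decision, and `now` is passed in explicitly for determinism.

```python
class RejectCache:
    """Remembers rejections locally for `ttl_s` seconds to skip repeat round-trips."""

    def __init__(self, check, ttl_s):
        self.check = check        # hypothetical authoritative check: key -> bool
        self.ttl_s = ttl_s
        self.blocked_until = {}   # key -> time until which we reject locally

    def allow(self, key, now):
        if self.blocked_until.get(key, float("-inf")) > now:
            return False  # answered from the local cache, no round-trip
        allowed = self.check(key)
        if allowed:
            self.blocked_until.pop(key, None)
        else:
            self.blocked_until[key] = now + self.ttl_s
        return allowed


backend_calls = []


def always_reject(key):
    # Stand-in for an authoritative check on a client that is over its limit.
    backend_calls.append(key)
    return False


cache = RejectCache(always_reject, ttl_s=2.0)
decisions = [cache.allow("attacker-ip", t) for t in (0.0, 0.5, 1.0, 1.9, 2.0)]
```

Five rejected requests cost only two backend calls. The trade-off is that a client stays blocked for up to `ttl_s` seconds after its quota would have recovered, so the TTL should be short relative to the window.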

Closing Thoughts