Redis Fallback Architecture: Designing for Failure
Why This Problem Matters
Redis is often treated as a "primary" data store in the minds of developers because it is so reliable. But when Redis fails—whether due to network partitions, memory limits, or a cloud provider outage—the consequences can be catastrophic.
If your application treats a failed cache read as a fatal error, your availability drops to zero whenever Redis is unreachable. If your fallback strategy indiscriminately hammers your primary database, you risk cascading failures that take down your entire platform. This article explores how to architect a robust fallback mechanism that treats Redis as an optimization, not a dependency.
Typical Redis Failure Scenarios
Before designing a solution, we must understand how Redis fails. It is rarely a clean "off" switch.
- Connection Timeouts: The server is up, but the network is congested. The client waits for x seconds before giving up.
- Latency Spikes: A slow command (like KEYS *) blocks the single-threaded event loop, causing all other requests to time out.
- Eviction Storms: Redis runs out of memory and aggressively evicts keys, causing a massive spike in cache misses.
- Hard Down: The instance crashes or the cloud provider has an outage.
Design Goals
Our fallback architecture must satisfy three specific goals:
- Fail Open: If Redis is down, the user should still get data (even if it's slightly slower).
- Protect the Database: We cannot simply redirect 100% of cache traffic to the DB, or it will melt.
- Self-Healing: When Redis comes back, the system should automatically recover without manual intervention.
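One way to approach the "Protect the Database" goal is to cap how many fallback queries may run at the same time and shed the rest. The sketch below is illustrative only: `FallbackLimiter`, `max_concurrent`, and the `default` return value are hypothetical names, and the right limit depends on what your database can absorb.

```python
import threading


class FallbackLimiter:
    """Caps how many fallback queries may hit the database at once."""

    def __init__(self, max_concurrent=10):
        # max_concurrent is an assumed tuning knob, not a recommended value.
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, query_fn, default=None):
        # Non-blocking acquire: if every slot is busy, shed the request
        # instead of queueing more load onto the database.
        if not self._slots.acquire(blocking=False):
            return default
        try:
            return query_fn()
        finally:
            self._slots.release()
```

A caller would wrap the database read, for example `limiter.run(lambda: load_user(user_id), default=None)`, and treat the default as a signal to degrade gracefully. Here `load_user` stands in for whatever query function the application already has.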
Architecture Overview
A common approach is a Circuit Breaker pattern layered over a Cache-Aside strategy, in which the application reads from the cache first and loads from the database on a miss.
A typical request flow during a Redis outage looks like this:
Request → API → Circuit Breaker (OPEN)
→ Skip Redis → Database
→ Return response
→ (Optional) Async cache warm-up when Redis recovers
- Application requests data.
- Circuit Breaker checks the health of the Redis connection.
- If Healthy: Attempt to read from Redis.
- If Unhealthy (Open State): Skip Redis entirely and go directly to the fallback (Database or Local Cache).
- Background: Periodically check if Redis is back online (Half-Open State).
Fallback Strategies
1. Graceful Degradation (The "Good Enough" Approach)
If the data isn't critical (e.g., recommendation lists, view counts), return a default value or empty list instead of querying the database. This preserves the core user experience (signing in, checkout) while sacrificing non-essential features.
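A minimal sketch of this idea, assuming a hypothetical `cache_get` callable and a hand-written table of defaults per feature (neither comes from a specific library):

```python
# Hypothetical default factories for non-critical features.
# Calling list() yields [] and int() yields 0, so every caller
# receives a fresh default object.
DEFAULT_FACTORIES = {
    "recommendations": list,
    "view_count": int,
}


def read_non_critical(feature, cache_get, key):
    """Return the cached value, or a safe default if the cache fails or is empty."""
    try:
        value = cache_get(key)
    except Exception:
        # Broad catch for brevity; in practice, narrow this to the
        # connection and timeout errors raised by your Redis client.
        value = None
    if value is None:
        return DEFAULT_FACTORIES[feature]()
    return value
```

The database is never touched on this path, which is the point: a Redis outage costs the user a recommendation list, not a working checkout.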
2. The Circuit Breaker
We wrap all Redis calls in a circuit breaker. If failures such as timeouts or connection errors exceed a threshold (e.g., 5 failures in 10 seconds), the breaker "trips." It moves between three states:
- Closed: Traffic flows to Redis.
- Open: Traffic goes straight to DB/Fallback. Redis is given time to recover.
- Half-Open: Allow 1 test request through. If it succeeds, close the breaker. If it fails, reopen the breaker and wait again.
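A threshold such as "5 failures in 10 seconds" implies a sliding time window rather than a simple running count. The following sketch shows one way to track that; `FailureWindow` and its parameters are hypothetical names, and the injectable `clock` exists only to make the class easy to test.

```python
import time
from collections import deque


class FailureWindow:
    """Counts failures that occurred within the last `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=10.0, clock=time.monotonic):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._clock = clock
        self._failures = deque()  # Timestamps of recent failures, oldest first

    def record_failure(self):
        now = self._clock()
        self._failures.append(now)
        self._prune(now)

    def should_trip(self):
        # Trip only if enough failures are still inside the window.
        self._prune(self._clock())
        return len(self._failures) >= self.threshold

    def _prune(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
```

A breaker would call `record_failure()` in its error path and move to the open state once `should_trip()` returns true. Old failures expire on their own, so occasional errors spread over hours never trip it.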
3. Local In-Memory L1 Cache
For high-read, low-change data (like configuration flags), use a local in-memory cache (like lru-cache in Node or Guava in Java) with a short TTL (e.g., 30 seconds). This acts as a buffer if Redis goes down.
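In Python, the same idea can be sketched with a dictionary and a monotonic clock. This is a simplified illustration, not a replacement for a real caching library: `TTLCache` is a hypothetical name, the cache has no size bound, and it is not thread-safe.

```python
import time


class TTLCache:
    """A tiny in-process cache whose entries expire after `ttl_seconds`."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return default
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self.ttl_seconds, value)
```

On the read path, the application would check this local cache first, then Redis, then the database, and populate the local cache with whatever it finds.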
Implementation Example
Here is a simplified Python example using a Circuit Breaker pattern:
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failures = 0
        self.threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.last_failure_time = 0.0
        self.state = "CLOSED"

    def reset(self):
        # Redis answered: clear the failure count and close the breaker.
        self.failures = 0
        self.state = "CLOSED"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = "HALF_OPEN"  # Let one trial request through
            else:
                return None  # Fail fast; the caller falls back
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            self.last_failure_time = time.monotonic()
            # A failed trial request reopens the breaker immediately.
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state = "OPEN"
            return None  # Return None to trigger the DB fallback
        self.reset()  # Any success clears the consecutive-failure count
        return result


# redis_client and db are assumed to be configured elsewhere.
breaker = CircuitBreaker()


def get_user_data(user_id):
    # Try Redis via the circuit breaker
    cache_data = breaker.call(redis_client.get, f"user:{user_id}")
    if cache_data is not None:
        return cache_data

    # Fall back to the database
    print("Fetching from DB...")
    # Parameterized query instead of string formatting, to avoid SQL
    # injection; the placeholder syntax depends on your DB driver.
    db_data = db.query("SELECT * FROM users WHERE id = %s", (user_id,))

    # Write-back (optional): only while the breaker is closed, to avoid
    # overwhelming a recovering cache. The write goes through the breaker
    # so a failed write cannot break the request.
    if breaker.state == "CLOSED":
        breaker.call(redis_client.set, f"user:{user_id}", db_data)
    return db_data