
feat(redis): survive transient Redis outages with bounded reconnects#76

Open
ChiragAgg5k wants to merge 1 commit into main from feat/redis-resilience-retries

Conversation

@ChiragAgg5k

Summary

Harden the Redis broker and connection adapters so workers survive transient Redis outages (DNS flaps, failover, restarts, brief network partitions) instead of crash-looping under a supervisor.

  • Connection layer (Connection/Redis.php, Connection/RedisCluster.php): lazy getRedis() now retries up to 5 attempts with exponential backoff + full jitter (100 ms base, 3 s cap) before throwing. close() is best-effort — swallows Throwable so a dead socket doesn't mask the original error.
  • Broker (Broker/Redis.php): consume() catches RedisException raised by the blocking pop, drops the stale connection, applies exponential backoff with full jitter (100 ms base, 5 s cap), and continues. The attempt counter resets on the first successful pop so each outage starts from a fresh backoff curve.
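
Both layers follow the same full-jitter recipe; a minimal sketch of the computation (illustrative function name and parameters — the actual constants live in the files listed below):

```php
<?php
// Minimal sketch of the full-jitter backoff described above. The function
// name and parameters are illustrative, not the PR's actual code.
function fullJitterBackoffMs(int $attempt, int $baseMs, int $capMs): int
{
    // Exponential growth from the base, capped so the ceiling stays bounded.
    $ceilingMs = \min($capMs, $baseMs * (2 ** ($attempt - 1)));

    // Full jitter: pick a uniform sleep anywhere in [0, ceiling].
    return \mt_rand(0, $ceilingMs);
}

// Attempt 1 with a 100 ms base sleeps somewhere in [0, 100] ms; by attempt 6
// the 3 s connection-layer cap binds, since min(3000, 100 * 2 ** 5) = 3000.
$sleepMs = fullJitterBackoffMs(1, 100, 3000);
```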

Motivation

Before this change, a single RedisException during brPop would bubble out of consume() and kill the worker process. Any transient Redis issue — failover, restart, brief network partition — caused the whole fleet to crash simultaneously and rely on the process supervisor to restart them, which re-opens every connection at the same instant and creates a thundering herd on the recovering Redis.

Similarly, getRedis() opened a single socket with no retry, so a one-off DNS or TCP hiccup during boot would surface as an unrecoverable failure to the caller.

What changed

src/Queue/Connection/Redis.php

  • Added CONNECT_MAX_ATTEMPTS (5), CONNECT_BASE_BACKOFF_MS (100), CONNECT_MAX_BACKOFF_MS (3000) constants.
  • getRedis() wraps new \Redis() + connect() + setOption() in a retry loop. On failure it re-throws the original \RedisException from the final attempt (no null-throw; PHPStan-clean).
  • close() wraps $this->redis?->close() in try/finally and swallows \Throwable. The socket may already be dead at this point.
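
The shape of this retry loop, sketched under the constants above (a hypothetical simplification of the actual method — it omits setOption() handling and auth):

```php
<?php
// Illustrative sketch of the getRedis() retry loop described above;
// simplified from the PR's code (no setOption(), no cluster handling).
function connectWithRetry(string $host, int $port, float $timeout): \Redis
{
    $maxAttempts = 5;  // CONNECT_MAX_ATTEMPTS
    $baseMs = 100;     // CONNECT_BASE_BACKOFF_MS
    $capMs = 3000;     // CONNECT_MAX_BACKOFF_MS

    for ($attempt = 1; ; $attempt++) {
        try {
            $redis = new \Redis();
            $redis->connect($host, $port, $timeout);
            return $redis;
        } catch (\RedisException $e) {
            if ($attempt >= $maxAttempts) {
                // Re-throw the final attempt's exception: no null-throw,
                // and the caller sees the real connect failure.
                throw $e;
            }
            $ceilingMs = \min($capMs, $baseMs * (2 ** ($attempt - 1)));
            \usleep(\mt_rand(0, $ceilingMs) * 1000); // full jitter
        }
    }
}
```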

src/Queue/Connection/RedisCluster.php

  • Same retry constants, same close() hardening, same loop around new \RedisCluster(...), catching \RedisClusterException.

src/Queue/Broker/Redis.php

  • Added RECONNECT_BASE_BACKOFF_MS (100), RECONNECT_MAX_BACKOFF_MS (5000), RECONNECT_BACKOFF_CAP_SHIFT (10) constants.
  • consume() catches \RedisException from the blocking pop. If the broker was closed, it exits cleanly. Otherwise it drops the stale connection via $this->connection->close(), sleeps for mt_rand(0, backoffMs) (full jitter), and continues the loop.
  • Shift cap prevents integer overflow on very long-running processes; MAX_BACKOFF_MS caps the wall-clock sleep independently.
  • Attempt counter resets to 0 after any successful pop.
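
Putting the bullets together, the recovery path sketches out roughly as follows (names mirror the description above, but this is illustrative code, not the PR's diff):

```php
<?php
// Illustrative consume() recovery loop; $connection, $isClosed, and the
// queue key stand in for the broker's real state.
function consumeLoop($connection, callable $handle, callable $isClosed, string $key): void
{
    $reconnectAttempts = 0;
    while (true) {
        try {
            $message = $connection->rightPopArray($key, 5);
            $reconnectAttempts = 0; // fresh backoff curve after a good pop
            if ($message !== null) {
                $handle($message);
            }
        } catch (\RedisException $e) {
            if ($isClosed()) {
                return; // deliberate close(): exit cleanly
            }
            $connection->close(); // drop the stale socket
            $reconnectAttempts++;
            $shift = \min(10, $reconnectAttempts - 1);    // RECONNECT_BACKOFF_CAP_SHIFT
            $ceilingMs = \min(5000, 100 * (2 ** $shift)); // MAX vs. BASE * 2**shift
            \usleep(\mt_rand(0, $ceilingMs) * 1000);      // full jitter
        }
    }
}
```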

Swoole considerations

The usleep() calls cooperate with the Swoole reactor because src/Queue/Adapter/Swoole.php:37 sets SWOOLE_HOOK_ALL, which hooks usleep to Coroutine::sleep. If that flag is ever narrowed, these sleeps will block the reactor — worth a note for future maintenance.

Design notes

  • Full jitter is used (rather than equal or decorrelated jitter) because the realistic failure mode is all workers losing the connection simultaneously; full jitter produces the flattest arrival distribution on recovery.
  • The broker retries without bound. There is no max-attempt ceiling in consume() — a worker should stay alive across arbitrarily long outages. Operators rely on closed=true from close() to end the loop.
  • Caught type is \RedisException in the broker, which is an abstraction leak (broker depends on phpredis exception types) but consistent with the file's name and existing coupling. Translating to a neutral ConnectionException is out of scope here.

Test plan

  • Unit: connection retries exactly CONNECT_MAX_ATTEMPTS times, then throws the final \RedisException (verify no null throw).
  • Unit: backoff bounds — first attempt sleep is in [0, 100] ms, capped sleeps never exceed MAX_BACKOFF_MS.
  • Integration: kill Redis (docker compose stop redis) while a worker is idle in brPop; verify the worker logs no crash, recovers on docker compose start redis, and resumes consuming.
  • Integration: kill Redis mid-message handling; verify the in-flight message completes or is moved to the failed list without losing the worker.
  • Integration: confirm with multiple workers that reconnect times are spread out (full-jitter sanity check), not all at t+0.
  • Static: vendor/bin/phpstan analyse --memory-limit=1G reports zero errors.
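
The backoff-bounds unit test could be sketched like this (hypothetical: it replicates the backoff computation rather than calling the broker's private logic):

```php
<?php
// Hypothetical sketch of the backoff-bounds check: replicate the
// computation and assert the documented bounds hold over many samples.
function backoffCeilingMs(int $attempt, int $baseMs, int $capMs): int
{
    return \min($capMs, $baseMs * (2 ** \min(10, $attempt - 1)));
}

for ($i = 0; $i < 1000; $i++) {
    $first = \mt_rand(0, backoffCeilingMs(1, 100, 5000));
    assert($first >= 0 && $first <= 100); // first sleep in [0, 100] ms

    $late = \mt_rand(0, backoffCeilingMs(50, 100, 5000));
    assert($late <= 5000);                // never exceeds MAX_BACKOFF_MS
}
```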

Out of scope / follow-ups

  • Connection\Redis constructor accepts $user/$password but getRedis() never calls auth(). Pre-existing bug; worth a separate PR.
  • Telemetry hook or log on reconnect — operators currently get no signal when a worker enters the retry loop. Candidate for a follow-up using utopia-php/telemetry.
  • Translate phpredis exceptions into a driver-neutral ConnectionException so the broker stops depending on \RedisException directly.
  • Promote CONNECT_MAX_ATTEMPTS etc. to constructor parameters if operators want to tune them per-deployment.

The broker's consume() loop previously rethrew any RedisException raised
during the blocking pop, crashing the worker on every transient network
blip. The connection layer also opened a brand-new socket on the first
call with no retry, so a single DNS or TCP hiccup during boot would take
the process down.

Connection layer (Redis, RedisCluster):
  - getRedis() now retries up to 5 attempts with exponential backoff
    (100ms base, 3s cap) and full jitter to avoid thundering herd on
    recovery.
  - close() is best-effort and swallows Throwable so a dead socket
    doesn't mask the original error.

Broker (Redis):
  - On RedisException during pop, drop the stale connection and retry
    with exponential backoff (100ms base, 5s cap, full jitter). Worker
    stays alive across outages instead of crash-looping under a
    supervisor.
  - Attempt counter resets on the first successful pop so each outage
    starts from a fresh backoff.

Relies on the SWOOLE_HOOK_ALL hook flags set in Adapter/Swoole.php so
usleep yields cooperatively inside coroutines rather than blocking the
reactor.
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR hardens the Redis broker and connection adapters to survive transient Redis outages by adding exponential backoff with full jitter in getRedis() (connection layer) and consume() (broker layer), and making close() best-effort. The connection-layer changes are clean; however, the broker's reconnect logic catches only \RedisException, leaving RedisCluster users unprotected since phpredis throws the sibling type \RedisClusterException from cluster operations.

  • P1: Broker/Redis.php catch (\RedisException) misses \RedisClusterException — workers backed by Connection/RedisCluster will still crash on cluster outages despite this PR.

Confidence Score: 4/5

Safe to merge for standalone Redis users; blocks for RedisCluster users until the exception type is widened.

One P1 finding: the catch type in Broker/Redis.php doesn't cover \RedisClusterException, so the primary goal of this PR (surviving outages) is only partially achieved. The connection-layer changes are solid.

src/Queue/Broker/Redis.php — the catch clause needs to be widened to include \RedisClusterException.

Important Files Changed

  • src/Queue/Broker/Redis.php — Adds reconnect loop with exponential backoff in consume(); the catch is \RedisException only, so RedisCluster outages (which throw \RedisClusterException) remain unhandled and will still crash the worker.
  • src/Queue/Connection/Redis.php — Retry loop in getRedis() and best-effort close() are correct; minor socket-leak risk if setOption() throws after a successful connect().
  • src/Queue/Connection/RedisCluster.php — Retry logic mirrors Connection/Redis.php and correctly catches \RedisClusterException; best-effort close() hardening is consistent.

Reviews (1): Last reviewed commit: "feat(redis): survive transient Redis out..."

try {
    $nextMessage = $this->connection->rightPopArray("{$queue->namespace}.queue.{$queue->name}", self::POP_TIMEOUT);
    $reconnectAttempts = 0;
} catch (\RedisException $e) {


P1 RedisCluster outages not caught

The broker catches \RedisException, but when the connection layer is Connection/RedisCluster, phpredis throws \RedisClusterException (a sibling of \RedisException under RuntimeException, not a subclass). Any cluster outage during brPop will propagate uncaught and crash the worker just as before this PR — the new reconnect logic silently doesn't apply to the cluster path.

To cover both adapters, widen the catch:

Suggested change
} catch (\RedisException $e) {
} catch (\RedisException|\RedisClusterException $e) {

Comment on lines +205 to +213
try {
    $redis->connect($this->host, $this->port, $connectTimeout);

    if ($this->readTimeout >= 0) {
        $redis->setOption(\Redis::OPT_READ_TIMEOUT, $this->readTimeout);
    }

    $this->redis = $redis;
    return $this->redis;


P2 Leaked socket if setOption() throws

$redis has an active TCP connection after connect() succeeds. If setOption() throws a \RedisException, the exception is caught by the retry loop, a backoff sleep is applied, and a brand-new $redis is created — but the successfully-connected socket from the previous attempt is never closed. Each failed setOption() attempt leaks a file descriptor. While setOption rarely throws, wrapping the entire attempt in a try/finally to call $redis->close() on exception would be the safe pattern.

Comment on lines +46 to +50
$shift = \min(self::RECONNECT_BACKOFF_CAP_SHIFT, $reconnectAttempts - 1);
$backoffMs = \min(
    self::RECONNECT_MAX_BACKOFF_MS,
    self::RECONNECT_BASE_BACKOFF_MS * (2 ** $shift),
);


P2 RECONNECT_BACKOFF_CAP_SHIFT doesn't prevent integer overflow

The constant is described as guarding against integer overflow, but RECONNECT_MAX_BACKOFF_MS (5 000 ms) is always the binding cap — 2 ** 10 × 100 = 102 400 ms never takes effect in practice because min(5000, 102400) = 5000. The constant is harmless but the overflow-prevention justification is misleading; consider updating the comment to clarify it simply limits the exponent for readability.
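
The arithmetic is easy to check (constants taken from the description: base 100 ms, max 5 000 ms, shift cap 10):

```php
<?php
// Check which cap binds: with BASE=100 and MAX=5000, the wall-clock cap
// always wins before the shift cap ever matters.
$base = 100; $max = 5000; $capShift = 10;

$atCapShift = \min($max, $base * (2 ** $capShift)); // min(5000, 102400) = 5000

// The MAX cap already binds at shift 6: 100 * 2**6 = 6400 > 5000,
// while 100 * 2**5 = 3200 < 5000.
```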
