feat(redis): survive transient Redis outages with bounded reconnects #76
ChiragAgg5k wants to merge 1 commit into main from
Conversation
The broker's consume() loop previously rethrew any RedisException raised
during the blocking pop, crashing the worker on every transient network
blip. The connection layer also opened a brand-new socket on the first
call with no retry, so a single DNS or TCP hiccup during boot would take
the process down.
Connection layer (Redis, RedisCluster):
- getRedis() now retries up to 5 attempts with exponential backoff
(100ms base, 3s cap) and full jitter to avoid thundering herd on
recovery.
- close() is best-effort and swallows Throwable so a dead socket
doesn't mask the original error.
Broker (Redis):
- On RedisException during pop, drop the stale connection and retry
with exponential backoff (100ms base, 5s cap, full jitter). Worker
stays alive across outages instead of crash-looping under a
supervisor.
- Attempt counter resets on the first successful pop so each outage
starts from a fresh backoff.
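The broker behaviour described above can be sketched as a small, self-contained loop. The function name and the injected callables here are hypothetical (the PR inlines this logic in consume()); they exist only so the retry path can be exercised without a live Redis:

```php
<?php

// Fallback so this sketch runs even without the phpredis extension loaded.
if (!\class_exists('RedisException')) {
    class RedisException extends \RuntimeException {}
}

// Keep popping; on \RedisException drop the connection and sleep a
// jittered, exponentially growing amount before retrying. Returning on
// success is the "attempt counter resets on the first successful pop".
function popWithReconnect(
    callable $pop,            // returns a message array or throws \RedisException
    callable $dropConnection, // e.g. fn () => $connection->close()
    callable $sleepMs,        // e.g. fn (int $ms) => \usleep($ms * 1000)
    int $baseMs = 100,
    int $maxMs = 5000,
    int $capShift = 10,
): array {
    $attempts = 0;
    while (true) {
        try {
            return $pop();
        } catch (\RedisException $e) {
            $dropConnection();
            $attempts++;
            $shift = \min($capShift, $attempts - 1);
            $cap = \min($maxMs, $baseMs * (2 ** $shift));
            $sleepMs(\mt_rand(0, $cap)); // full jitter: uniform in [0, cap]
        }
    }
}
```

A loop shaped like this stays alive across an outage of any length; the supervisor is only needed for genuine crashes, not for every network blip.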
Relies on the SWOOLE_HOOK_ALL hook flags set in Adapter/Swoole.php so
usleep yields cooperatively inside coroutines rather than blocking the
reactor.
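The backoff schedule used in both layers can be sketched as a pair of helpers. The constant names below are illustrative, not necessarily the PR's exact identifiers:

```php
<?php

// Sketch of exponential backoff with full jitter as described above.
// Constant names are illustrative; the PR defines its own per layer.
const BACKOFF_BASE_MS   = 100;  // first retry waits up to ~100 ms
const BACKOFF_MAX_MS    = 5000; // wall-clock cap per sleep (broker path)
const BACKOFF_CAP_SHIFT = 10;   // bounds the exponent in 2 ** $shift

/** Upper bound (ms) for the sleep before 1-based attempt $attempt. */
function backoffCapMs(int $attempt): int
{
    $shift = \min(BACKOFF_CAP_SHIFT, $attempt - 1);
    return \min(BACKOFF_MAX_MS, BACKOFF_BASE_MS * (2 ** $shift));
}

/** Full jitter: pick a uniformly random sleep in [0, cap]. */
function jitteredBackoffMs(int $attempt): int
{
    return \mt_rand(0, backoffCapMs($attempt));
}
```

With these numbers the cap doubles per attempt (100, 200, 400, ...) and saturates at 5 000 ms from attempt 7 onward; drawing uniformly from [0, cap] spreads reconnecting workers over time instead of letting a whole fleet retry in lockstep.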
Greptile Summary
This PR hardens the Redis broker and connection adapters to survive transient Redis outages by adding exponential backoff with full jitter in both the connection layer and the broker's consume() loop.
Confidence Score: 4/5
Safe to merge for standalone Redis users; blocked for RedisCluster users until the exception type is widened. One P1 finding: the catch type in Broker/Redis.php doesn't cover \RedisClusterException, so the primary goal of this PR (surviving outages) is only partially achieved. The connection-layer changes are solid.
Important Files Changed:
- src/Queue/Broker/Redis.php: the catch clause needs to be widened to include \RedisClusterException.
Reviews (1): Last reviewed commit: "feat(redis): survive transient Redis out..."
    try {
        $nextMessage = $this->connection->rightPopArray("{$queue->namespace}.queue.{$queue->name}", self::POP_TIMEOUT);
        $reconnectAttempts = 0;
    } catch (\RedisException $e) {
RedisCluster outages not caught
The broker catches \RedisException, but when the connection layer is Connection/RedisCluster, phpredis throws \RedisClusterException (a sibling of \RedisException under RuntimeException, not a subclass). Any cluster outage during brPop will propagate uncaught and crash the worker just as before this PR — the new reconnect logic silently doesn't apply to the cluster path.
To cover both adapters, widen the catch:
    - } catch (\RedisException $e) {
    + } catch (\RedisException|\RedisClusterException $e) {
    try {
        $redis->connect($this->host, $this->port, $connectTimeout);

        if ($this->readTimeout >= 0) {
            $redis->setOption(\Redis::OPT_READ_TIMEOUT, $this->readTimeout);
        }

        $this->redis = $redis;
        return $this->redis;
Leaked socket if setOption() throws
$redis has an active TCP connection after connect() succeeds. If setOption() throws a \RedisException, the exception is caught by the retry loop, a backoff sleep is applied, and a brand-new $redis is created — but the successfully-connected socket from the previous attempt is never closed. Each failed setOption() attempt leaks a file descriptor. While setOption rarely throws, wrapping the entire attempt in a try/finally to call $redis->close() on exception would be the safe pattern.
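The suggested try/close-on-failure pattern can be shown as a generic helper. withCleanupOnFailure and its callables are hypothetical names, and the retry loop that would surround it is elided:

```php
<?php

// Hypothetical helper illustrating the review's suggestion: if setup
// throws after a resource is acquired, release it (best-effort) before
// rethrowing, so each failed attempt does not leak a file descriptor.
function withCleanupOnFailure(callable $setup, callable $cleanup): void
{
    try {
        $setup();
    } catch (\Throwable $e) {
        try {
            $cleanup(); // e.g. $redis->close() on the half-initialized client
        } catch (\Throwable) {
            // best-effort: never mask the original exception
        }
        throw $e;
    }
}

// In the connection layer this would wrap the post-connect setup, e.g.:
//   withCleanupOnFailure(
//       fn () => $redis->setOption(\Redis::OPT_READ_TIMEOUT, $t),
//       fn () => $redis->close(),
//   );
```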
    $shift = \min(self::RECONNECT_BACKOFF_CAP_SHIFT, $reconnectAttempts - 1);
    $backoffMs = \min(
        self::RECONNECT_MAX_BACKOFF_MS,
        self::RECONNECT_BASE_BACKOFF_MS * (2 ** $shift),
    );
RECONNECT_BACKOFF_CAP_SHIFT doesn't prevent integer overflow
The constant is described as guarding against integer overflow, but RECONNECT_MAX_BACKOFF_MS (5 000 ms) is always the binding cap — 2 ** 10 × 100 = 102 400 ms never takes effect in practice because min(5000, 102400) = 5000. The constant is harmless but the overflow-prevention justification is misleading; consider updating the comment to clarify it simply limits the exponent for readability.
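A quick arithmetic check of that claim, with the constants inlined as plain variables:

```php
<?php

// Shows that the 5 000 ms wall-clock cap always binds before the shift
// cap: the uncapped exponential term already exceeds 5 000 ms at
// attempt 7, long before the shift cap would bite at attempt 11.
$baseMs   = 100;  // RECONNECT_BASE_BACKOFF_MS
$maxMs    = 5000; // RECONNECT_MAX_BACKOFF_MS
$capShift = 10;   // RECONNECT_BACKOFF_CAP_SHIFT

$uncapped  = $baseMs * (2 ** $capShift); // 102 400 ms at the shift cap
$effective = \min($maxMs, $uncapped);    // the ms cap wins: 5 000 ms

// First attempt where the exponential term exceeds the ms cap:
$attempt = 1;
while ($baseMs * (2 ** ($attempt - 1)) <= $maxMs) {
    $attempt++;
}
// 100 * 2**6 = 6 400 > 5 000, so the ms cap binds from attempt 7 on.
```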
Summary
Harden the Redis broker and connection adapters so workers survive transient Redis outages (DNS flaps, failover, restarts, brief network partitions) instead of crash-looping under a supervisor.
Connection layer (Connection/Redis.php, Connection/RedisCluster.php): lazy getRedis() now retries up to 5 attempts with exponential backoff + full jitter (100 ms base, 3 s cap) before throwing. close() is best-effort and swallows Throwable so a dead socket doesn't mask the original error.
Broker (Broker/Redis.php): consume() catches RedisException raised by the blocking pop, drops the stale connection, applies exponential backoff with full jitter (100 ms base, 5 s cap), and continues. The attempt counter resets on the first successful pop so each outage starts from a fresh backoff curve.
Motivation
Before this change, a single RedisException during brPop would bubble out of consume() and kill the worker process. Any transient Redis issue (failover, restart, brief network partition) caused the whole fleet to crash simultaneously and rely on the process supervisor to restart them, which re-opens every connection at the same instant and creates a thundering herd on the recovering Redis.
Similarly, getRedis() opened a single socket with no retry, so a one-off DNS or TCP hiccup during boot would surface as an unrecoverable failure to the caller.
What changed
src/Queue/Connection/Redis.php:
- New CONNECT_MAX_ATTEMPTS (5), CONNECT_BASE_BACKOFF_MS (100), CONNECT_MAX_BACKOFF_MS (3 000) constants.
- getRedis() wraps new \Redis() + connect() + setOption() in a retry loop. On failure it re-throws the original \RedisException from the final attempt (no null-throw; PHPStan-clean).
- close() wraps $this->redis?->close() in try/finally and swallows \Throwable. The socket may already be dead at this point.
src/Queue/Connection/RedisCluster.php:
- Same close() hardening, same loop around new \RedisCluster(...), catching \RedisClusterException.
src/Queue/Broker/Redis.php:
- New RECONNECT_BASE_BACKOFF_MS (100), RECONNECT_MAX_BACKOFF_MS (5 000), RECONNECT_BACKOFF_CAP_SHIFT (10) constants.
- consume() catches \RedisException from the blocking pop. If the broker was closed, it exits cleanly. Otherwise it drops the stale connection via $this->connection->close(), sleeps for mt_rand(0, backoffMs) (full jitter), and continues the loop. MAX_BACKOFF_MS caps the wall-clock sleep independently.
Swoole considerations
The usleep() calls cooperate with the Swoole reactor because src/Queue/Adapter/Swoole.php:37 sets SWOOLE_HOOK_ALL, which hooks usleep to Coroutine::sleep. If that flag is ever narrowed, these sleeps will block the reactor; worth a note for future maintenance.
Design notes
- No cap on total reconnect attempts in consume(): a worker should stay alive across arbitrarily long outages. Operators rely on closed=true from close() to end the loop.
- The broker catches \RedisException directly, which is an abstraction leak (the broker depends on phpredis exception types) but consistent with the file's name and existing coupling. Translating to a neutral ConnectionException is out of scope here.
Test plan
- Unit: a failing connection is retried CONNECT_MAX_ATTEMPTS times, then throws the final \RedisException (verify no null throw).
- Unit: the first backoff falls in [0, 100] ms; capped sleeps never exceed MAX_BACKOFF_MS.
- Manual: stop Redis (docker compose stop redis) while a worker is idle in brPop; verify the worker logs no crash, recovers on docker compose start redis, and resumes consuming.
- Static analysis: vendor/bin/phpstan analyse --memory-limit=1G reports zero errors.
- The Connection\Redis constructor accepts $user/$password but getRedis() never calls auth(). Pre-existing bug; worth a separate PR.
- Reconnect metrics via utopia-php/telemetry.
- A neutral ConnectionException so the broker stops depending on \RedisException directly.
- Promote CONNECT_MAX_ATTEMPTS etc. to constructor parameters if operators want to tune them per-deployment.