
feat(redis): survive transient Redis outages with bounded reconnects#76

Open
ChiragAgg5k wants to merge 1 commit into main from feat/redis-resilience-retries

Conversation

@ChiragAgg5k

Summary

Harden the Redis broker and connection adapters so workers survive transient Redis outages (DNS flaps, failover, restarts, brief network partitions) instead of crash-looping under a supervisor.

  • Connection layer (Connection/Redis.php, Connection/RedisCluster.php): lazy getRedis() now retries up to 5 attempts with exponential backoff + full jitter (100 ms base, 3 s cap) before throwing. close() is best-effort — swallows Throwable so a dead socket doesn't mask the original error.
  • Broker (Broker/Redis.php): consume() catches RedisException raised by the blocking pop, drops the stale connection, applies exponential backoff with full jitter (100 ms base, 5 s cap), and continues. The attempt counter resets on the first successful pop so each outage starts from a fresh backoff curve.
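
Both layers follow the same full-jitter recipe; a minimal sketch of the computation (illustrative function name and parameters — the actual constants live in the files listed below):

```php
<?php
// Minimal sketch of the full-jitter backoff described above. The function
// name and parameters are illustrative, not the PR's actual code.
function fullJitterBackoffMs(int $attempt, int $baseMs, int $capMs): int
{
    // Exponential growth from the base, capped so the ceiling stays bounded.
    $ceilingMs = \min($capMs, $baseMs * (2 ** ($attempt - 1)));

    // Full jitter: pick a uniform sleep anywhere in [0, ceiling].
    return \mt_rand(0, $ceilingMs);
}

// Attempt 1 with a 100 ms base sleeps somewhere in [0, 100] ms; by attempt 6
// the 3 s connection-layer cap binds, since min(3000, 100 * 2 ** 5) = 3000.
$sleepMs = fullJitterBackoffMs(1, 100, 3000);
```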

Motivation

Before this change, a single RedisException during brPop would bubble out of consume() and kill the worker process. Any transient Redis issue — failover, restart, brief network partition — caused the whole fleet to crash simultaneously and rely on the process supervisor to restart them, which re-opens every connection at the same instant and creates a thundering herd on the recovering Redis.

Similarly, getRedis() opened a single socket with no retry, so a one-off DNS or TCP hiccup during boot would surface as an unrecoverable failure to the caller.

What changed

src/Queue/Connection/Redis.php

  • Added CONNECT_MAX_ATTEMPTS (5), CONNECT_BASE_BACKOFF_MS (100), CONNECT_MAX_BACKOFF_MS (3000) constants.
  • getRedis() wraps new \Redis() + connect() + setOption() in a retry loop. On failure it re-throws the original \RedisException from the final attempt (no null-throw; PHPStan-clean).
  • close() wraps $this->redis?->close() in try/finally and swallows \Throwable. The socket may already be dead at this point.
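
The shape of this retry loop, sketched under the constants above (a hypothetical simplification of the actual method — it omits setOption() handling and auth):

```php
<?php
// Illustrative sketch of the getRedis() retry loop described above;
// simplified from the PR's code (no setOption(), no cluster handling).
function connectWithRetry(string $host, int $port, float $timeout): \Redis
{
    $maxAttempts = 5;  // CONNECT_MAX_ATTEMPTS
    $baseMs = 100;     // CONNECT_BASE_BACKOFF_MS
    $capMs = 3000;     // CONNECT_MAX_BACKOFF_MS

    for ($attempt = 1; ; $attempt++) {
        try {
            $redis = new \Redis();
            $redis->connect($host, $port, $timeout);
            return $redis;
        } catch (\RedisException $e) {
            if ($attempt >= $maxAttempts) {
                // Re-throw the final attempt's exception: no null-throw,
                // and the caller sees the real connect failure.
                throw $e;
            }
            $ceilingMs = \min($capMs, $baseMs * (2 ** ($attempt - 1)));
            \usleep(\mt_rand(0, $ceilingMs) * 1000); // full jitter
        }
    }
}
```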

src/Queue/Connection/RedisCluster.php

  • Same retry constants, same close() hardening, same loop around new \RedisCluster(...), catching \RedisClusterException.

src/Queue/Broker/Redis.php

  • Added RECONNECT_BASE_BACKOFF_MS (100), RECONNECT_MAX_BACKOFF_MS (5000), RECONNECT_BACKOFF_CAP_SHIFT (10) constants.
  • consume() catches \RedisException from the blocking pop. If the broker was closed, it exits cleanly. Otherwise it drops the stale connection via $this->connection->close(), sleeps for mt_rand(0, backoffMs) (full jitter), and continues the loop.
  • Shift cap prevents integer overflow on very long-running processes; MAX_BACKOFF_MS caps the wall-clock sleep independently.
  • Attempt counter resets to 0 after any successful pop.
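
Putting the bullets together, the recovery path sketches out roughly as follows (names mirror the description above, but this is illustrative code, not the PR's diff):

```php
<?php
// Illustrative consume() recovery loop; $connection, $isClosed, and the
// queue key stand in for the broker's real state.
function consumeLoop($connection, callable $handle, callable $isClosed, string $key): void
{
    $reconnectAttempts = 0;
    while (true) {
        try {
            $message = $connection->rightPopArray($key, 5);
            $reconnectAttempts = 0; // fresh backoff curve after a good pop
            if ($message !== null) {
                $handle($message);
            }
        } catch (\RedisException $e) {
            if ($isClosed()) {
                return; // deliberate close(): exit cleanly
            }
            $connection->close(); // drop the stale socket
            $reconnectAttempts++;
            $shift = \min(10, $reconnectAttempts - 1);    // RECONNECT_BACKOFF_CAP_SHIFT
            $ceilingMs = \min(5000, 100 * (2 ** $shift)); // MAX vs. BASE * 2**shift
            \usleep(\mt_rand(0, $ceilingMs) * 1000);      // full jitter
        }
    }
}
```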

Swoole considerations

The usleep() calls cooperate with the Swoole reactor because src/Queue/Adapter/Swoole.php:37 sets SWOOLE_HOOK_ALL, which hooks usleep to Coroutine::sleep. If that flag is ever narrowed, these sleeps will block the reactor — worth a note for future maintenance.

Design notes

  • Full jitter is used (rather than equal or decorrelated jitter) because the realistic failure mode is all workers losing the connection simultaneously; full jitter produces the flattest arrival distribution on recovery.
  • The broker retries without bound. There is no max-attempt ceiling in consume() — a worker should stay alive across arbitrarily long outages. Operators rely on closed=true from close() to end the loop.
  • Caught type is \RedisException in the broker, which is an abstraction leak (broker depends on phpredis exception types) but consistent with the file's name and existing coupling. Translating to a neutral ConnectionException is out of scope here.

Test plan

  • Unit: connection retries exactly CONNECT_MAX_ATTEMPTS times, then throws the final \RedisException (verify no null throw).
  • Unit: backoff bounds — first attempt sleep is in [0, 100] ms, capped sleeps never exceed MAX_BACKOFF_MS.
  • Integration: kill Redis (docker compose stop redis) while a worker is idle in brPop; verify the worker logs no crash, recovers on docker compose start redis, and resumes consuming.
  • Integration: kill Redis mid-message handling; verify the in-flight message completes or is moved to the failed list without losing the worker.
  • Integration: confirm with multiple workers that reconnect times are spread out (full-jitter sanity check), not all at t+0.
  • Static: vendor/bin/phpstan analyse --memory-limit=1G reports zero errors.
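
The backoff-bounds unit test could be sketched like this (hypothetical: it replicates the backoff computation rather than calling the broker's private logic):

```php
<?php
// Hypothetical sketch of the backoff-bounds check: replicate the
// computation and assert the documented bounds hold over many samples.
function backoffCeilingMs(int $attempt, int $baseMs, int $capMs): int
{
    return \min($capMs, $baseMs * (2 ** \min(10, $attempt - 1)));
}

for ($i = 0; $i < 1000; $i++) {
    $first = \mt_rand(0, backoffCeilingMs(1, 100, 5000));
    assert($first >= 0 && $first <= 100); // first sleep in [0, 100] ms

    $late = \mt_rand(0, backoffCeilingMs(50, 100, 5000));
    assert($late <= 5000);                // never exceeds MAX_BACKOFF_MS
}
```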

Out of scope / follow-ups

  • Connection\Redis constructor accepts $user/$password but getRedis() never calls auth(). Pre-existing bug; worth a separate PR.
  • Telemetry hook or log on reconnect — operators currently get no signal when a worker enters the retry loop. Candidate for a follow-up using utopia-php/telemetry.
  • Translate phpredis exceptions into a driver-neutral ConnectionException so the broker stops depending on \RedisException directly.
  • Promote CONNECT_MAX_ATTEMPTS etc. to constructor parameters if operators want to tune them per-deployment.

The broker's consume() loop previously rethrew any RedisException raised
during the blocking pop, crashing the worker on every transient network
blip. The connection layer also opened a brand-new socket on the first
call with no retry, so a single DNS or TCP hiccup during boot would take
the process down.

Connection layer (Redis, RedisCluster):
  - getRedis() now retries up to 5 attempts with exponential backoff
    (100ms base, 3s cap) and full jitter to avoid thundering herd on
    recovery.
  - close() is best-effort and swallows Throwable so a dead socket
    doesn't mask the original error.

Broker (Redis):
  - On RedisException during pop, drop the stale connection and retry
    with exponential backoff (100ms base, 5s cap, full jitter). Worker
    stays alive across outages instead of crash-looping under a
    supervisor.
  - Attempt counter resets on the first successful pop so each outage
    starts from a fresh backoff.

Relies on the SWOOLE_HOOK_ALL hook flags set in Adapter/Swoole.php so
usleep yields cooperatively inside coroutines rather than blocking the
reactor.
@greptile-apps

greptile-apps Bot commented Apr 22, 2026

Greptile Summary

This PR hardens the Redis broker and connection adapters to survive transient Redis outages by adding exponential backoff with full jitter in getRedis() (connection layer) and consume() (broker layer), and making close() best-effort. The connection-layer changes are clean; however, the broker's reconnect logic catches only \RedisException, leaving RedisCluster users unprotected since phpredis throws the sibling type \RedisClusterException from cluster operations.

  • P1: Broker/Redis.php catch (\RedisException) misses \RedisClusterException — workers backed by Connection/RedisCluster will still crash on cluster outages despite this PR.

Confidence Score: 4/5

Safe to merge for standalone Redis users; blocks for RedisCluster users until the exception type is widened.

One P1 finding: the catch type in Broker/Redis.php doesn't cover \RedisClusterException, so the primary goal of this PR (surviving outages) is only partially achieved. The connection-layer changes are solid.

src/Queue/Broker/Redis.php — the catch clause needs to be widened to include \RedisClusterException.

Important Files Changed

  • src/Queue/Broker/Redis.php — Adds reconnect loop with exponential backoff in consume(); the catch is \RedisException only, so RedisCluster outages (which throw \RedisClusterException) remain unhandled and will still crash the worker.
  • src/Queue/Connection/Redis.php — Retry loop in getRedis() and best-effort close() are correct; minor socket-leak risk if setOption() throws after a successful connect().
  • src/Queue/Connection/RedisCluster.php — Retry logic mirrors Connection/Redis.php and correctly catches \RedisClusterException; best-effort close() hardening is consistent.

Reviews (1): Last reviewed commit: "feat(redis): survive transient Redis out..."

try {
    $nextMessage = $this->connection->rightPopArray("{$queue->namespace}.queue.{$queue->name}", self::POP_TIMEOUT);
    $reconnectAttempts = 0;
} catch (\RedisException $e) {


P1 RedisCluster outages not caught

The broker catches \RedisException, but when the connection layer is Connection/RedisCluster, phpredis throws \RedisClusterException (a sibling of \RedisException under RuntimeException, not a subclass). Any cluster outage during brPop will propagate uncaught and crash the worker just as before this PR — the new reconnect logic silently doesn't apply to the cluster path.

To cover both adapters, widen the catch:

Suggested change
} catch (\RedisException $e) {
} catch (\RedisException|\RedisClusterException $e) {

Comment on lines +205 to +213
try {
    $redis->connect($this->host, $this->port, $connectTimeout);

    if ($this->readTimeout >= 0) {
        $redis->setOption(\Redis::OPT_READ_TIMEOUT, $this->readTimeout);
    }

    $this->redis = $redis;
    return $this->redis;


P2 Leaked socket if setOption() throws

$redis has an active TCP connection after connect() succeeds. If setOption() throws a \RedisException, the exception is caught by the retry loop, a backoff sleep is applied, and a brand-new $redis is created — but the successfully-connected socket from the previous attempt is never closed. Each failed setOption() attempt leaks a file descriptor. While setOption rarely throws, wrapping the entire attempt in a try/finally to call $redis->close() on exception would be the safe pattern.

Comment on lines +46 to +50
$shift = \min(self::RECONNECT_BACKOFF_CAP_SHIFT, $reconnectAttempts - 1);
$backoffMs = \min(
    self::RECONNECT_MAX_BACKOFF_MS,
    self::RECONNECT_BASE_BACKOFF_MS * (2 ** $shift),
);


P2 RECONNECT_BACKOFF_CAP_SHIFT doesn't prevent integer overflow

The constant is described as guarding against integer overflow, but RECONNECT_MAX_BACKOFF_MS (5 000 ms) is always the binding cap — 2 ** 10 × 100 = 102 400 ms never takes effect in practice because min(5000, 102400) = 5000. The constant is harmless but the overflow-prevention justification is misleading; consider updating the comment to clarify it simply limits the exponent for readability.
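
The arithmetic is easy to check (constants taken from the description: base 100 ms, max 5 000 ms, shift cap 10):

```php
<?php
// Check which cap binds: with BASE=100 and MAX=5000, the wall-clock cap
// always wins before the shift cap ever matters.
$base = 100; $max = 5000; $capShift = 10;

$atCapShift = \min($max, $base * (2 ** $capShift)); // min(5000, 102400) = 5000

// The MAX cap already binds at shift 6: 100 * 2**6 = 6400 > 5000,
// while 100 * 2**5 = 3200 < 5000.
```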
