fix(api_call): protect node rotation and health state with a mutex #59

Open
erimicel wants to merge 1 commit into typesense:master from OLIOEX:thread-safe-node-rotation

@erimicel (Contributor) commented Apr 27, 2026

Problem

Typesense::ApiCall keeps its rotation cursor (@current_node_index) and per-node health metadata (node[:is_healthy], node[:last_access_timestamp]) as plain unsynchronised instance state. When several threads share a single Typesense::Client — the standard pattern in Puma, Sidekiq, etc. — two concurrent calls to next_node can compute the same incremented index from the same starting value, causing the rotation to skip a node. A failing request can also flip is_healthy to false while another thread is mid-way through reading the pair, observing a stale last_access_timestamp with the new health flag.

The race window is narrow on MRI thanks to the GVL, but it is real and becomes more likely under load. Keep-alive support, added recently, routes more state through the same path, so it is worth tightening up.
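To make the lost update concrete, here is a minimal sketch that replays one bad interleaving by hand instead of relying on real thread timing. The method name `simulate_lost_update` and the node labels are hypothetical and are not part of the library; the sketch only assumes the cursor is advanced with a read, an increment modulo the node count, and a write.

```ruby
# Hypothetical illustration, not Typesense::ApiCall code.
# Replays a single interleaving deterministically: both callers read the
# cursor before either one writes the incremented value back.
def simulate_lost_update(nodes)
  current_node_index = 0

  # Step 1: "thread A" and "thread B" both read the same starting value.
  read_by_a = current_node_index
  read_by_b = current_node_index

  # Step 2: each computes the next index from its own (now stale) read.
  current_node_index = (read_by_a + 1) % nodes.length
  picked_by_a = nodes[current_node_index]

  current_node_index = (read_by_b + 1) % nodes.length
  picked_by_b = nodes[current_node_index]

  { a: picked_by_a, b: picked_by_b, cursor: current_node_index }
end

result = simulate_lost_update(%w[node0 node1 node2])
puts result.inspect
```

In this replay both callers end up on `node1` and the cursor has advanced only once for two calls, so the rotation is no longer even.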

Fix

Add a single per-instance Mutex and use it to:

  • protect the round-robin loop in next_node (only the cursor advance + the healthy/due check) so only one thread mutates @current_node_index at a time;
  • atomically pair the is_healthy + last_access_timestamp writes in set_node_healthcheck.

The lock is acquired briefly — no I/O inside it — so contention is negligible.
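The shape of the change can be sketched roughly as follows. This is a simplified, hypothetical model rather than the actual diff: the class name `RotatingNodes`, its constructor, the fallback when no node qualifies, and the `healthcheck_interval_seconds` threshold are assumptions made for illustration. Only the two method names, the instance variable, and the node hash keys come from the description above.

```ruby
# Hypothetical, simplified model of the locking pattern described above.
# It is not the Typesense::ApiCall implementation.
class RotatingNodes
  def initialize(nodes, healthcheck_interval_seconds: 15)
    @nodes = nodes.map do |node|
      node.merge(is_healthy: true, last_access_timestamp: Time.now.to_i)
    end
    @healthcheck_interval_seconds = healthcheck_interval_seconds
    @current_node_index = -1
    @mutex = Mutex.new # one lock per instance
  end

  # Advances the cursor and checks candidates under the lock, so only one
  # thread mutates @current_node_index at a time. No I/O happens in here.
  def next_node
    @mutex.synchronize do
      @nodes.length.times do
        @current_node_index = (@current_node_index + 1) % @nodes.length
        candidate = @nodes[@current_node_index]
        return candidate if candidate[:is_healthy] || due_for_healthcheck?(candidate)
      end
      # Assumed fallback: no node qualified, so return the one under the cursor.
      @nodes[@current_node_index]
    end
  end

  # Writes the health flag and its timestamp as one unit under the same lock.
  def set_node_healthcheck(node, is_healthy:)
    @mutex.synchronize do
      node[:is_healthy] = is_healthy
      node[:last_access_timestamp] = Time.now.to_i
    end
  end

  private

  def due_for_healthcheck?(node)
    Time.now.to_i - node[:last_access_timestamp] > @healthcheck_interval_seconds
  end
end

rotation = RotatingNodes.new([{ host: "a" }, { host: "b" }, { host: "c" }])
puts 3.times.map { rotation.next_node[:host] }.inspect
```

The early `return` inside `synchronize` still releases the lock, because `Mutex#synchronize` unlocks when the block exits for any reason.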

Single-node deployments

The fix is safe and beneficial for single-node setups too: the rotation race is degenerate there ((idx + 1) % 1 == 0), but the health-state pair still benefits from atomic writes when a request fails and sets is_healthy: false.

Tests

Adds two regression specs in spec/typesense/api_call_spec.rb:

  • multi-node: 16 threads × 90 iterations across 3 nodes; asserts each node is selected exactly 480 times.
  • single-node: 8 threads concurrently flipping is_healthy and calling next_node; asserts the final state is internally consistent.
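The multi-node assertion follows from simple arithmetic: 16 × 90 = 1440 calls over 3 nodes gives 480 per node, provided every cursor advance is serialized. A rough standalone sketch of that check is below. It uses a hypothetical `LockedCursor` class and `tally_concurrent_picks` helper in place of the real client and RSpec setup, so it illustrates the idea rather than reproducing the specs in this PR.

```ruby
# Hypothetical minimal rotator standing in for the client under test.
class LockedCursor
  def initialize(size)
    @size = size
    @index = -1
    @mutex = Mutex.new
  end

  # Read, increment and write happen as one step under the lock.
  def next_index
    @mutex.synchronize { @index = (@index + 1) % @size }
  end
end

# Runs `threads` threads, each calling next_index `iterations` times,
# and counts how often each index was handed out.
def tally_concurrent_picks(threads:, iterations:, nodes:)
  cursor = LockedCursor.new(nodes)
  workers = Array.new(threads) do
    Thread.new { Array.new(iterations) { cursor.next_index } }
  end
  workers.map(&:value).flatten.tally
end

counts = tally_concurrent_picks(threads: 16, iterations: 90, nodes: 3)
puts counts.sort.to_h.inspect
```

With the lock in place, each of the three indices is counted 480 times regardless of how the threads are scheduled.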

bundle exec rspec spec/typesense/api_call_spec.rb — 44 examples, 0 failures.
bundle exec rubocop lib/typesense/api_call.rb spec/typesense/api_call_spec.rb — clean.

PR Checklist

`@current_node_index` and per-node `is_healthy`/`last_access_timestamp`
were unsynchronised. With multiple threads sharing one client (Puma,
Sidekiq), two threads in `next_node` could compute the same incremented
index from the same starting value (skipping a node), and a healthcheck
write could be observed mid-update with a stale timestamp.

Wraps the round-robin loop in `next_node` and the writes in
`set_node_healthcheck` in a single per-instance Mutex. The mutex is
acquired briefly (no I/O inside it), so contention is negligible. Adds
regression specs covering concurrent multi-node rotation and concurrent
single-node health updates.

@tharropoulos (Contributor) commented

Test coverage looks good for the regression intent, though the exact even distribution is proving serialized round-robin behavior more than general thread safety.
