Fix data race on should_stop_ flag in LLM runner#18652
kirklandsign wants to merge 2 commits into main from …
Conversation
should_stop_ is written from the caller thread via stop() and read from the inference thread in the generate loop. A plain bool without synchronization is undefined behavior per the C++ standard: the compiler may legally hoist the read out of the loop, and weakly-ordered targets such as ARM provide no visibility guarantees for plain stores, so a stop request might never be observed. Change bool to std::atomic<bool> with relaxed memory ordering, which is sufficient for a simple cancellation flag and has negligible overhead.
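The pattern described above can be sketched as follows. Class and member names here are illustrative placeholders, not the actual ExecuTorch types, and the loop body stands in for a real model step:

```cpp
#include <atomic>

// Minimal sketch of the relaxed-atomic cancellation-flag pattern.
class Generator {
 public:
  // Called from the caller thread. A relaxed store is enough because the
  // flag carries no other data that the reader must observe with it.
  void stop() { should_stop_.store(true, std::memory_order_relaxed); }

  // Called on the inference thread; returns how many tokens were produced.
  int generate(int max_new_tokens) {
    int produced = 0;
    for (int i = 0; i < max_new_tokens; ++i) {
      if (should_stop_.load(std::memory_order_relaxed)) {
        break;  // honor a concurrent stop() request
      }
      ++produced;  // stand-in for one model step + sampling
    }
    return produced;
  }

 private:
  std::atomic<bool> should_stop_{false};
};
```

With an atomic flag the load cannot be hoisted out of the loop, so a stop() from another thread eventually terminates generation.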
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18652
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit d03f001 with merge base b24535b.
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Pull request overview
Fixes a cross-thread cancellation data race in the LLM text generation path by making stop flags atomic, improving correctness on weakly-ordered architectures (e.g., ARM).
Changes:
- Convert `TextTokenGenerator::should_stop_` to `std::atomic<bool>` and use relaxed loads/stores in the generate loop and `stop()`.
- Convert `TextDecoderRunner::should_stop_` to `std::atomic<bool>` and update `stop()` to use an atomic store.
- Convert `TextLLMRunner::shouldStop_` to `std::atomic<bool>` (though it currently appears unused).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| extension/llm/runner/text_token_generator.h | Makes the generation cancellation flag atomic with relaxed load/store usage. |
| extension/llm/runner/text_llm_runner.h | Changes runner-level stop flag to atomic (but currently not wired into behavior). |
| extension/llm/runner/text_decoder_runner.h | Changes decoder-level stop flag to atomic; stop() now does an atomic store. |
```diff
  private:
-  bool shouldStop_{false};
+  std::atomic<bool> shouldStop_{false};
```
shouldStop_ was changed to std::atomic<bool>, but it’s not referenced anywhere else in TextLLMRunner (stop() delegates to text_token_generator_->stop() and generate() doesn’t check shouldStop_). This adds an unused state flag and <atomic> dependency without affecting cancellation behavior. Either remove shouldStop_ entirely, or wire it into stop() / generate() (with consistent memory ordering) so it actually controls generation cancellation at the runner level.
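One way the wire-in option could look is sketched below; the runner class here is a simplified stand-in for TextLLMRunner, and the delegation to the token generator is elided:

```cpp
#include <atomic>

// Hypothetical sketch of wiring the runner-level flag into stop(), as the
// review suggests. Not the real TextLLMRunner interface.
class Runner {
 public:
  void stop() {
    shouldStop_.store(true, std::memory_order_relaxed);
    // ...would also delegate: text_token_generator_->stop();
  }

  // generate() would poll this between steps to cancel at the runner level.
  bool should_stop() const {
    return shouldStop_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<bool> shouldStop_{false};
};
```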
```diff
-    should_stop_ = true;
+    should_stop_.store(true, std::memory_order_relaxed);
   }
```
TextDecoderRunner::stop() sets should_stop_, but should_stop_ is never read in TextDecoderRunner::step() (or elsewhere), so calling stop() currently has no effect on execution/cancellation. Either remove this stop flag/API if it’s not meant to be supported, or add a check (e.g., early-return with a cancelled/aborted error) in step() (and/or in the outer loop) that observes should_stop_.
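The early-return option from this comment could be sketched as follows. The class and the integer error code are hypothetical placeholders, not the real executorch Result/Error API:

```cpp
#include <atomic>

// Sketch of having step() observe the stop flag and bail out early.
class Decoder {
 public:
  void stop() { should_stop_.store(true, std::memory_order_relaxed); }

  // Returns -1 as a stand-in for a "cancelled" error; 0 for success.
  int step() {
    if (should_stop_.load(std::memory_order_relaxed)) {
      return -1;  // early-return instead of running the model
    }
    return 0;  // stand-in for a successful decode step
  }

 private:
  std::atomic<bool> should_stop_{false};
};
```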
```cpp
inline bool is_stopped() const {
  return should_stop_.load(std::memory_order_relaxed);
}
```
…nner

These flags were written but never read — cancellation is handled entirely through `TextTokenGenerator::should_stop_`. Removing dead state and the unnecessary `<atomic>` dependency from these two classes.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
```diff
       token_data.data(), token_shape, executorch::aten::ScalarType::Long);

-  should_stop_ = false;
+  should_stop_.store(false, std::memory_order_relaxed);
```
should_stop_ is reset to false inside generate(). If stop() is called from another thread very early (e.g., immediately after generation starts), the subsequent store(false, ...) here can overwrite that stop request, making cancellation unreliable for that run. Consider resetting the flag before the operation becomes externally stoppable (or track cancellation via a generation id / use exchange with a protocol that can’t lose a concurrent stop request).
Suggested change:
```diff
-  should_stop_.store(false, std::memory_order_relaxed);
+  // Clear any stale stop request from a previous run without losing a
+  // concurrent early stop for this run. If a stop was already requested,
+  // honor it immediately for this generation call.
+  if (should_stop_.exchange(false, std::memory_order_relaxed)) {
+    return 0;
+  }
```
```diff
   inline void stop() {
-    should_stop_ = true;
+    should_stop_.store(true, std::memory_order_relaxed);
   }
```
There are existing unit tests for the runner/token generation path (e.g., test_text_llm_runner.cpp), but none appear to cover calling stop() concurrently with generate() to validate cancellation behavior and prevent regressions of this race fix. Adding a focused test (potentially in Python bindings where the GIL is released) would better exercise the cross-thread stop path.
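A focused test along the lines this comment proposes might look like the sketch below. The generator here is a stand-in, not the real TextTokenGenerator, and the step threshold is an arbitrary illustrative choice:

```cpp
#include <atomic>
#include <thread>

// Stand-in for a token generator with an atomic cancellation flag.
struct FakeGenerator {
  std::atomic<bool> should_stop{false};
  std::atomic<long> steps{0};

  // Spins like a generate loop until a stop request is observed.
  void generate() {
    while (!should_stop.load(std::memory_order_relaxed)) {
      steps.fetch_add(1, std::memory_order_relaxed);
    }
  }
};

// Returns true once generate() has terminated after a concurrent stop().
bool run_concurrent_stop_test() {
  FakeGenerator g;
  std::thread inference([&] { g.generate(); });
  // Let the loop make some progress, then cancel from this thread.
  while (g.steps.load(std::memory_order_relaxed) < 1000) {
  }
  g.should_stop.store(true, std::memory_order_relaxed);
  inference.join();  // a lost store would make this hang
  return true;
}
```

Running such a test under ThreadSanitizer would also flag a regression back to a plain bool as a data race.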
```diff
-    should_stop_ = true;
-  }

   /**
```
Removing TextDecoderRunner::stop() is an API-breaking change for a public header (it’s exported and also has a torch::executor alias below). If downstream code may be calling this, consider keeping the method (even as a deprecated no-op) or providing a migration path rather than deleting it outright.
Suggested change:
```cpp
/**
 * Deprecated compatibility shim for older callers. TextDecoderRunner no
 * longer requires explicit stop behavior, so this method is now a no-op.
 */
[[deprecated(
    "TextDecoderRunner::stop() is deprecated and is now a no-op; remove "
    "calls to this method.")]] virtual void stop() {}
```
@claude Review this PR
Claude finished @mergennachin's task in 6m 57s.

PR Review: Fix data race on should_stop_ flag in LLM runner

Overall: This is a clean, well-scoped fix. The core change (commit 1) correctly addresses a real data race, and the cleanup (commit 2) removes dead code. LGTM with one minor note.

Commit 1:
@claude Explain to me the reasoning behind the race before this PR and how this change actually solves it
Claude finished @mergennachin's task in 5m 20s.

The race before this PR

The core issue is in the generate path. Thread A (inference thread) runs:

```cpp
should_stop_ = false;  // reset at start of generate()
while (pos < start_pos + max_new_tokens) {
  // ... run model step, sample token, decode ...
  if (should_stop_) {  // read should_stop_
    break;
  }
}
```

Thread B (caller/UI thread) calls:

```cpp
void stop() {
  should_stop_ = true;  // write should_stop_
}
```

There is no synchronization between these two threads. Under the C++ memory model ([intro.races]/21), when two threads access the same non-atomic variable and at least one access is a write, that's a data race, and the behavior is undefined — not just "might read a stale value" but fully UB, meaning the compiler and hardware are free to do anything. In practice, the consequences are architecture-dependent:
How
Summary
should_stop_ is written from the caller thread via stop() and read from the inference thread in the generate loop. A plain bool without synchronization is undefined behavior per the C++ standard: the compiler may hoist the read out of the loop, and weakly-ordered targets such as ARM provide no visibility guarantees for plain stores.
Change bool to std::atomic<bool> with relaxed memory ordering, which is sufficient for a simple cancellation flag and has negligible overhead.
Test plan
CI