Fix data race on should_stop_ flag in LLM runner#18652

Open
kirklandsign wants to merge 2 commits into main from android/fix-should-stop-data-race
Conversation

@kirklandsign
Contributor

Summary

should_stop_ is written from the caller thread via stop() and read from the inference thread in the generate loop. Unsynchronized concurrent access to a plain bool is undefined behavior per the C++ standard: the compiler may hoist the read out of the loop, and weakly ordered targets such as ARM give no timely-visibility guarantee for the write.

Change bool to std::atomic&lt;bool&gt; with relaxed memory ordering, which is sufficient for a standalone cancellation flag and has negligible overhead.
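For reference, a minimal sketch of the pattern (simplified stand-in names and a per-token callback for illustration, not the actual TextTokenGenerator API):

```cpp
#include <atomic>
#include <functional>

// Simplified stand-in for the runner's cancellation flag; the real
// TextTokenGenerator's model state and sampling logic are omitted here.
class Generator {
 public:
  // Returns the number of tokens produced before hitting the limit or a stop.
  int generate(int max_new_tokens, const std::function<void(int)>& on_token) {
    should_stop_.store(false, std::memory_order_relaxed);  // reset per run
    int produced = 0;
    for (int i = 0; i < max_new_tokens; ++i) {
      on_token(i);  // stand-in for one decode step + token callback
      ++produced;
      if (should_stop_.load(std::memory_order_relaxed)) {
        break;  // cancellation requested, possibly from another thread
      }
    }
    return produced;
  }

  // Safe to call from any thread; relaxed ordering suffices because the
  // flag carries no ordering dependency on other shared data.
  void stop() { should_stop_.store(true, std::memory_order_relaxed); }

 private:
  std::atomic<bool> should_stop_{false};
};
```

Calling stop() from the token callback ends generation after the current token, which mirrors how a stop() from a UI thread ends it after the in-flight decode step.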

Test plan

CI

@pytorch-bot

pytorch-bot bot commented Apr 1, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18652

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit d03f001 with merge base b24535b:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 1, 2026
@github-actions

github-actions bot commented Apr 1, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@kirklandsign kirklandsign marked this pull request as ready for review April 1, 2026 21:11
Copilot AI review requested due to automatic review settings April 1, 2026 21:11
Contributor

Copilot AI left a comment


Pull request overview

Fixes a cross-thread cancellation data race in the LLM text generation path by making stop flags atomic, improving correctness on weakly-ordered architectures (e.g., ARM).

Changes:

  • Convert TextTokenGenerator::should_stop_ to std::atomic<bool> and use relaxed loads/stores in the generate loop and stop().
  • Convert TextDecoderRunner::should_stop_ to std::atomic<bool> and update stop() to use an atomic store.
  • Convert TextLLMRunner::shouldStop_ to std::atomic<bool> (though it currently appears unused).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Changed files:

  • extension/llm/runner/text_token_generator.h — Makes the generation cancellation flag atomic with relaxed load/store usage.
  • extension/llm/runner/text_llm_runner.h — Changes the runner-level stop flag to atomic (but currently not wired into behavior).
  • extension/llm/runner/text_decoder_runner.h — Changes the decoder-level stop flag to atomic; stop() now does an atomic store.


Comment on lines 164 to 166

 private:
-  bool shouldStop_{false};
+  std::atomic<bool> shouldStop_{false};


Copilot AI Apr 1, 2026


shouldStop_ was changed to std::atomic<bool>, but it’s not referenced anywhere else in TextLLMRunner (stop() delegates to text_token_generator_->stop() and generate() doesn’t check shouldStop_). This adds an unused state flag and <atomic> dependency without affecting cancellation behavior. Either remove shouldStop_ entirely, or wire it into stop() / generate() (with consistent memory ordering) so it actually controls generation cancellation at the runner level.

-  should_stop_ = true;
+  should_stop_.store(true, std::memory_order_relaxed);
 }


Copilot AI Apr 1, 2026


TextDecoderRunner::stop() sets should_stop_, but should_stop_ is never read in TextDecoderRunner::step() (or elsewhere), so calling stop() currently has no effect on execution/cancellation. Either remove this stop flag/API if it’s not meant to be supported, or add a check (e.g., early-return with a cancelled/aborted error) in step() (and/or in the outer loop) that observes should_stop_.

Suggested change

+  inline bool is_stopped() const {
+    return should_stop_.load(std::memory_order_relaxed);
+  }

…nner

These flags were written but never read — cancellation is handled
entirely through TextTokenGenerator::should_stop_. Removing dead state
and the unnecessary <atomic> dependency from these two classes.
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



       token_data.data(), token_shape, executorch::aten::ScalarType::Long);

-  should_stop_ = false;
+  should_stop_.store(false, std::memory_order_relaxed);

Copilot AI Apr 2, 2026


should_stop_ is reset to false inside generate(). If stop() is called from another thread very early (e.g., immediately after generation starts), the subsequent store(false, ...) here can overwrite that stop request, making cancellation unreliable for that run. Consider resetting the flag before the operation becomes externally stoppable (or track cancellation via a generation id / use exchange with a protocol that can’t lose a concurrent stop request).

Suggested change

-  should_stop_.store(false, std::memory_order_relaxed);
+  // Clear any stale stop request from a previous run without losing a
+  // concurrent early stop for this run. If a stop was already requested,
+  // honor it immediately for this generation call.
+  if (should_stop_.exchange(false, std::memory_order_relaxed)) {
+    return 0;
+  }
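A self-contained sketch of this exchange-based reset (a hypothetical minimal class, not the actual runner API) shows how a stop() issued before the loop starts is consumed rather than clobbered:

```cpp
#include <atomic>

// Hypothetical minimal generator illustrating the exchange-based reset.
class Generator {
 public:
  // Returns tokens produced; 0 if a stop request was already pending.
  long generate(long max_new_tokens) {
    // exchange(false) atomically reads-and-clears the flag: a stop that
    // arrived before this point is honored instead of being overwritten
    // by a plain store(false).
    if (should_stop_.exchange(false, std::memory_order_relaxed)) {
      return 0;
    }
    long pos = 0;
    while (pos < max_new_tokens) {
      ++pos;  // stand-in for one decode step
      if (should_stop_.load(std::memory_order_relaxed)) break;
    }
    return pos;
  }

  void stop() { should_stop_.store(true, std::memory_order_relaxed); }

 private:
  std::atomic<bool> should_stop_{false};
};
```

A stop landing between the exchange and the loop is still only observed on the next iteration's load, which is fine since every iteration re-checks the flag.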

Comment on lines 146 to 148

   inline void stop() {
-    should_stop_ = true;
+    should_stop_.store(true, std::memory_order_relaxed);
   }

Copilot AI Apr 2, 2026


There are existing unit tests for the runner/token generation path (e.g., test_text_llm_runner.cpp), but none appear to cover calling stop() concurrently with generate() to validate cancellation behavior and prevent regressions of this race fix. Adding a focused test (potentially in Python bindings where the GIL is released) would better exercise the cross-thread stop path.
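Such a test could be sketched like this, using a stand-in loop in place of the real runner (FakeGenerator and the started_ handshake are test scaffolding, not ExecuTorch API):

```cpp
#include <atomic>
#include <thread>

// Stand-in for TextTokenGenerator::generate(): a long loop that checks the
// stop flag each iteration. started_ lets the test order stop() after the
// per-run reset; release/acquire on it places the reset before the stop in
// the flag's modification order, so the stop request cannot be lost.
struct FakeGenerator {
  std::atomic<bool> should_stop_{false};
  std::atomic<bool> started_{false};

  long generate(long max_new_tokens) {
    should_stop_.store(false, std::memory_order_relaxed);
    started_.store(true, std::memory_order_release);
    long pos = 0;
    while (pos < max_new_tokens) {
      ++pos;  // stand-in for one decode step
      if (should_stop_.load(std::memory_order_relaxed)) break;
    }
    return pos;
  }
  void stop() { should_stop_.store(true, std::memory_order_relaxed); }
};

// Drive generate() on a worker thread and stop() from the calling thread.
long run_concurrent_stop() {
  FakeGenerator g;
  long produced = 0;
  std::thread worker([&] { produced = g.generate(2'000'000'000L); });
  while (!g.started_.load(std::memory_order_acquire)) {
    std::this_thread::yield();  // wait until the loop is actually running
  }
  g.stop();
  worker.join();
  return produced;
}
```

The token budget is made deliberately enormous so that a lost or invisible stop would show up as the loop running to completion rather than being cut short.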

-    should_stop_ = true;
-  }

   /**

Copilot AI Apr 2, 2026


Removing TextDecoderRunner::stop() is an API-breaking change for a public header (it’s exported and also has a torch::executor alias below). If downstream code may be calling this, consider keeping the method (even as a deprecated no-op) or providing a migration path rather than deleting it outright.

Suggested change

-  /**
+  /**
+   * Deprecated compatibility shim for older callers. TextDecoderRunner no
+   * longer requires explicit stop behavior, so this method is now a no-op.
+   */
+  [[deprecated(
+      "TextDecoderRunner::stop() is deprecated and is now a no-op; remove "
+      "calls to this method.")]] virtual void stop() {}
+  /**

@mergennachin
Contributor

@claude Review this PR

@claude

claude bot commented Apr 3, 2026

Claude finished @mergennachin's task in 6m 57s


PR Review: Fix data race on should_stop_ flag in LLM runner

  • Read changed files and understand the diff
  • Check for correctness of atomic usage
  • Review removal of unused flags
  • Check CI failure
  • Post review feedback

Overall: This is a clean, well-scoped fix. The core change (commit 1) correctly addresses a real data race, and the cleanup (commit 2) removes dead code. LGTM with one minor note.

Commit 1: std::atomic<bool> with relaxed ordering

The fix in text_token_generator.h is correct:

  • should_stop_ is written from the caller thread via stop() and read from the inference thread in the generate() loop — a textbook data race on plain bool.
  • std::memory_order_relaxed is the right choice here. The flag is a simple cancellation signal with no ordering dependencies on other shared data. Relaxed atomics guarantee atomicity (no torn reads/writes) and eventual cross-core visibility, which is all that's needed for a stop flag.
  • TextTokenGenerator is always held via std::unique_ptr, so std::atomic<bool> making the class non-copyable/non-movable is not an issue.

Commit 2: Removing unused should_stop_ flags

  • TextDecoderRunner::should_stop_ — was set by stop() but never read anywhere (not in step() or any other method). The flag had no effect on execution. Clean removal.
  • TextLLMRunner::shouldStop_ — was reset in generate() but never checked. TextLLMRunner::stop() already delegates to text_token_generator_->stop(), so this flag was entirely redundant. Clean removal.
  • No external callers of TextDecoderRunner::stop() exist in the codebase (all stop() calls go through the runner-level APIs which delegate to TextTokenGenerator::stop()), so removing it is safe.

Minor note

The ASCII diagram in extension/llm/runner/README.md:532 still shows should_stop_ as a member of TextDecoderRunner and MultimodalDecoderRunner. It would be worth updating the diagram to reflect that should_stop_ now lives solely in TextTokenGenerator. This is cosmetic and doesn't need to block the PR.

CI

The initially reported unittest-editable / linux failure (exit code 56) does not appear related to this change — CI is now fully green (13/13 passing, 1 cancelled).

@mergennachin
Contributor

@claude Explain me the reasoning of the race before this PR and how this actually solves it

@claude

claude bot commented Apr 3, 2026

Claude finished @mergennachin's task in 5m 20s


  • Read the changed files and understand the before/after
  • Explain the race condition
  • Explain how the fix works

The race before this PR

The core issue is in TextTokenGenerator (text_token_generator.h). Two threads interact with a single bool should_stop_ field:

Thread A (inference thread) — runs generate(), which contains a tight loop:

should_stop_ = false;              // reset at start of generate()

while (pos < start_pos + max_new_tokens) {
    // ... run model step, sample token, decode ...

    if (should_stop_) {            // read should_stop_
        break;
    }
}

Thread B (caller/UI thread) — calls stop() at any time to cancel generation:

void stop() {
    should_stop_ = true;           // write should_stop_
}

There is no synchronization between these two threads. Under the C++ memory model ([intro.races]/21), when two threads access the same non-atomic variable and at least one access is a write, that's a data race, and the behavior is undefined — not just "might read a stale value" but fully UB, meaning the compiler and hardware are free to do anything.

In practice, the consequences are architecture-dependent:

  1. Compiler optimization (any platform): Because should_stop_ is a plain bool with no synchronization, the compiler is within its rights to assume no other thread modifies it. It can hoist the read of should_stop_ out of the loop entirely, turning if (should_stop_) break; into a single check before the loop starts — after which the loop can never observe the stop request regardless of what thread B does. This is a well-known optimization that GCC and Clang actually perform at -O2 and above (the variable gets loaded into a register once and the register is re-checked each iteration, never re-reading from memory).

  2. Store buffer visibility (ARM): ARM has a weakly-ordered memory model. A write on one core sits in that core's store buffer before being flushed to the cache coherence domain. Without a fence or atomic operation, there is no guarantee that thread A's core will ever see thread B's write in a timely manner — or at all, from the compiler's perspective, since it's UB.

How std::atomic<bool> fixes it

The fix changes the declaration from:

bool should_stop_ = false;

to:

std::atomic<bool> should_stop_{false};

And all accesses use explicit atomic operations:

// In generate() — reset:
should_stop_.store(false, std::memory_order_relaxed);

// In generate() — check:
if (should_stop_.load(std::memory_order_relaxed)) { break; }

// In stop() — signal:
should_stop_.store(true, std::memory_order_relaxed);

This fixes the race because:

  1. No more UB. std::atomic operations are explicitly defined to be data-race-free. Concurrent loads and stores on the same atomic variable are well-defined behavior in the C++ standard.

  2. Guaranteed atomicity. The compiler must emit load/store instructions that are indivisible (no torn reads). On ARM this maps to plain ldr/str of an aligned 32-bit word, which is inherently atomic at the hardware level.

  3. No hoisting out of the loop. The compiler treats every atomic::load() as potentially returning a different value (even with relaxed ordering). It cannot cache the result in a register and skip re-reading. Every loop iteration re-executes the load from memory.

  4. Why relaxed is sufficient. memory_order_relaxed provides only atomicity — no ordering guarantees relative to other memory operations. That's all we need here because should_stop_ is a standalone cancellation flag. There's no other shared data whose visibility needs to be ordered relative to seeing the stop flag. Stronger orderings (acquire/release, seq_cst) would add unnecessary fence instructions. On ARM, relaxed atomic loads/stores compile to plain ldr/str with no barriers — same instructions as the original code, but with the critical difference that the compiler now knows not to optimize away the cross-thread communication.

In short: the original code was UB that happened to work in unoptimized builds but could silently break under compiler optimizations or on weakly-ordered architectures. The fix makes the cross-thread signaling well-defined with zero runtime overhead.

