Skip to content

Fix IBM J9 crash: replace pthread_cleanup_push with abi::__forced_unwind catch (SCP-1154)#492

Open
jbachorik wants to merge 4 commits intomainfrom
muse/impl-20260420-123402
Open

Fix IBM J9 crash: replace pthread_cleanup_push with abi::__forced_unwind catch (SCP-1154)#492
jbachorik wants to merge 4 commits intomainfrom
muse/impl-20260420-123402

Conversation

@jbachorik
Copy link
Copy Markdown
Collaborator

@jbachorik jbachorik commented Apr 20, 2026

What does this PR do?:
Fixes a JVM crash on IBM J9 / OpenJ9 when profiling is enabled (SCP-1154, regression from ddprof 1.39.0).

Replaces pthread_cleanup_push/pop in both start_routine_wrapper variants with an explicit catch(abi::__forced_unwind&) guard. Also adds the same guard to J9WallClock::timerLoop to ensure VM::detachThread() is always called when J9 cancels the sampler thread during JVM shutdown.

Motivation:
IBM J9's thread teardown raises _Unwind_ForcedUnwind via libgcc (sourced from libj9thr29.so). In C++ mode, pthread_cleanup_push expands to a __pthread_cleanup_class object whose destructor is implicitly noexcept (C++11). When J9's forced-unwind propagates through this noexcept context, std::terminate()abort() is called, crashing the JVM.

This was a regression introduced in commit 2063c659 ("Prevent potential race in thread startup and cleanup dead code"), which added pthread_cleanup_push/pop to start_routine_wrapper. The fix replaces that mechanism with an explicit catch(abi::__forced_unwind&) + throw; pattern, which is the GCC/glibc-recommended approach for cleanup code that must run during POSIX thread cancellation.

Root cause (from Reshmi Anand's investigation):

  • Crash stack: abort() ← libjavaProfiler.so ← _Unwind_ForcedUnwind ← libj9thr29.so
  • Version bisect: ddprof 1.37.0 (no crash) → 1.39.0 (crash)
  • Trigger: -Ddd.profiling.enabled=true on IBM WAS 9.0.5.16 with OpenJDK 1.8.0_462

Additional Notes:

  • abi::__forced_unwind is correct: J9 calls _Unwind_ForcedUnwind from libgcc (confirmed by crash stack frame _Unwind_ForcedUnwind ← libj9thr29.so), which uses the GCC/Itanium ABI __forced_unwind type.
  • pthread_exit() is also covered: on glibc it raises its own __forced_unwind.
  • The thread_cleanup static helper is removed (no longer needed).
  • In start_routine_wrapper_spec (aarch64), tid is now captured before the try block to avoid the lazy-allocating ProfiledThread::current() path inside the catch.
  • J9WallClock::timerLoop PushLocalFrame/PopLocalFrame balance: if cancellation fires between push/pop, DetachCurrentThread (via VM::detachThread()) releases the outstanding JNI local frame per the JNI spec.

How to test the change?:
New regression test ddprof-lib/src/test/cpp/forced_unwind_ut.cpp runs on Linux without a JVM and verifies:

  1. catch(abi::__forced_unwind&) fires on pthread_cancel and the cleanup block runs.
  2. The throw; re-throw lets the thread exit as PTHREAD_CANCELED (not via std::terminate()).
  3. ProfiledThread::release() completes safely inside the catch block.
  4. pthread_exit() (which also uses __forced_unwind on glibc) is caught by the same pattern.

Run: ./gradlew :ddprof-lib:gtestRelease --tests "ForcedUnwindTest*"

End-to-end IBM J9 reproduction requires the Docker-based reproducer from the SCP-1154 investigation artifacts.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: SCP-1154

🤖 Generated with Claude Code via muse implement

@jbachorik jbachorik added the AI label Apr 20, 2026
jbachorik and others added 2 commits April 20, 2026 15:42
…ind catch

SCP-1154 / regression from ddprof 1.39.0 (commit 2063c65).

pthread_cleanup_push in C++ mode creates __pthread_cleanup_class whose
destructor is implicitly noexcept.  When IBM J9's thread teardown raises
_Unwind_ForcedUnwind (via libgcc, sourced from libj9thr29.so), the C++
runtime calls std::terminate() -> abort() as the forced-unwind exception
propagates through the noexcept destructor context.

Replace pthread_cleanup_push/pop in both start_routine_wrapper variants
with an explicit catch(abi::__forced_unwind&) block that performs the same
cleanup (unregisterThread + release) and re-throws to let the thread exit
cleanly.  Apply the same guard to J9WallClock::timerLoop to ensure
VM::detachThread() is always called even when J9 cancels the sampler thread
during JVM shutdown.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Tests verify that catch(abi::__forced_unwind&)+rethrow correctly:
- intercepts pthread_cancel-triggered forced unwind and runs cleanup
- lets the thread exit as PTHREAD_CANCELED after rethrow
- keeps ProfiledThread::release() safe to call in the catch block
- also covers pthread_exit() which uses the same exception on glibc

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@jbachorik jbachorik force-pushed the muse/impl-20260420-123402 branch from 8be1c43 to 574d47e Compare April 20, 2026 13:43
@jbachorik jbachorik marked this pull request as ready for review April 20, 2026 13:47
@jbachorik jbachorik requested a review from a team as a code owner April 20, 2026 13:47
@jbachorik jbachorik requested a review from zhengyu123 April 20, 2026 13:47
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented Apr 20, 2026

CI Test Results

Run: #24728001460 | Commit: 498f41d | Duration: 25m 23s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-04-21 14:55:38 UTC

@zhengyu123
Copy link
Copy Markdown
Contributor

Claude found two relevant issues:

  • j9WallClock.cpp:60
    VM::attachThread and malloc run before the try{} block — if a forced-unwind fires in that narrow window, cleanup is skipped despite the comment claiming otherwise.

  • libraryPatcher_linux.cpp :97
    On aarch64/musl/jdk11 path, init_thread_tls() and start_window_and_register() use separate SignalBlocker scopes — signals are unblocked between the two calls, creating a gap where the init-window guard isn't set.

- j9WallClock.cpp: move VM::attachThread and malloc into the try block
  (jni/frames pre-initialized to null) so catch(abi::__forced_unwind&)
  runs cleanup if the unwind fires during setup.
- libraryPatcher_linux.cpp: merge init_thread_tls and start_window_and_register
  into a single noinline init_tls_and_register that holds one SignalBlocker
  across initCurrentThread() + startInitWindow() + registerThread(),
  closing the signal-unblocked gap on aarch64/musl/jdk11.

Co-Authored-By: muse <muse@noreply>
@jbachorik
Copy link
Copy Markdown
Collaborator Author

Thanks — both gaps are real. Fixed in 1cb5f6c:

  • j9WallClock.cpp: VM::attachThread and malloc are now inside the try{} block, with jni and frames declared as nullptr before it so the catch handler's free(frames) (a no-op on null) and VM::detachThread() are safe if the unwind fires mid-setup. Comment updated accordingly.
  • libraryPatcher_linux.cpp: init_thread_tls() and start_window_and_register() merged into a single __attribute__((noinline)) init_tls_and_register() that holds one SignalBlocker spanning initCurrentThread() + startInitWindow() + registerThread(), closing the gap on aarch64/musl/jdk11. noinline is preserved so sigset_t still stays off start_routine_wrapper_spec's frame.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants