[Bug]: NVRTC JIT compilation fails on CUDA 12.8 (smxx_clean_logits.cuh: expression must have a constant value) #295
Description
Describe the bug
When running DeepGEMM on CUDA 12.8, the `fp8_mqa_logits` kernel fails to JIT-compile via NVRTC. The compiler reports that `cute::numeric_limits<float>::infinity()` cannot be used in a constant expression.
Error Log
```
NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/dist-packages/deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh(17): error: expression must have a constant value
  constexpr float neg_inf = -cute::numeric_limits<float>::infinity();
                            ^
/usr/local/cuda/include/cuda/std/detail/libcxx/include/limits(686): note #2703-D: cannot call non-constexpr function "cuda::std::__4::__libcpp_numeric_limits<float, true>::infinity"
```
Root Cause & Investigation
In CUDA 12.8 device code, the implementation of `cuda::std::numeric_limits<float>::infinity()` inside `libcxx/include/limits` lacks the `constexpr` qualifier under NVRTC's stripped-down compilation environment, so it cannot be used to initialize a `constexpr` variable.
What we tried before finding the fix
- Cleared the JIT cache (`/root/.deep_gemm`, `/workspace/.deep_gemm_cache/cache`).
- Reinstalled DeepGEMM non-editable (`pip install --no-build-isolation .`).
- Removed stale editable-install artifacts (`.egg-info`, `.pth` files).
- Set `CPLUS_INCLUDE_PATH` to the CUTLASS headers — NVRTC ignores it.
- Replaced the call with a standard built-in (`__builtin_inff()`) — NVRTC strips host-side built-ins and throws an undefined-identifier error.
- Replaced it with a native CUDA intrinsic (`__int_as_float(0xff800000)`) — NVRTC refuses to evaluate it as a constant expression.
- Implemented a PyTorch `try/except` fallback using `torch.einsum` to bypass the crashed kernel — this unblocked the model run, confirming the JIT failure was the only roadblock.
- Only a raw `-1e38f` literal worked as a true compile-time constant for NVRTC.
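The `try/except` fallback mentioned above can be sketched roughly like this. NumPy stands in for torch, and the function names, shapes, and einsum spec here are illustrative assumptions, not DeepGEMM's actual interface:

```python
import numpy as np

NEG_INF = -1e38  # same raw literal as the proposed kernel patch

def logits_einsum_fallback(q, k, mask):
    """Plain-einsum reference path (shapes and spec are illustrative).

    q: [heads, dim], k: [seq, dim] -> logits: [heads, seq]
    """
    logits = np.einsum('hd,sd->hs', q.astype(np.float32), k.astype(np.float32))
    # mask out disallowed positions with a large negative value
    return np.where(mask, logits, NEG_INF)

def mqa_logits(q, k, mask, kernel=None):
    # Try the compiled kernel first; on a JIT failure (e.g. the NVRTC
    # constexpr error on CUDA 12.8) fall back to the einsum path.
    if kernel is not None:
        try:
            return kernel(q, k, mask)
        except RuntimeError:
            pass
    return logits_einsum_fallback(q, k, mask)
```

The real fallback wraps the `deep_gemm.fp8_mqa_logits` call the same way, at some throughput cost, which is why it only confirms the diagnosis rather than replacing the fix.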
Proposed Fix / Workaround
To maintain compatibility with CUDA 12.8 without requiring users to upgrade to 12.9, replace the standard-library call with a raw float literal; this fixes the NVRTC compilation immediately.
In `deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh` (line 17), change:

```cuda
constexpr float neg_inf = -cute::numeric_limits<float>::infinity();
```

to:

```cuda
constexpr float neg_inf = -1e38f;
```
Since `-1e38f` effectively acts as negative infinity when masking attention scores in the softmax step (its exponential underflows to zero), the change preserves accuracy while bypassing the NVRTC constexpr limitation.
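A quick sanity check of that claim, in plain Python: with the usual max-subtracted softmax, a score of `-1e38` underflows to a weight of exactly zero, matching what a true `-inf` mask produces.

```python
import math

def softmax(scores):
    # numerically stable softmax with max subtraction
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = -1e38  # the raw literal from the patch

probs = softmax([2.0, 1.0, NEG_INF])
# exp(-1e38 - 2.0) underflows cleanly to 0.0, so the masked
# position gets exactly zero probability — identical to -inf:
assert probs == softmax([2.0, 1.0, float('-inf')])
```

This is why the literal is a safe substitute for attention masking, even though it is not a true IEEE infinity.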
Environment
- DeepGEMM Version: Version 2.3.0, commit d30fc36
- GPU: 1x NVIDIA H100 SXM (80GB HBM3, SM90)
- System Resources: 26 vCPUs, 251 GB Memory
- Environment: Docker container (template: `runpod-torch-v280`)
- CUDA Version: 12.8 (system)
- Python Version: 3.12
- PyTorch Version: 2.8.0+cu128