[Bug]: NVRTC JIT compilation fails on CUDA 12.8 (smxx_clean_logits.cuh: expression must have a constant value) #295

@LilySu

Description

Describe the bug

When running DeepGEMM on CUDA 12.8, the fp8_mqa_logits kernel fails to JIT-compile via NVRTC. The compiler rejects cute::numeric_limits<float>::infinity() because it cannot be used in a constant expression.

Error Log

NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/dist-packages/deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh(17): error: expression must have a constant value
      constexpr float neg_inf = -cute::numeric_limits<float>::infinity();
                                ^
/usr/local/cuda/include/cuda/std/detail/libcxx/include/limits(686): note #2703-D: cannot call non-constexpr function "cuda::std::__4::__libcpp_numeric_limits<float, true>::infinity"

Root Cause & Investigation

In CUDA 12.8 device code, the underlying implementation of cuda::std::numeric_limits<float>::infinity() inside libcxx/include/limits is missing the constexpr qualifier under NVRTC's stripped-down compilation environment.

What we tried before finding the fix

  • Cleared JIT cache (/root/.deep_gemm, /workspace/.deep_gemm_cache/cache).
  • Reinstalled DeepGEMM non-editable (pip install --no-build-isolation .).
  • Removed stale editable install artifacts (.egg-info, .pth files).
  • Set CPLUS_INCLUDE_PATH to CUTLASS headers (NVRTC ignores it).
  • Replaced with standard built-ins (__builtin_inff()) — NVRTC does not provide this host-compiler built-in and reports an undefined-identifier error.
  • Replaced with native CUDA intrinsics (__int_as_float(0xff800000)) — NVRTC refuses to evaluate it as a constant expression.
  • Implemented a PyTorch try/except fallback using torch.einsum to bypass the failing kernel — this successfully unblocked the model run, confirming the JIT failure was the only blocker.
  • Only a raw -1e38f literal worked as a true compile-time constant under NVRTC.

Proposed Fix / Workaround

To maintain compatibility with CUDA 12.8 without requiring users to upgrade to 12.9, replace the standard-library call with a raw float literal; this fixes the NVRTC compilation immediately.

In deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh (Line 17):
Change from:
constexpr float neg_inf = -cute::numeric_limits<float>::infinity();

Change to:
constexpr float neg_inf = -1e38f;

Since -1e38f is large enough in magnitude that the subsequent exponentiation underflows to zero, it masks out attention scores during the softmax step exactly as negative infinity would, preserving accuracy while bypassing the NVRTC constexpr limitation.

Environment

  • DeepGEMM Version: Version 2.3.0, commit d30fc36
  • GPU: 1x NVIDIA H100 SXM (80GB HBM3, SM90)
  • System Resources: 26 vCPUs, 251 GB Memory
  • Environment: Docker container (Template: runpod-torch-v280)
  • CUDA Version: 12.8 (System)
  • Python Version: 3.12
  • PyTorch Version: 2.8.0+cu128
