[Bug]: NVRTC JIT compilation fails on CUDA 12.8 (smxx_clean_logits.cuh: expression must have a constant value) #295
Description
Describe the bug
When running DeepGEMM on CUDA 12.8, the `fp8_mqa_logits` kernel fails to JIT-compile via NVRTC. The compiler reports that `cute::numeric_limits<float>::infinity()` cannot be used in a constant expression.
Error Log
```
NVRTC log: "kernel.cu": creating precompiled header file "kernel.pch"
/usr/local/lib/python3.12/dist-packages/deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh(17): error: expression must have a constant value
  constexpr float neg_inf = -cute::numeric_limits<float>::infinity();
                            ^
/usr/local/cuda/include/cuda/std/detail/libcxx/include/limits(686): note #2703-D: cannot call non-constexpr function "cuda::std::__4::__libcpp_numeric_limits<float, true>::infinity"
```
Root Cause & Investigation
In CUDA 12.8 device code, the implementation of `cuda::std::numeric_limits<float>::infinity()` inside `libcxx/include/limits` lacks the `constexpr` qualifier under NVRTC's stripped-down compilation environment, so it cannot be used to initialize a `constexpr` variable.
What we tried before finding the fix
- Cleared the JIT cache (`/root/.deep_gemm`, `/workspace/.deep_gemm_cache/cache`).
- Reinstalled DeepGEMM non-editable (`pip install --no-build-isolation .`).
- Removed stale editable-install artifacts (`.egg-info`, `.pth` files).
- Set `CPLUS_INCLUDE_PATH` to the CUTLASS headers — NVRTC ignores it.
- Replaced the call with a standard built-in (`__builtin_inff()`) — NVRTC strips host-side built-ins and throws an undefined-identifier error.
- Replaced it with a native CUDA intrinsic (`__int_as_float(0xff800000)`) — NVRTC refuses to evaluate it as a constant expression.
- Implemented a PyTorch `try/except` fallback using `torch.einsum` to bypass the crashed kernel — this unblocked the model run, confirming the JIT failure was the only roadblock.
- Only a raw `-1e38f` literal worked as a true compile-time constant for NVRTC.
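The `try/except` fallback mentioned above can be sketched roughly like this. NumPy stands in for torch, and the function names, shapes, and einsum spec here are illustrative assumptions, not DeepGEMM's actual interface:

```python
import numpy as np

NEG_INF = -1e38  # same raw literal as the proposed kernel patch

def logits_einsum_fallback(q, k, mask):
    """Plain-einsum reference path (shapes and spec are illustrative).

    q: [heads, dim], k: [seq, dim] -> logits: [heads, seq]
    """
    logits = np.einsum('hd,sd->hs', q.astype(np.float32), k.astype(np.float32))
    # mask out disallowed positions with a large negative value
    return np.where(mask, logits, NEG_INF)

def mqa_logits(q, k, mask, kernel=None):
    # Try the compiled kernel first; on a JIT failure (e.g. the NVRTC
    # constexpr error on CUDA 12.8) fall back to the einsum path.
    if kernel is not None:
        try:
            return kernel(q, k, mask)
        except RuntimeError:
            pass
    return logits_einsum_fallback(q, k, mask)
```

The real fallback wraps the `deep_gemm.fp8_mqa_logits` call the same way, at some throughput cost, which is why it only confirms the diagnosis rather than replacing the fix.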
Proposed Fix / Workaround
To maintain compatibility with CUDA 12.8 without requiring users to upgrade to 12.9, replace the standard-library call with a raw float literal; this fixes the NVRTC compilation immediately.
In `deep_gemm/include/deep_gemm/impls/smxx_clean_logits.cuh` (line 17), change:

```cuda
constexpr float neg_inf = -cute::numeric_limits<float>::infinity();
```

to:

```cuda
constexpr float neg_inf = -1e38f;
```
Since `-1e38f` effectively acts as negative infinity when masking attention scores in the softmax step (its exponential underflows to zero), the change preserves accuracy while bypassing the NVRTC constexpr limitation.
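A quick sanity check of that claim, in plain Python: with the usual max-subtracted softmax, a score of `-1e38` underflows to a weight of exactly zero, matching what a true `-inf` mask produces.

```python
import math

def softmax(scores):
    # numerically stable softmax with max subtraction
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

NEG_INF = -1e38  # the raw literal from the patch

probs = softmax([2.0, 1.0, NEG_INF])
# exp(-1e38 - 2.0) underflows cleanly to 0.0, so the masked
# position gets exactly zero probability — identical to -inf:
assert probs == softmax([2.0, 1.0, float('-inf')])
```

This is why the literal is a safe substitute for attention masking, even though it is not a true IEEE infinity.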
Environment
- DeepGEMM Version: Version 2.3.0, commit d30fc36
- GPU: 1x NVIDIA H100 SXM (80GB HBM3, SM90)
- System Resources: 26 vCPUs, 251 GB Memory
- Environment: Docker container (template: `runpod-torch-v280`)
- CUDA Version: 12.8 (system)
- Python Version: 3.12
- PyTorch Version: 2.8.0+cu128