[Bugfix] Fix OOM caused by cumem allocator inflating memory_reserved() #37111
haosdent wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request addresses an Out-Of-Memory (OOM) error caused by inaccurate memory measurement when using the cumem allocator. The fix correctly changes the memory profiling logic to use mem_get_info() instead of the problematic memory_reserved(), providing a more accurate measure of consumed memory and preventing over-allocation of the KV cache. The changes are consistently propagated through the memory calculation logic in gpu_worker.py, and a test case is updated to validate the new behavior. The implementation appears correct and effectively addresses the described bug. I have no further comments as the changes are sound.
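To make the contrast concrete, here is a minimal sketch (not code from this PR) of the two ways of measuring consumed GPU memory that the review describes; `free_before` is assumed to be an earlier reading of `torch.cuda.mem_get_info()`:

```python
import torch

def consumed_per_caching_allocator() -> int:
    # What the old formula effectively trusted: bytes PyTorch's caching
    # allocator believes it has reserved. This number stays inflated when
    # memory is released behind its back (e.g. via direct cuMemUnmap).
    return torch.cuda.memory_reserved()

def consumed_per_driver(free_before: int) -> int:
    # What the fix relies on: physical free memory as reported by the CUDA
    # driver, accurate regardless of which allocator did the work.
    free_now, _total = torch.cuda.mem_get_info()
    return free_before - free_now
```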
This pull request has merge conflicts that must be resolved before it can be merged.
haosdent force-pushed from 960b2c1 to b27b401
When the cumem allocator's cleanup bypasses PyTorch's allocator tracking via direct cuMemUnmap, memory_reserved() becomes inflated, making non_torch_memory negative and underestimating non-KV-cache memory usage. This causes OOM on large models (e.g. gpt-oss-120b on GH200 144GB).

Replace the memory_reserved()-based formula with a mem_get_info()-based measurement (total_consumed) that is always accurate regardless of which allocator is used. Use transient_peak_headroom (torch_peak - torch_allocated) instead of torch_peak_increase to avoid double-counting persistent torch allocations already included in total_consumed.

Fixes vllm-project#37096

Signed-off-by: haosdent <haosdent@gmail.com>
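A rough sketch of the two peak-related quantities the commit message distinguishes (illustrative only and run on a CUDA device; the stat keys are standard `torch.cuda.memory_stats()` keys, while the variable names mirror the commit message rather than vLLM's code):

```python
import torch

stats = torch.cuda.memory_stats()
torch_peak = stats.get("allocated_bytes.all.peak", 0)          # peak live bytes
torch_allocated = stats.get("allocated_bytes.all.current", 0)  # live bytes right now

# Bytes that were needed only transiently at the peak. Anything still live is
# already part of the mem_get_info()-based total_consumed, so adding the full
# peak increase on top of total_consumed would count those bytes twice.
transient_peak_headroom = torch_peak - torch_allocated
```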
cc @tjtanaa @jikunshang @JartX will this PR cause issues on ROCm? It's relying on mem_get_info()
Hi @MatthewBonanni, I just tested the PR on my ROCm setup and it loaded correctly. Thanks for mentioning me. 😊 @AndreasKaratzas
Purpose
When the cumem allocator's cleanup (`use_memory_pool()` exit in `cumem.py`) manually calls `unmap_and_release()` to free cached blocks, it bypasses PyTorch's allocator tracking. This inflates `torch.cuda.memory_reserved()`, making `non_torch_memory` (`cuda_memory - memory_reserved()`) go deeply negative. The downstream effect is that `non_kv_cache_memory` is underestimated, causing vLLM to over-allocate KV cache and OOM on large models.

This was exposed after #32947, when a Python syntax bug fix (`with A and B:` → `with (A, B):`) properly enabled the cumem allocator for model weight loading.

The fix replaces the `memory_reserved()`-based formula with a `mem_get_info()`-based measurement (`total_consumed = before_create.free_memory - after_profile.free_memory`). The CUDA driver's `mem_get_info()` always returns accurate physical free/total memory regardless of which allocator is used. Additionally, it uses `transient_peak_headroom` (`torch_peak - torch_allocated`) instead of `torch_peak_increase` to avoid double-counting persistent torch allocations already included in `total_consumed`.

Fixes #37096
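Putting the pieces together, a hedged sketch of the revised accounting (the snapshot class, its fields, and the final combination are assumptions based on this description, not the exact code in `gpu_worker.py`):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    free_memory: int      # torch.cuda.mem_get_info()[0] at snapshot time
    torch_peak: int       # caching-allocator peak live bytes
    torch_allocated: int  # caching-allocator live bytes at snapshot time

def estimate_non_kv_cache_memory(before_create: Snapshot,
                                 after_profile: Snapshot) -> int:
    # Physical memory consumed between the two snapshots. Because it comes
    # from the driver, it is correct whether the cumem allocator or PyTorch's
    # caching allocator (or neither) managed the allocations.
    total_consumed = before_create.free_memory - after_profile.free_memory

    # Memory needed only transiently at the profiling peak. Persistent torch
    # allocations are already inside total_consumed, so only the headroom
    # above what is still allocated gets added back.
    transient_peak_headroom = (after_profile.torch_peak
                               - after_profile.torch_allocated)

    return total_consumed + transient_peak_headroom
```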
Test Plan
Updated `test_memory_profiling_persistent_torch` to verify that persistent torch allocations are not double-counted in `non_kv_cache_memory`. Quantities involved: `non_torch_increase`, `total_consumed` (`mem_get_info`), `non_kv_cache_memory` (a stand-alone sketch of this check is given below).

Test Result
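Below is a stand-alone sketch of the double-counting property the updated test plan describes. It is an editor's illustration (sizes, function name, and assertions are assumptions), not the PR's test or its results, and it needs a CUDA device to run:

```python
import torch

def check_persistent_alloc_not_double_counted() -> None:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    free_before, _ = torch.cuda.mem_get_info()

    # One persistent allocation (stays alive) and one transient (freed again).
    persistent = torch.empty(256 << 20, dtype=torch.uint8, device="cuda")
    transient = torch.empty(128 << 20, dtype=torch.uint8, device="cuda")
    del transient
    torch.cuda.synchronize()

    free_after, _ = torch.cuda.mem_get_info()
    total_consumed = free_before - free_after  # physical bytes gone, per the driver

    stats = torch.cuda.memory_stats()
    transient_peak_headroom = (stats["allocated_bytes.all.peak"]
                               - stats["allocated_bytes.all.current"])

    # The persistent tensor shows up once, inside total_consumed; the headroom
    # term only re-adds memory that was needed at the peak but is free again.
    assert total_consumed >= persistent.numel()
    assert transient_peak_headroom >= 0
    print(f"total_consumed={total_consumed}, "
          f"transient_peak_headroom={transient_peak_headroom}")

if __name__ == "__main__":
    check_persistent_alloc_not_double_counted()
```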