[Bugfix] Fix OOM caused by cumem allocator inflating memory_reserved() #37111
haosdent wants to merge 1 commit into vllm-project:main
Conversation
Code Review
This pull request addresses an Out-Of-Memory (OOM) error caused by inaccurate memory measurement when using the cumem allocator. The fix correctly changes the memory profiling logic to use mem_get_info() instead of the problematic memory_reserved(), providing a more accurate measure of consumed memory and preventing over-allocation of the KV cache. The changes are consistently propagated through the memory calculation logic in gpu_worker.py, and a test case is updated to validate the new behavior. The implementation appears correct and effectively addresses the described bug. I have no further comments as the changes are sound.
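To make the contrast concrete, here is a minimal sketch (not code from this PR) of the two ways of measuring consumed GPU memory that the review describes; `free_before` is assumed to be an earlier reading of `torch.cuda.mem_get_info()`:

```python
import torch

def consumed_per_caching_allocator() -> int:
    # What the old formula effectively trusted: bytes PyTorch's caching
    # allocator believes it has reserved. This number stays inflated when
    # memory is released behind its back (e.g. via direct cuMemUnmap).
    return torch.cuda.memory_reserved()

def consumed_per_driver(free_before: int) -> int:
    # What the fix relies on: physical free memory as reported by the CUDA
    # driver, accurate regardless of which allocator did the work.
    free_now, _total = torch.cuda.mem_get_info()
    return free_before - free_now
```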
This pull request has merge conflicts that must be resolved before it can be merged.
haosdent force-pushed from 960b2c1 to b27b401
When the cumem allocator's cleanup bypasses PyTorch's allocator tracking via direct cuMemUnmap, memory_reserved() becomes inflated, making non_torch_memory negative and underestimating non-KV-cache memory usage. This causes OOM on large models (e.g. gpt-oss-120b on GH200 144GB).

Replace the memory_reserved()-based formula with a mem_get_info()-based measurement (total_consumed) that is always accurate regardless of which allocator is used. Use transient_peak_headroom (torch_peak - torch_allocated) instead of torch_peak_increase to avoid double-counting persistent torch allocations already included in total_consumed.

Fixes vllm-project#37096

Signed-off-by: haosdent <haosdent@gmail.com>
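A rough sketch of the two peak-related quantities the commit message distinguishes (illustrative only and run on a CUDA device; the stat keys are standard `torch.cuda.memory_stats()` keys, while the variable names mirror the commit message rather than vLLM's code):

```python
import torch

stats = torch.cuda.memory_stats()
torch_peak = stats.get("allocated_bytes.all.peak", 0)          # peak live bytes
torch_allocated = stats.get("allocated_bytes.all.current", 0)  # live bytes right now

# Bytes that were needed only transiently at the peak. Anything still live is
# already part of the mem_get_info()-based total_consumed, so adding the full
# peak increase on top of total_consumed would count those bytes twice.
transient_peak_headroom = torch_peak - torch_allocated
```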
cc @tjtanaa @jikunshang @JartX will this PR cause issues on ROCm? It's relying on mem_get_info()
Hi @MatthewBonanni, I just tested the PR on my ROCm setup and it loaded correctly. Thanks for mentioning me. 😊 @AndreasKaratzas
Purpose
When the cumem allocator's cleanup (`use_memory_pool()` exit in `cumem.py`) manually calls `unmap_and_release()` to free cached blocks, it bypasses PyTorch's allocator tracking. This inflates `torch.cuda.memory_reserved()`, making `non_torch_memory` (`cuda_memory - memory_reserved()`) go deeply negative. The downstream effect is that `non_kv_cache_memory` is underestimated, causing vLLM to over-allocate KV cache and OOM on large models.

This was exposed after #32947, when a Python syntax bug fix (`with A and B:` → `with (A, B):`) properly enabled the cumem allocator for model weight loading.

The fix replaces the `memory_reserved()`-based formula with a `mem_get_info()`-based measurement (`total_consumed = before_create.free_memory - after_profile.free_memory`). The CUDA driver's `mem_get_info()` always returns accurate physical free/total memory regardless of which allocator is used. Additionally, it uses `transient_peak_headroom` (`torch_peak - torch_allocated`) instead of `torch_peak_increase` to avoid double-counting persistent torch allocations already included in `total_consumed`.

Fixes #37096
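Putting the pieces together, a hedged sketch of the revised accounting (the snapshot class, its fields, and the final combination are assumptions based on this description, not the exact code in `gpu_worker.py`):

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    free_memory: int      # torch.cuda.mem_get_info()[0] at snapshot time
    torch_peak: int       # caching-allocator peak live bytes
    torch_allocated: int  # caching-allocator live bytes at snapshot time

def estimate_non_kv_cache_memory(before_create: Snapshot,
                                 after_profile: Snapshot) -> int:
    # Physical memory consumed between the two snapshots. Because it comes
    # from the driver, it is correct whether the cumem allocator or PyTorch's
    # caching allocator (or neither) managed the allocations.
    total_consumed = before_create.free_memory - after_profile.free_memory

    # Memory needed only transiently at the profiling peak. Persistent torch
    # allocations are already inside total_consumed, so only the headroom
    # above what is still allocated gets added back.
    transient_peak_headroom = (after_profile.torch_peak
                               - after_profile.torch_allocated)

    return total_consumed + transient_peak_headroom
```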
Test Plan
Updated `test_memory_profiling_persistent_torch` to verify that persistent torch allocations are not double-counted in `non_kv_cache_memory`. Quantities involved: `non_torch_increase`, `total_consumed` (`mem_get_info`), `non_kv_cache_memory` (a stand-alone sketch of this check is given below).

Test Result
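Below is a stand-alone sketch of the double-counting property the updated test plan describes. It is an editor's illustration (sizes, function name, and assertions are assumptions), not the PR's test or its results, and it needs a CUDA device to run:

```python
import torch

def check_persistent_alloc_not_double_counted() -> None:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    free_before, _ = torch.cuda.mem_get_info()

    # One persistent allocation (stays alive) and one transient (freed again).
    persistent = torch.empty(256 << 20, dtype=torch.uint8, device="cuda")
    transient = torch.empty(128 << 20, dtype=torch.uint8, device="cuda")
    del transient
    torch.cuda.synchronize()

    free_after, _ = torch.cuda.mem_get_info()
    total_consumed = free_before - free_after  # physical bytes gone, per the driver

    stats = torch.cuda.memory_stats()
    transient_peak_headroom = (stats["allocated_bytes.all.peak"]
                               - stats["allocated_bytes.all.current"])

    # The persistent tensor shows up once, inside total_consumed; the headroom
    # term only re-adds memory that was needed at the peak but is free again.
    assert total_consumed >= persistent.numel()
    assert transient_peak_headroom >= 0
    print(f"total_consumed={total_consumed}, "
          f"transient_peak_headroom={transient_peak_headroom}")

if __name__ == "__main__":
    check_persistent_alloc_not_double_counted()
```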