
[Gaudi][Model] Qwen2.5-vl #870

Merged
malkomes merged 34 commits into HabanaAI:qwen2.5-vl-hpu from malkomes:qwen2.5-vl-hpu on Mar 17, 2025

Conversation

@malkomes malkomes commented Feb 26, 2025

Initial enablement of Qwen2.5-vl for Gaudi HPU
Based on vllm-project#12604. FIXES: vllm-project#12486, vllm-project#12532.

  • Introduces the flag HPU_DISABLE_TENSOR_CACHE to set disable_tensor_cache in htorch.hpu.wrap_in_hpu_graph. The default stays True for all models, but we set it to False for MRoPE models such as Qwen2.5-VL (see the sketch after this list).
  • Computes MRoPE positions and deltas in the HPU model runner.
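
A minimal sketch of how such an env flag can drive disable_tensor_cache when wrapping the model in an HPU graph. The helper name and the exact env-var plumbing below are illustrative assumptions, not the PR's code:

```python
import os

import habana_frameworks.torch as htorch


def maybe_wrap_in_hpu_graph(model):
    # Hypothetical helper: tensor-cache disabling defaults to True; MRoPE
    # models such as Qwen2.5-VL need it set to False (flag set to "false").
    disable_tensor_cache = os.environ.get(
        "PT_HPUGRAPH_DISABLE_TENSOR_CACHE", "true").lower() == "true"
    return htorch.hpu.wrap_in_hpu_graph(
        model, disable_tensor_cache=disable_tensor_cache)
```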

Note

Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false to run Qwen models; see README_GAUDI.
To install vLLM with Qwen2.5-VL enabled:

pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt; python setup.py develop

--
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com

Comment thread requirements-hpu-qwen2_5_vl.txt
@imangohari1

I clean-cloned this branch and ran the qwen2.5-vl pytests.
All 12 tests pass; details below.

$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"
INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected                                                                                                                                                                                                                                                                   

tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False, 
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
 PT_HPU_EAGER_PIPELINE_ENABLE = 1
 PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM       : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
...
============================================================================================================================ 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ==============================================

I will do more testing with image, video and mixed prompts next.
CC: @malkomes @jiminha

Comment thread vllm/model_executor/models/qwen2_5_vl.py Outdated
Comment thread vllm/worker/hpu_model_runner.py Outdated
Comment thread vllm/worker/hpu_model_runner.py Outdated
@malkomes
Author

Thanks for the review, @michalkuligowski.
I think I addressed your comments; let me know if I missed anything.

@imangohari1

@dsocek Adding Daniel to take a look here too.

@jiminha

jiminha commented Mar 3, 2025

@libinta FYI,

@malkomes malkomes added the New Model (Issue or PR to enable a new model) label Mar 4, 2025
@malkomes
Author

malkomes commented Mar 4, 2025

@michalkuligowski any more suggestions? I just synced with main and rebased the branch.

@@ -0,0 +1 @@
transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9

Let's not add a new requirements file per model. Why is a specific SHA required? I believe this should go in the README instead.


Qwen2.5-VL is officially supported starting with Transformers v4.49.0. However, our vllm-fork currently supports only v4.48.3, which does not include Qwen2.5-VL, and the fork's code is too far out of date to move to v4.49.

 File "/root/tf/qwen/vllm-fork-w2/vllm/model_executor/models/registry.py", line 370, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.

For now, this specific commit works for Qwen2.5-VL without changing too much. Once we update the vllm-fork to the latest and Transformers to 4.49, all of this can go away.


@michalkuligowski FYI: We raised this error on the upstream vllm repo, and they mentioned it's because of the vllm-fork version. vllm-project#12932 (comment)

Comment thread tests/models/registry.py
"Qwen2VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-VL-2B-Instruct"), # noqa: E501
"Qwen2_5_VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-VL-3B-Instruct", # noqa: E501
min_transformers_version="4.49"), # noqa: E501
min_transformers_version="4.48.9"), # noqa: E501

Why is this decreased?


Please see the comment above regarding the Transformers version.

from .vision import get_vit_attn_backend

logger = init_logger(__name__)
is_hpu = current_platform.is_hpu()

This is used in only one place here, so I don't think you need to save it in a variable; that keeps the changes to the model file as small as possible.


We also need this for FusedSDPA; we will update the code.
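
For context, a rough sketch of the pattern under discussion (a sketch under assumed structure, not the exact PR code): the platform check is cached once at module level and reused both for backend selection and for the FusedSDPA path mentioned above.

```python
from vllm.platforms import current_platform

# Cached once at module import time.
is_hpu = current_platform.is_hpu()

if is_hpu:
    # Habana's fused scaled-dot-product attention kernel.
    from habana_frameworks.torch.hpex.kernels import FusedSDPA
```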

Comment thread vllm/worker/hpu_model_runner.py Outdated
return path_to_rope


def make_mrope_positions_tensor_with_pad( \

Please move to utils.py
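
For illustration only, a hypothetical sketch of what such a padding helper might look like once moved to utils.py; the PR's actual implementation and signature may differ:

```python
from typing import List

import torch


def make_mrope_positions_tensor_with_pad(
        input_positions: List[List[List[int]]],
        max_len: int,
        pad: int = 0) -> torch.Tensor:
    # MRoPE positions arrive as one (3, seq_len) nested list per sequence
    # (temporal/height/width); right-pad each row to the bucket length.
    padded = [[row + [pad] * (max_len - len(row)) for row in seq]
              for seq in input_positions]
    return torch.tensor(padded, dtype=torch.long, device='cpu')
```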

Comment thread vllm/worker/hpu_model_runner.py Outdated
dtype=torch.long,
device='cpu')
if self.model_is_mrope:
input_positions_tensor = \

Let's not add another variable.
Also, can this if-else clause be simplified further?

Comment thread vllm/worker/hpu_model_runner.py Outdated
Comment on lines +1375 to +1389
if self.model_is_mrope:
input_positions = None # type: ignore
else:
input_mrope_positions = None # type: ignore

input_positions = torch.tensor(input_positions
or input_mrope_positions,

Can this be simplified to `input_mrope_positions if self.model_is_mrope else input_positions` in the torch.tensor call?
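
Applied, the suggestion would look roughly like this (a sketch of the suggested simplification, not the merged diff):

```python
input_positions = torch.tensor(
    input_mrope_positions if self.model_is_mrope else input_positions,
    dtype=torch.long,
    device='cpu')
```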

Author

@malkomes malkomes Mar 10, 2025

We mostly follow the CPU code, but I agree that it could be simplified. I just rebased and applied these changes.

Comment thread README_GAUDI.md Outdated

Tests were failing. I checked that it's due to an env issue (Error: clients are not available, please run hlctl kubeconfig).
I clicked rerun failed jobs.

@jiminha

jiminha commented Mar 12, 2025

@kzawora-intel @michalkuligowski I triggered the failed jobs manually a few times, but still no luck. Is there any way we can run this and have this PR merged?

@michalkuligowski

Hi @jiminha Failing tests are blocked on SW-218309

@malkomes malkomes changed the base branch from habana_main to qwen2.5-vl-hpu March 17, 2025 23:16
Fails in rotary_embed layer in the view
bypassing it with alternative pt code
else it was editing image_grid_thw to 0,0,0 etc
@malkomes malkomes merged this pull request into HabanaAI:qwen2.5-vl-hpu Mar 17, 2025
malkomes added a commit that referenced this pull request Mar 18, 2025
Initial enablement of Qwen2.5-vl for Gaudi HPU
See #870

Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Pallavi Jaini <pallavi.jaini@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Sayantan Sarkar <sayantan.sarkar@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
@malkomes malkomes deleted the qwen2.5-vl-hpu branch March 20, 2025 03:08

Labels

New Model (Issue or PR to enable a new model)

Development

Successfully merging this pull request may close these issues.

[New Model]: Qwen2.5-VL

6 participants