[Gaudi][Model] Qwen2.5-vl #870
Conversation
I have clean-cloned this branch and tested the qwen2.5-vl pytests:

$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"

INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
...
============================== 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ==============================

I will do more testing with image, video and mixed prompts next.
Thanks for the review, @michalkuligowski!

@dsocek Adding Daniel to take a look here too.

@libinta FYI.

@michalkuligowski Any more suggestions? I just synced with main and rebased the branch.
@@ -0,0 +1 @@
+transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9
Let's not add a new requirements file per model. Why is a specific SHA required? I believe this should be added to the README instead.
Qwen2.5-VL is officially supported starting from Transformers v4.49.0. However, our vllm-fork is currently out of date and supports only v4.48.3, which doesn't include Qwen2.5-VL; the fork's code is also too old to use 4.49. Without the pin, model inspection fails:
File "/root/tf/qwen/vllm-fork-w2/vllm/model_executor/models/registry.py", line 370, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
For now, this specific commit works for Qwen2.5-VL without changing too much. Once we update the vllm-fork to the latest upstream and Transformers to 4.49, all of this can go away.
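The minimum-version gate being discussed can be sketched in plain Python. This is an illustration only: `parse_version` and `supports_qwen2_5_vl` are hypothetical helpers, not vLLM or Transformers APIs; the real check lives in the `min_transformers_version` machinery shown in the diff below.

```python
# Hedged sketch of a minimum-version gate like the min_transformers_version
# check discussed in this thread. Helper names are illustrative, not vLLM APIs.
def parse_version(v: str) -> tuple:
    # Pad to three numeric components so "4.49" compares equal to "4.49.0".
    parts = (v.split(".") + ["0", "0", "0"])[:3]
    return tuple(int(p) for p in parts)

def supports_qwen2_5_vl(installed: str, minimum: str = "4.49.0") -> bool:
    # Qwen2.5-VL needs Transformers >= 4.49.0 per the discussion above.
    return parse_version(installed) >= parse_version(minimum)

print(supports_qwen2_5_vl("4.48.3"))  # False
print(supports_qwen2_5_vl("4.49.0"))  # True
```

A real implementation would also need to handle dev/rc suffixes (e.g. `4.49.0.dev0`), which this integer-only sketch ignores.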
@michalkuligowski FYI: We raised this error on the upstream vllm repo, and they mentioned it's because of the vllm-fork version. vllm-project#12932 (comment)
 "Qwen2VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-VL-2B-Instruct"),  # noqa: E501
 "Qwen2_5_VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-VL-3B-Instruct",  # noqa: E501
-                                                      min_transformers_version="4.49"),  # noqa: E501
+                                                      min_transformers_version="4.48.9"),  # noqa: E501
Please see the comment above related to the Transformers version.
from .vision import get_vit_attn_backend

logger = init_logger(__name__)
is_hpu = current_platform.is_hpu()
This is used in only one place here, so I don't think you need to save it in a variable; that keeps the changes to the model file as small as possible.
We also need this for FusedSDPA, will update the code.
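The pattern being debated here — caching the platform check in a module-level flag so more than one code path can reuse it — can be sketched as follows. `detect_hpu` is a stand-in for `current_platform.is_hpu()` (it is not a real vLLM function), and the backend names are only illustrative:

```python
# Hedged sketch: a module-level is_hpu flag reused in more than one place
# (attention-backend choice AND the FusedSDPA path), which is why inlining
# the check at a single call site wouldn't cover both uses.
def detect_hpu() -> bool:
    # Stand-in for vllm.platforms.current_platform.is_hpu(); always False here.
    return False

is_hpu = detect_hpu()

def select_vit_attn_backend() -> str:
    # FusedSDPA is HPU-only; fall back to plain SDPA elsewhere.
    return "FusedSDPA" if is_hpu else "SDPA"

print(select_vit_attn_backend())  # SDPA
```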
return path_to_rope


def make_mrope_positions_tensor_with_pad( \

                               dtype=torch.long,
                               device='cpu')
if self.model_is_mrope:
    input_positions_tensor = \
Let's not add another variable. Also, can this if-else clause be simplified further?
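As a rough illustration of what the helper in the diff might do (plain Python, no torch; the function name mirrors the diff but the body is a guess), padding M-RoPE positions could look like:

```python
# Hedged sketch of what make_mrope_positions_tensor_with_pad might do:
# pad each of the three M-RoPE rows (temporal, height, width) out to a
# common max_len so they can be stacked into one tensor.
def make_mrope_positions_with_pad(positions, max_len, pad=0):
    # positions: list of three rows, one per M-RoPE dimension
    return [row + [pad] * (max_len - len(row)) for row in positions]

padded = make_mrope_positions_with_pad([[0, 1, 2], [0, 1], [0]], max_len=4)
print(padded)  # [[0, 1, 2, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

The real code would build a `torch.long` tensor on CPU from this padded structure, per the `dtype=torch.long, device='cpu'` arguments visible in the diff.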
if self.model_is_mrope:
    input_positions = None  # type: ignore
else:
    input_mrope_positions = None  # type: ignore

input_positions = torch.tensor(input_positions
                               or input_mrope_positions,
Can this be simplified to `input_mrope_positions if self.model_is_mrope else input_positions` in the `torch.tensor` call?
We mostly followed the CPU code, but I agree that it can be simplified. Just rebased and applied these changes.
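The reviewer's suggested simplification, sketched with plain lists (`pick_positions` is a hypothetical helper; in the real code the chosen list goes straight into the `torch.tensor` call):

```python
# Hedged sketch: select the positions list inline with a conditional
# expression, instead of setting one variable to None and relying on
# `input_positions or input_mrope_positions`.
def pick_positions(model_is_mrope, input_positions, input_mrope_positions):
    return input_mrope_positions if model_is_mrope else input_positions

print(pick_positions(True, [0, 1, 2], [[0], [0], [0]]))   # [[0], [0], [0]]
print(pick_positions(False, [0, 1, 2], [[0], [0], [0]]))  # [0, 1, 2]
```

One advantage over the `a or b` idiom: the conditional expression still works when the selected list is empty, whereas `[] or other` would silently fall through to `other`.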
Tests were failing. I checked that it's due to an environment issue (`Error: clients are not available, please run hlctl kubeconfig`). I clicked "rerun failed jobs".
@kzawora-intel @michalkuligowski I re-triggered the failed jobs manually a few times, but still no luck. Is there any way we can run this and have this PR merged?
Hi @jiminha, the failing tests are blocked on SW-218309.
Fails in the rotary_embed layer in the view op; bypassing it with alternative PyTorch code, since otherwise it was corrupting image_grid_thw to 0,0,0, etc.
Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Pallavi Jaini <pallavi.jaini@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Sayantan Sarkar <sayantan.sarkar@intel.com>
Co-authored-by: Gustavo Malkomes <gustavo.malkomes@intel.com>
Initial enablement of Qwen2.5-vl for Gaudi HPU. See #870

Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Pallavi Jaini <pallavi.jaini@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Sayantan Sarkar <sayantan.sarkar@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Initial enablement of Qwen2.5-vl for Gaudi HPU
Based on vllm-project#12604. FIXES: vllm-project#12486, vllm-project#12532
Added `HPU_DISABLE_TENSOR_CACHE` to set `disable_tensor_cache` in `htorch.hpu.wrap_in_hpu_graph`. It keeps the default value as `True` for all models, but we set it to `False` for MRoPE models such as Qwen2.5-vl.

Note: Set `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false` to run qwen models; see README_GAUDI.

To install the vLLM with Qwen2.5-VL enabled:
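The install steps, as reported in the test run earlier in this thread (paths and filenames are those used there):

```shell
# Install steps from the test report above; the extra requirements file
# pins the transformers commit needed for Qwen2.5-VL on this fork.
pip install -r requirements-hpu.txt
pip install -r requirements-hpu-qwen2_5_vl.txt
python setup.py develop
```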
--
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com