[Gaudi][Model] Qwen2.5-vl #870
Conversation
I have clean-cloned this branch and tested the qwen2.5-vl pytests:

$ pip install -r requirements-hpu.txt; pip install -r requirements-hpu-qwen2_5_vl.txt; python setup.py develop
$ VLLM_SKIP_WARMUP=true pytest tests/models/decoder_only/vision_language/test_models.py -s -v -k "[qwen2_5"

INFO 02-27 17:31:46 __init__.py:199] Automatically detected platform hpu.
================================================================================================================================================ test session starts =================================================================================================================================================
platform linux -- Python 3.10.12, pytest-8.3.4, pluggy-1.5.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /devops/sgohari/tests/jira/hs-4927/pr/vllm-fork
configfile: pyproject.toml
plugins: anyio-4.8.0, typeguard-4.3.0
collected 185 items / 173 deselected / 12 selected
tests/models/decoder_only/vision_language/test_models.py::test_single_image_models[qwen2_5_vl-test_case28] INFO 02-27 17:31:59 config.py:548] This model supports multiple tasks: {'generate', 'embed', 'score', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 02-27 17:31:59 llm_engine.py:234] Initializing a V0 LLM engine (v0.1.dev5293+gff97945) with config: model='Qwen/Qwen2.5-VL-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 02-27 17:32:01 utils.py:2359] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter,scheduler_config not implemented in <vllm.worker.hpu_worker.HPUWorker object at 0x7fba9599ba90>
WARNING 02-27 17:32:01 hpu.py:84] Pin memory is not supported on HPU.
INFO 02-27 17:32:01 hpu.py:35] Using HPUAttention backend.
VLLM_PROMPT_BS_BUCKET_MIN=1 (default:1)
VLLM_PROMPT_BS_BUCKET_STEP=32 (default:32)
VLLM_PROMPT_BS_BUCKET_MAX=2 (default:2)
VLLM_DECODE_BS_BUCKET_MIN=1 (default:1)
VLLM_DECODE_BS_BUCKET_STEP=32 (default:32)
VLLM_DECODE_BS_BUCKET_MAX=2 (default:2)
VLLM_PROMPT_SEQ_BUCKET_MIN=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_STEP=128 (default:128)
VLLM_PROMPT_SEQ_BUCKET_MAX=1024 (default:1024)
VLLM_DECODE_BLOCK_BUCKET_MIN=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_STEP=128 (default:128)
VLLM_DECODE_BLOCK_BUCKET_MAX=128 (default:128)
Prompt bucket config (min, step, max_warmup) bs:[1, 32, 2], seq:[128, 128, 1024]
Decode bucket config (min, step, max_warmup) bs:[1, 32, 2], block:[128, 128, 128]
============================= HABANA PT BRIDGE CONFIGURATION ===========================
PT_HPU_LAZY_MODE = 1
PT_RECIPE_CACHE_PATH =
PT_CACHE_FOLDER_DELETE = 0
PT_HPU_RECIPE_CACHE_CONFIG =
PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
PT_HPU_LAZY_ACC_PAR_MODE = 1
PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
PT_HPU_EAGER_PIPELINE_ENABLE = 1
PT_HPU_EAGER_COLLECTIVE_PIPELINE_ENABLE = 1
---------------------------: System Configuration :---------------------------
Num CPU Cores : 20
CPU RAM : 113320300 KB
------------------------------------------------------------------------------
INFO 02-27 17:32:05 config.py:2992] cudagraph sizes specified by model runner [] is overridden by config []
Detected flags: [-compile_one_hot -cpu -fp32_softmax +fsdpa -gaudi +gaudi2 -gaudi3]
INFO 02-27 17:32:06 loader.py:423] Loading weights on hpu...
INFO 02-27 17:32:06 weight_utils.py:254] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:31<00:31, 31.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 40.84s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [01:18<00:00, 39.37s/it]
...
============================== 12 passed, 173 deselected, 59 warnings in 1558.82s (0:25:58) ==============================

I will do more testing with image, video and mixed prompts next.
Thanks for the review, @michalkuligowski!

@dsocek Adding Daniel to take a look here too.

@libinta FYI.

@michalkuligowski Any more suggestions? I just synced with main and rebased the branch.
@@ -0,0 +1 @@
+transformers @ git+https://github.com/huggingface/transformers.git@6b550462139655d488d4c663086a63e98713c6b9
Let's not add a new requirements file per model. Why is a specific SHA required? I believe this should be added to the README instead.
Qwen2.5-VL is officially supported starting from Transformers v4.49.0. However, our vllm-fork is currently out of date and supports only v4.48.3, which doesn't include Qwen2.5-VL; the fork's code is also too old to use 4.49. Without the pin, model inspection fails:
File "/root/tf/qwen/vllm-fork-w2/vllm/model_executor/models/registry.py", line 370, in _raise_for_unsupported
raise ValueError(
ValueError: Model architectures ['Qwen2_5_VLForConditionalGeneration'] failed to be inspected. Please check the logs for more details.
For now, this specific commit works for Qwen2.5-VL without changing too much. Once we update the vllm-fork to the latest upstream and Transformers to 4.49, all of this can go away.
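The minimum-version gate being discussed can be sketched in plain Python. This is an illustration only: `parse_version` and `supports_qwen2_5_vl` are hypothetical helpers, not vLLM or Transformers APIs; the real check lives in the `min_transformers_version` machinery shown in the diff below.

```python
# Hedged sketch of a minimum-version gate like the min_transformers_version
# check discussed in this thread. Helper names are illustrative, not vLLM APIs.
def parse_version(v: str) -> tuple:
    # Pad to three numeric components so "4.49" compares equal to "4.49.0".
    parts = (v.split(".") + ["0", "0", "0"])[:3]
    return tuple(int(p) for p in parts)

def supports_qwen2_5_vl(installed: str, minimum: str = "4.49.0") -> bool:
    # Qwen2.5-VL needs Transformers >= 4.49.0 per the discussion above.
    return parse_version(installed) >= parse_version(minimum)

print(supports_qwen2_5_vl("4.48.3"))  # False
print(supports_qwen2_5_vl("4.49.0"))  # True
```

A real implementation would also need to handle dev/rc suffixes (e.g. `4.49.0.dev0`), which this integer-only sketch ignores.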
@michalkuligowski FYI: We raised this error on the upstream vllm repo, and they mentioned it's because of the vllm-fork version. vllm-project#12932 (comment)
 "Qwen2VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2-VL-2B-Instruct"),  # noqa: E501
 "Qwen2_5_VLForConditionalGeneration": _HfExamplesInfo("Qwen/Qwen2.5-VL-3B-Instruct",  # noqa: E501
-                                                      min_transformers_version="4.49"),  # noqa: E501
+                                                      min_transformers_version="4.48.9"),  # noqa: E501
Please see the comment above related to the Transformers version.
from .vision import get_vit_attn_backend

logger = init_logger(__name__)
is_hpu = current_platform.is_hpu()
This is used in only one place here, so I don't think you need to save it in a variable; that keeps the changes to the model file as small as possible.
We also need this for FusedSDPA, will update the code.
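The pattern being debated here — caching the platform check in a module-level flag so more than one code path can reuse it — can be sketched as follows. `detect_hpu` is a stand-in for `current_platform.is_hpu()` (it is not a real vLLM function), and the backend names are only illustrative:

```python
# Hedged sketch: a module-level is_hpu flag reused in more than one place
# (attention-backend choice AND the FusedSDPA path), which is why inlining
# the check at a single call site wouldn't cover both uses.
def detect_hpu() -> bool:
    # Stand-in for vllm.platforms.current_platform.is_hpu(); always False here.
    return False

is_hpu = detect_hpu()

def select_vit_attn_backend() -> str:
    # FusedSDPA is HPU-only; fall back to plain SDPA elsewhere.
    return "FusedSDPA" if is_hpu else "SDPA"

print(select_vit_attn_backend())  # SDPA
```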
return path_to_rope


def make_mrope_positions_tensor_with_pad( \

                               dtype=torch.long,
                               device='cpu')
if self.model_is_mrope:
    input_positions_tensor = \
Let's not add another variable. Also, can this if-else clause be simplified further?
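As a rough illustration of what the helper in the diff might do (plain Python, no torch; the function name mirrors the diff but the body is a guess), padding M-RoPE positions could look like:

```python
# Hedged sketch of what make_mrope_positions_tensor_with_pad might do:
# pad each of the three M-RoPE rows (temporal, height, width) out to a
# common max_len so they can be stacked into one tensor.
def make_mrope_positions_with_pad(positions, max_len, pad=0):
    # positions: list of three rows, one per M-RoPE dimension
    return [row + [pad] * (max_len - len(row)) for row in positions]

padded = make_mrope_positions_with_pad([[0, 1, 2], [0, 1], [0]], max_len=4)
print(padded)  # [[0, 1, 2, 0], [0, 1, 0, 0], [0, 0, 0, 0]]
```

The real code would build a `torch.long` tensor on CPU from this padded structure, per the `dtype=torch.long, device='cpu'` arguments visible in the diff.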
if self.model_is_mrope:
    input_positions = None  # type: ignore
else:
    input_mrope_positions = None  # type: ignore

input_positions = torch.tensor(input_positions
                               or input_mrope_positions,
Can this be simplified to `input_mrope_positions if self.model_is_mrope else input_positions` in the `torch.tensor` call?
We mostly followed the CPU code, but I agree that it can be simplified. Just rebased and applied these changes.
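The reviewer's suggested simplification, sketched with plain lists (`pick_positions` is a hypothetical helper; in the real code the chosen list goes straight into the `torch.tensor` call):

```python
# Hedged sketch: select the positions list inline with a conditional
# expression, instead of setting one variable to None and relying on
# `input_positions or input_mrope_positions`.
def pick_positions(model_is_mrope, input_positions, input_mrope_positions):
    return input_mrope_positions if model_is_mrope else input_positions

print(pick_positions(True, [0, 1, 2], [[0], [0], [0]]))   # [[0], [0], [0]]
print(pick_positions(False, [0, 1, 2], [[0], [0], [0]]))  # [0, 1, 2]
```

One advantage over the `a or b` idiom: the conditional expression still works when the selected list is empty, whereas `[] or other` would silently fall through to `other`.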
Tests were failing. I checked that it's due to an environment issue (`Error: clients are not available, please run hlctl kubeconfig`). I clicked "rerun failed jobs".
@kzawora-intel @michalkuligowski I re-triggered the failed jobs manually a few times, but still no luck. Is there any way we can run this and have this PR merged?
Hi @jiminha, the failing tests are blocked on SW-218309.
Fails in the rotary_embed layer in the view op; bypassing it with alternative PyTorch code, since otherwise it was corrupting image_grid_thw to 0,0,0, etc.
Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Pallavi Jaini <pallavi.jaini@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Sayantan Sarkar <sayantan.sarkar@intel.com>
Co-authored-by: Gustavo Malkomes <gustavo.malkomes@intel.com>
Initial enablement of Qwen2.5-vl for Gaudi HPU. See #870

Co-authored-by: Mohit Deopujari <mohit.deopujari@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Pallavi Jaini <pallavi.jaini@intel.com>
Co-authored-by: Deepak Narayana <deepak.narayana@intel.com>
Co-authored-by: Sayantan Sarkar <sayantan.sarkar@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Initial enablement of Qwen2.5-vl for Gaudi HPU
Based on vllm-project#12604. FIXES: vllm-project#12486, vllm-project#12532
Added `HPU_DISABLE_TENSOR_CACHE` to set `disable_tensor_cache` in `htorch.hpu.wrap_in_hpu_graph`. It keeps the default value as `True` for all models, but we set it to `False` for MRoPE models such as Qwen2.5-vl.

Note: Set `PT_HPUGRAPH_DISABLE_TENSOR_CACHE=false` to run qwen models; see README_GAUDI.

To install the vLLM with Qwen2.5-VL enabled:
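The install steps, as reported in the test run earlier in this thread (paths and filenames are those used there):

```shell
# Install steps from the test report above; the extra requirements file
# pins the transformers commit needed for Qwen2.5-VL on this fork.
pip install -r requirements-hpu.txt
pip install -r requirements-hpu-qwen2_5_vl.txt
python setup.py develop
```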
--
Co-authored-by: Mohit Deopujari mohit.deopujari@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Pallavi Jaini pallavi.jaini@intel.com
Co-authored-by: Deepak Narayana deepak.narayana@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com