Support group_size=64 in HybridW4A16 and wvSplitK_int4_g by mgehre-amd · Pull Request #905 · ROCm/vllm

mgehre-amd · 2026-04-27T12:33:34Z

The HIP wvSplitK_int4_g C++ kernel only supported group_size 32 and 128, but HybridW4A16LinearKernel accepted 32, 64, 128, and 256. When a model using group_size=64 (e.g. RedHatAI/Qwen3-1.7B-quantized.w4a16) hit the decode path, the C++ kernel rejected it at runtime.

The kernel template already handles arbitrary group sizes that are multiples of A_CHUNK (16), so the fix extends the TORCH_CHECK and the WVSPLIT_INT4G_GS dispatch macro to include 64. SUPPORTED_GROUP_SIZES is narrowed to [32, 64, 128] so there is no mismatch between what can_implement accepts and what the C++ kernel supports.
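The mismatch described above can be illustrated with a small sketch. It is a simplified model, not the actual vLLM code: the group-size lists come from this description, while `unsupported_by_kernel` and the `can_implement` signature shown here are hypothetical stand-ins (the real check lives on HybridW4A16LinearKernel and may take different arguments).

```python
from typing import Optional, Tuple

# Values taken from the PR description.
A_CHUNK = 16
KERNEL_GROUP_SIZES_BEFORE = (32, 128)      # accepted by the C++ kernel before this PR
KERNEL_GROUP_SIZES_AFTER = (32, 64, 128)   # accepted by the C++ kernel after this PR
PY_GROUP_SIZES_BEFORE = [32, 64, 128, 256] # accepted on the Python side before this PR
SUPPORTED_GROUP_SIZES = [32, 64, 128]      # narrowed list after this PR


def unsupported_by_kernel(python_sizes, kernel_sizes):
    """Return sizes the Python layer accepts but the kernel would reject (hypothetical helper)."""
    return [gs for gs in python_sizes if gs not in kernel_sizes]


def can_implement(group_size: int) -> Tuple[bool, Optional[str]]:
    """Hypothetical, simplified stand-in for the Python-side can_implement check."""
    if group_size not in SUPPORTED_GROUP_SIZES:
        return False, f"group_size={group_size} not in {SUPPORTED_GROUP_SIZES}"
    return True, None


if __name__ == "__main__":
    # Before the fix: 64 and 256 pass the Python check but fail in the kernel.
    print(unsupported_by_kernel(PY_GROUP_SIZES_BEFORE, KERNEL_GROUP_SIZES_BEFORE))
    # After the fix: nothing slips through.
    print(unsupported_by_kernel(SUPPORTED_GROUP_SIZES, KERNEL_GROUP_SIZES_AFTER))
    print(can_implement(64))
    print(can_implement(256))
```

With both lists kept identical, an unsupported group size is rejected by the Python-side check instead of surfacing as a runtime error in the decode path.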

Build time impact: skinny_gemms_int4.hip.o compile time increases from 158s to 233s (+47%) due to the additional template instantiations for group_size=64.
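As a quick check of the figures above, the following sketch recomputes the relative increase from the two measured compile times. The comparison against the number of group-size variants is only a rough assumption (it presumes compile time scales with the number of instantiated group sizes), not a measurement.

```python
# Compile times for skinny_gemms_int4.hip.o, taken from the PR description.
before_s = 158
after_s = 233

extra_seconds = after_s - before_s
increase_pct = 100.0 * extra_seconds / before_s

# Rough assumption: cost scales with the number of group-size variants (2 before, 3 after).
variants_before = 2  # group_size 32 and 128
variants_after = 3   # group_size 32, 64 and 128
variant_growth_pct = 100.0 * (variants_after - variants_before) / variants_before

if __name__ == "__main__":
    print(f"extra compile time: {extra_seconds}s ({increase_pct:.1f}%)")
    print(f"growth in group-size variants: {variant_growth_pct:.0f}%")
```

The measured increase is close to the growth in the number of group-size variants, which fits the explanation that the extra time comes from the additional template instantiations.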

roberteg16

LGTM

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

mgehre-amd requested review from eble-amd and roberteg16 April 27, 2026 12:33

mgehre-amd requested a review from gshtras as a code owner April 27, 2026 12:33

mgehre-amd removed the request for review from gshtras April 27, 2026 12:33

roberteg16 approved these changes Apr 27, 2026

eble-amd reviewed Apr 27, 2026

Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated

Comment thread csrc/rocm/skinny_gemms_int4.cu

Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated

Comment thread csrc/rocm/skinny_gemms_int4.cu

Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated

eble-amd approved these changes Apr 28, 2026

mgehre-amd force-pushed the matthias.fix-group-size-64 branch from 6804683 to ab089fb Compare May 4, 2026 06:07

mgehre-amd merged commit 83d5d47 into gfx11 May 4, 2026
3 of 4 checks passed
