Support group_size=64 in HybridW4A16 and wvSplitK_int4_g #905

Merged
mgehre-amd merged 1 commit into gfx11 from matthias.fix-group-size-64 on May 4, 2026

mgehre-amd commented Apr 27, 2026

The HIP wvSplitK_int4_g C++ kernel only supported group_size 32 and 128, but HybridW4A16LinearKernel accepted 32, 64, 128, and 256. When a model using group_size=64 (e.g. RedHatAI/Qwen3-1.7B-quantized.w4a16) hit the decode path, the C++ kernel rejected it at runtime.
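
The mismatch can be illustrated with a minimal Python sketch. The two sets are taken from the description above; the variable and function names are hypothetical and are not the actual vLLM code.

```python
# Illustrative only: values come from the PR description, names are hypothetical.
PYTHON_ACCEPTED = {32, 64, 128, 256}  # accepted by HybridW4A16LinearKernel before the fix
KERNEL_ACCEPTED = {32, 128}           # accepted by wvSplitK_int4_g before the fix


def rejected_at_runtime(python_accepted, kernel_accepted):
    """Group sizes that pass the Python-side check but fail in the C++ kernel."""
    return sorted(python_accepted - kernel_accepted)


print(rejected_at_runtime(PYTHON_ACCEPTED, KERNEL_ACCEPTED))  # [64, 256]
```

Any size in that difference passes kernel selection and only fails once the decode path reaches the C++ check, which is the situation the group_size=64 model ran into.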

The kernel template already handles arbitrary group sizes that are multiples of A_CHUNK (16), so the fix extends the TORCH_CHECK and the WVSPLIT_INT4G_GS dispatch macro to include 64. SUPPORTED_GROUP_SIZES is narrowed to [32, 64, 128] so there is no mismatch between what can_implement accepts and what the C++ kernel supports.
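
A rough Python model of the behaviour after the fix, assuming the dispatch amounts to a lookup over the instantiated group sizes. `A_CHUNK = 16` and the three sizes come from the description; `dispatch_group_size` and `KERNEL_INSTANTIATIONS` are hypothetical stand-ins for the TORCH_CHECK and the WVSPLIT_INT4G_GS macro, not the real code.

```python
A_CHUNK = 16  # chunk width named in the description

# After the fix both sides agree on the same list.
SUPPORTED_GROUP_SIZES = [32, 64, 128]  # Python side (can_implement)
KERNEL_INSTANTIATIONS = (32, 64, 128)  # hypothetical: sizes with a template instantiation


def dispatch_group_size(group_size):
    """Mimic the guard plus dispatch: reject sizes with no instantiation."""
    if group_size not in KERNEL_INSTANTIATIONS:
        raise ValueError(f"unsupported group_size {group_size}")
    # Every instantiated size must satisfy the template's constraint.
    assert group_size % A_CHUNK == 0
    return f"wvSplitK_int4_g<group_size={group_size}>"


# Every size that can_implement accepts now reaches a kernel instantiation.
for gs in SUPPORTED_GROUP_SIZES:
    print(dispatch_group_size(gs))
```

Note that 256 is also a multiple of 16, but the PR removes it from the Python list instead of adding another instantiation.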

Build time impact: skinny_gemms_int4.hip.o compile time increases from 158s to 233s (+47%) due to the additional template instantiations for group_size=64.
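
The quoted percentage follows directly from the two timings; a quick check using only the numbers stated above:

```python
before_s, after_s = 158, 233  # compile times from the description, in seconds
increase_pct = (after_s - before_s) / before_s * 100
print(f"+{after_s - before_s}s, +{increase_pct:.0f}%")  # +75s, +47%
```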

mgehre-amd requested a review from gshtras as a code owner April 27, 2026 12:33
mgehre-amd removed the request for review from gshtras April 27, 2026 12:33

roberteg16 left a comment

LGTM

Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated
Comment thread csrc/rocm/skinny_gemms_int4.cu
Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated
Comment thread csrc/rocm/skinny_gemms_int4.cu
Comment thread csrc/rocm/skinny_gemms_int4.cu Outdated
Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
mgehre-amd force-pushed the matthias.fix-group-size-64 branch from 6804683 to ab089fb May 4, 2026 06:07
mgehre-amd merged commit 83d5d47 into gfx11 May 4, 2026
3 of 4 checks passed