Replies: 1 comment
You can compile with |
If you're running llama.cpp with HIP/ROCm on AMD GPUs and using Flash Attention with quantized KV cache, check whether your K and V cache types match.
-ctk q4_0 -ctv q4_0 (symmetric) → fused FA kernel
-ctk q4_0 -ctv f16 (asymmetric) → non-fused fallback
The fused path is significantly faster for token generation. The non-fused fallback exists for correctness when the K/V types differ, but it is not optimized to the same degree.
This is in the source: the fused FA kernels only support matching K/V quantization types. If they don't match, llama.cpp silently falls back to the slower non-fused implementation. There is no warning and no log message; you just get worse performance and don't know why.
Tested on RX 7900 XTX (gfx1100) with b8642, models ranging from 14B to 27B dense. Symmetric Q4_0 KV consistently used the fast path. Asymmetric Q4_0/F16 did not.
If you've been running -ctk q4_0 -ctv f16 on the assumption that F16 values give better quality at the cost of some VRAM, that quality tradeoff may still be worth it for your use case, but be aware that you're also paying a speed penalty from missing the fused kernel.
Q8_0/Q8_0 also works as a symmetric config if you want higher KV precision without losing the fused path.
Relevant source: ggml-cuda/fattn*.cu; check the type-matching guards.