Performance: CUDA turbo4/turbo4 shows severe generation slowdown at large contexts #22258
Closed · TheLanozavr started this conversation in General
Short version: the turbo4 KV cache on the RTX 3090 becomes very slow as prompt size grows.
Details:
Setup: NVIDIA GeForce GTX 1050 Ti connected to the display; NVIDIA GeForce RTX 3090 (24 GB) without a display, used only for llama.cpp. Driver version: 560.94 (the 1050 does not work with the latest driver). CUDA version: 12.6. Shell: Git Bash.
Build commit: 59798f1
Model: unsloth/Qwen3.5-27B-GGUF:Q5_K_S
Server command (the same command was rerun with different cache types):

```
TURBO_LAYER_ADAPTIVE=0 ./build/bin/Release/llama-server.exe -hf unsloth/Qwen3.5-27B-GGUF:Q5_K_S --device CUDA0 -ngl 99 -np 1 --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --offline --no-mmproj-offload -c 262144
```

Test procedure: I wrote a Python script that iterates over prompt sizes from 1k to 250k tokens. For each size it runs 3 stress tests that measure how fast the model can process increasingly large contexts. Each test creates a "needle in a haystack" scenario: a secret passphrase is hidden at the start of a large text, and the model must retrieve it when asked at the end. The performance values come from the API response, not from calculations in the script: the llama.cpp completion endpoint returns timing data in timings.prompt_per_second and timings.predicted_per_second (the actual tokens_evaluated is also logged).
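A minimal sketch of that test loop, assuming a local server on the default port (the passphrase, filler text, and token-sizing heuristic here are illustrative placeholders, not the exact values used; the timing fields read from the response are the ones the llama.cpp completion endpoint actually returns):

```python
# Simplified reconstruction of the stress-test loop, not the exact script.
import json
import urllib.request

SERVER = "http://localhost:8080/completion"  # assumed host/port
PASSPHRASE = "zebra-quantum-42"              # hypothetical needle

def run_needle_test(prompt_tokens: int) -> dict:
    # Needle in a haystack: secret at the start, filler in the middle,
    # retrieval question at the end. Each filler repeat is roughly
    # 10 tokens, so the repeat count approximates the target prompt size.
    filler = "The quick brown fox jumps over the lazy dog. " * (prompt_tokens // 10)
    prompt = (
        f"The secret passphrase is: {PASSPHRASE}\n\n"
        + filler
        + "\n\nWhat is the secret passphrase stated at the beginning?"
    )
    payload = json.dumps({"prompt": prompt, "n_predict": 32}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Speeds come from the server's own timing data, not client-side clocks.
    return {
        "tokens_evaluated": result["tokens_evaluated"],
        "prompt_tps": result["timings"]["prompt_per_second"],
        "gen_tps": result["timings"]["predicted_per_second"],
    }

for size in (1_000, 4_000, 16_000, 64_000, 250_000):
    for attempt in range(3):  # 3 stress tests per prompt size
        print(size, attempt, run_needle_test(size))
```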
Tested KV cache type combinations:
- q8_0/q8_0 (max context 158k)
- q4_0/q4_0 (max context 262k)
- turbo4/turbo4 (max context 262k)
- turbo3/turbo3 (max context 262k)
- q8_0/turbo4 (max context 201k)
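For completeness, a hedged sketch of how the sweep over these combinations can be driven (the flags, env var, and model match the server command above; the fixed load wait and teardown logic are placeholders, not the actual harness):

```python
import os
import subprocess
import time

COMBOS = [
    ("q8_0", "q8_0"),      # max context 158k
    ("q4_0", "q4_0"),      # max context 262k
    ("turbo4", "turbo4"),  # max context 262k
    ("turbo3", "turbo3"),  # max context 262k
    ("q8_0", "turbo4"),    # max context 201k
]

for ctk, ctv in COMBOS:
    # -c was lowered for the q8_0 combinations (see the max-context notes above).
    server = subprocess.Popen(
        [
            "./build/bin/Release/llama-server.exe",
            "-hf", "unsloth/Qwen3.5-27B-GGUF:Q5_K_S",
            "--device", "CUDA0", "-ngl", "99", "-np", "1",
            "--flash-attn", "on",
            "--cache-type-k", ctk, "--cache-type-v", ctv,
            "--host", "0.0.0.0", "--offline", "--no-mmproj-offload",
            "-c", "262144",
        ],
        env={**os.environ, "TURBO_LAYER_ADAPTIVE": "0"},
    )
    try:
        time.sleep(120)  # crude wait for the model to load (placeholder)
        # ... run the needle-in-haystack tests from the sketch above ...
    finally:
        server.terminate()
        server.wait()
```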
Prompt processing speed: starts at ~1200 tok/s and drops to ~515 tok/s at a 250k prompt. This is the same for all cache types.
Generation speed (full numbers are in the raw results attached below):
Note: q8_0/q4_0 uses all CPU cores and generation is very slow, ~150 tok/s. I tried the original llama.cpp and the result is the same.
Is there any chance for turbo4 to match the speed of q8_0 or q4_0?
Raw results: speed.zip