Performance: CUDA turbo4/turbo4 shows severe generation slowdown at large contexts #22258
Closed · TheLanozavr started this conversation in General
Short version: the turbo4 KV cache on the RTX 3090 becomes very slow as prompt size grows.
Details:
Setup: NVIDIA GeForce GTX 1050 Ti connected to the display; NVIDIA GeForce RTX 3090 (24 GB) without a display, used only for llama.cpp. Driver version: 560.94 (the 1050 does not work with the latest driver). CUDA version: 12.6. Shell: Git Bash.
Build commit: 59798f1
Model: unsloth/Qwen3.5-27B-GGUF:Q5_K_S
Server command (the same command was rerun with different cache types):

```
TURBO_LAYER_ADAPTIVE=0 ./build/bin/Release/llama-server.exe -hf unsloth/Qwen3.5-27B-GGUF:Q5_K_S --device CUDA0 -ngl 99 -np 1 --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --host 0.0.0.0 --offline --no-mmproj-offload -c 262144
```

Test procedure: I wrote a Python script that iterates over prompt sizes from 1k to 250k tokens. For each size it runs 3 stress tests that measure how fast the model can process increasingly large contexts. Each test creates a "needle in a haystack" scenario: a secret passphrase is hidden at the start of a large text, and the model must retrieve it when asked at the end. The performance values come from the API response, not from calculations in the script: the llama.cpp completion endpoint returns timing data in timings.prompt_per_second and timings.predicted_per_second (the actual tokens_evaluated is also logged).
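A minimal sketch of that test loop, assuming a local server on the default port (the passphrase, filler text, and token-sizing heuristic here are illustrative placeholders, not the exact values used; the timing fields read from the response are the ones the llama.cpp completion endpoint actually returns):

```python
# Simplified reconstruction of the stress-test loop, not the exact script.
import json
import urllib.request

SERVER = "http://localhost:8080/completion"  # assumed host/port
PASSPHRASE = "zebra-quantum-42"              # hypothetical needle

def run_needle_test(prompt_tokens: int) -> dict:
    # Needle in a haystack: secret at the start, filler in the middle,
    # retrieval question at the end. Each filler repeat is roughly
    # 10 tokens, so the repeat count approximates the target prompt size.
    filler = "The quick brown fox jumps over the lazy dog. " * (prompt_tokens // 10)
    prompt = (
        f"The secret passphrase is: {PASSPHRASE}\n\n"
        + filler
        + "\n\nWhat is the secret passphrase stated at the beginning?"
    )
    payload = json.dumps({"prompt": prompt, "n_predict": 32}).encode()
    req = urllib.request.Request(
        SERVER, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)
    # Speeds come from the server's own timing data, not client-side clocks.
    return {
        "tokens_evaluated": result["tokens_evaluated"],
        "prompt_tps": result["timings"]["prompt_per_second"],
        "gen_tps": result["timings"]["predicted_per_second"],
    }

for size in (1_000, 4_000, 16_000, 64_000, 250_000):
    for attempt in range(3):  # 3 stress tests per prompt size
        print(size, attempt, run_needle_test(size))
```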
Tested KV cache type combinations:
- q8_0/q8_0 (max context 158k)
- q4_0/q4_0 (max context 262k)
- turbo4/turbo4 (max context 262k)
- turbo3/turbo3 (max context 262k)
- q8_0/turbo4 (max context 201k)
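For completeness, a hedged sketch of how the sweep over these combinations can be driven (the flags, env var, and model match the server command above; the fixed load wait and teardown logic are placeholders, not the actual harness):

```python
import os
import subprocess
import time

COMBOS = [
    ("q8_0", "q8_0"),      # max context 158k
    ("q4_0", "q4_0"),      # max context 262k
    ("turbo4", "turbo4"),  # max context 262k
    ("turbo3", "turbo3"),  # max context 262k
    ("q8_0", "turbo4"),    # max context 201k
]

for ctk, ctv in COMBOS:
    # -c was lowered for the q8_0 combinations (see the max-context notes above).
    server = subprocess.Popen(
        [
            "./build/bin/Release/llama-server.exe",
            "-hf", "unsloth/Qwen3.5-27B-GGUF:Q5_K_S",
            "--device", "CUDA0", "-ngl", "99", "-np", "1",
            "--flash-attn", "on",
            "--cache-type-k", ctk, "--cache-type-v", ctv,
            "--host", "0.0.0.0", "--offline", "--no-mmproj-offload",
            "-c", "262144",
        ],
        env={**os.environ, "TURBO_LAYER_ADAPTIVE": "0"},
    )
    try:
        time.sleep(120)  # crude wait for the model to load (placeholder)
        # ... run the needle-in-haystack tests from the sketch above ...
    finally:
        server.terminate()
        server.wait()
```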
Prompt processing speed: starts at ~1200 tok/s and drops to ~515 tok/s at a 250k prompt. This is the same for all cache types.
Generation speed (full numbers are in the raw results attached below):
Note: q8_0/q4_0 uses all CPU cores and generation is very slow, ~150 tok/s. I tried the original llama.cpp and the result is the same.
Is there any chance for turbo4 to match the speed of q8_0 or q4_0?
Raw results: speed.zip