Still getting ///// on 1.3.0. #2

@alrunan

Description

If I use f16 for cache-type-v the output seems OK, but tbq4 or tbq3 outputs "//////////".

Launcher:
C:\Users\johne\llama.cpp-turboquant\llama-server.exe ^
-m C:\Users\johne\Desktop\Models\Qwen3.5-27B-UD-Q5_K_XL.gguf ^
-mm C:\Users\johne\Desktop\Models\Qwen3.5-27B-mmproj-F16.gguf ^
-c 131072 ^
--no-mmap ^
--temp 0.7 ^
--top-p 0.8 ^
--top-k 20 ^
--min-p 0.00 ^
--presence-penalty 1.5 ^
--repeat-penalty 1.0 ^
--chat-template-kwargs "{"enable_thinking":false}" ^
--reasoning-budget 1 ^
--cache-type-k tbqp4 ^
--cache-type-v tbq4 ^
--flash-attn on
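
Minimal request to reproduce (the server listens on 127.0.0.1:8080, per the log below). The message content is just a placeholder; any simple chat completion shows the problem here:

curl http://127.0.0.1:8080/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"messages\":[{\"role\":\"user\",\"content\":\"Hello\"}]}"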

Console output:
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 32606 MiB):
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes, VRAM: 32606 MiB
Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build: 1 (528bdb4) with MSVC 19.44.35225.0 for x64
system info: n_threads = 16, n_threads_batch = 16, total_threads = 32

system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 520 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

init: using 31 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model 'C:\Users\johne\Desktop\Models\Qwen3.5-27B-UD-Q5_K_XL.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 21582 MiB of device memory vs. 30391 MiB of free device memory
llama_params_fit_impl: will leave 8808 >= 1024 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.38 seconds
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 5090) (0000:01:00.0) - 30991 MiB free
llama_model_loader: loaded meta data with 49 key-value pairs and 851 tensors from C:\Users\johne\Desktop\Models\Qwen3.5-27B-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen35
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 20
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 0.600000
llama_model_loader: - kv 5: general.name str = Qwen3.5-27B
llama_model_loader: - kv 6: general.basename str = Qwen3.5-27B
llama_model_loader: - kv 7: general.quantized_by str = Unsloth
llama_model_loader: - kv 8: general.size_label str = 27B
llama_model_loader: - kv 9: general.license str = apache-2.0
llama_model_loader: - kv 10: general.license.link str = https://huggingface.co/Qwen/Qwen3.5-2...
llama_model_loader: - kv 11: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 12: general.base_model.count u32 = 1
llama_model_loader: - kv 13: general.base_model.0.name str = Qwen3.5 27B
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3.5-27B
llama_model_loader: - kv 16: general.tags arr[str,3] = ["qwen3_5_moe", "unsloth", "image-tex...
llama_model_loader: - kv 17: qwen35.block_count u32 = 64
llama_model_loader: - kv 18: qwen35.context_length u32 = 262144
llama_model_loader: - kv 19: qwen35.embedding_length u32 = 5120
llama_model_loader: - kv 20: qwen35.feed_forward_length u32 = 17408
llama_model_loader: - kv 21: qwen35.attention.head_count u32 = 24
llama_model_loader: - kv 22: qwen35.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen35.rope.dimension_sections arr[i32,4] = [11, 11, 10, 0]
llama_model_loader: - kv 24: qwen35.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 25: qwen35.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 26: qwen35.attention.key_length u32 = 256
llama_model_loader: - kv 27: qwen35.attention.value_length u32 = 256
llama_model_loader: - kv 28: qwen35.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 29: qwen35.ssm.state_size u32 = 128
llama_model_loader: - kv 30: qwen35.ssm.group_count u32 = 16
llama_model_loader: - kv 31: qwen35.ssm.time_step_rank u32 = 48
llama_model_loader: - kv 32: qwen35.ssm.inner_size u32 = 6144
llama_model_loader: - kv 33: qwen35.full_attention_interval u32 = 4
llama_model_loader: - kv 34: qwen35.rope.dimension_count u32 = 64
llama_model_loader: - kv 35: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 36: tokenizer.ggml.pre str = qwen35
llama_model_loader: - kv 37: tokenizer.ggml.tokens arr[str,248320] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 38: tokenizer.ggml.token_type arr[i32,248320] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 39: tokenizer.ggml.merges arr[str,247587] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 40: tokenizer.ggml.eos_token_id u32 = 248046
llama_model_loader: - kv 41: tokenizer.ggml.padding_token_id u32 = 248055
llama_model_loader: - kv 42: tokenizer.chat_template str = {%- set image_count = namespace(value...
llama_model_loader: - kv 43: general.quantization_version u32 = 2
llama_model_loader: - kv 44: general.file_type u32 = 17
llama_model_loader: - kv 45: quantize.imatrix.file str = Qwen3.5-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 46: quantize.imatrix.dataset str = unsloth_calibration_Qwen3.5-27B.txt
llama_model_loader: - kv 47: quantize.imatrix.entries_count u32 = 496
llama_model_loader: - kv 48: quantize.imatrix.chunks_count u32 = 80
llama_model_loader: - type f32: 353 tensors
llama_model_loader: - type f16: 96 tensors
llama_model_loader: - type q8_0: 48 tensors
llama_model_loader: - type q4_K: 16 tensors
llama_model_loader: - type q5_K: 181 tensors
llama_model_loader: - type q6_K: 157 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 18.78 GiB (6.00 BPW)
load: 0 unused tokens
load: printing all EOG tokens:
load: - 248044 ('<|endoftext|>')
load: - 248046 ('<|im_end|>')
load: - 248063 ('<|fim_pad|>')
load: - 248064 ('<|repo_name|>')
load: - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch = qwen35
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 5120
print_info: n_embd_inp = 5120
print_info: n_layer = 64
print_info: n_head = 24
print_info: n_head_kv = 4
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 17408
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 40
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: mrope sections = [11, 11, 10, 0]
print_info: ssm_d_conv = 4
print_info: ssm_d_inner = 6144
print_info: ssm_d_state = 128
print_info: ssm_dt_rank = 48
print_info: ssm_n_group = 16
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 27B
print_info: model params = 26.90 B
print_info: general.name = Qwen3.5-27B
print_info: vocab type = BPE
print_info: n_vocab = 248320
print_info: n_merges = 247587
print_info: BOS token = 11 ','
print_info: EOS token = 248046 '<|im_end|>'
print_info: EOT token = 248046 '<|im_end|>'
print_info: PAD token = 248055 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 248060 '<|fim_prefix|>'
print_info: FIM SUF token = 248062 '<|fim_suffix|>'
print_info: FIM MID token = 248061 '<|fim_middle|>'
print_info: FIM PAD token = 248063 '<|fim_pad|>'
print_info: FIM REP token = 248064 '<|repo_name|>'
print_info: FIM SEP token = 248065 '<|file_sep|>'
print_info: EOG token = 248044 '<|endoftext|>'
print_info: EOG token = 248046 '<|im_end|>'
print_info: EOG token = 248063 '<|fim_pad|>'
print_info: EOG token = 248064 '<|repo_name|>'
print_info: EOG token = 248065 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 63 repeating layers to GPU
load_tensors: offloaded 65/65 layers to GPU
load_tensors: CPU model buffer size = 833.59 MiB
load_tensors: CUDA0 model buffer size = 18392.73 MiB
.............................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
common_init_result: TurboQuant head_dim signals — key=256 val=256 computed=213 mla_k=0 mla_v=0 swa_k=0
common_init_result: [P1✓ P5✗] key_length=256 but n_embd/n_head=213 — using P1
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 131072
llama_context: n_ctx_seq = 131072
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (131072) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 3.79 MiB
llama_kv_cache: CUDA0 KV buffer size = 2096.00 MiB
llama_kv_cache: size = 2096.00 MiB (131072 cells, 16 layers, 4/1 seqs), K (tbqp4_0): 1056.00 MiB, V (tbq4_0): 1040.00 MiB
llama_memory_recurrent: CUDA0 RS buffer size = 598.50 MiB
llama_memory_recurrent: size = 598.50 MiB ( 4 cells, 64 layers, 4 seqs), R (f32): 22.50 MiB, S (f32): 576.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve: CUDA0 compute buffer size = 495.00 MiB
sched_reserve: CUDA_Host compute buffer size = 276.02 MiB
sched_reserve: graph nodes = 3657
sched_reserve: graph splits = 2
sched_reserve: reserve took 24.26 ms, sched copies = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
clip_model_loader: model name: Qwen3.5-27B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 334
clip_model_loader: n_kv: 28

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: Qwen-VL models require at minimum 1024 image tokens to function correctly on grounding tasks
load_hparams: if you encounter problems with accuracy, try adding --image-min-tokens 1024
load_hparams: more info: https://github.com/ggml-org/llama.cpp/issues/16842

load_hparams: projector: qwen3vl_merger
load_hparams: n_embd: 1152
load_hparams: n_head: 16
load_hparams: n_ff: 4304
load_hparams: n_layer: 27
load_hparams: ffn_op: gelu
load_hparams: projection_dim: 5120

--- vision hparams ---
load_hparams: image_size: 768
load_hparams: patch_size: 16
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: n_merge: 2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels: 8192
load_hparams: image_max_pixels: 4194304

load_hparams: model size: 884.62 MiB
load_hparams: metadata size: 0.12 MiB
warmup: warmup with image size = 1472 x 1472
alloc_compute_meta: CUDA0 compute buffer size = 248.10 MiB
alloc_compute_meta: CPU compute buffer size = 24.93 MiB
alloc_compute_meta: graph splits = 1, nodes = 823
warmup: flash attention is enabled
srv load_model: loaded multimodal model, 'C:\Users\johne\Desktop\Models\Qwen3.5-27B-mmproj-F16.gguf'
srv load_model: initializing slots, n_slots = 4
common_speculative_is_compat: the target context does not support partial sequence removal
srv load_model: speculative decoding not supported by this context
slot load_model: id 0 | task -1 | new slot, n_ctx = 131072
slot load_model: id 1 | task -1 | new slot, n_ctx = 131072
slot load_model: id 2 | task -1 | new slot, n_ctx = 131072
slot load_model: id 3 | task -1 | new slot, n_ctx = 131072
srv load_model: prompt cache is enabled, size limit: 8192 MiB
srv load_model: use --cache-ram 0 to disable the prompt cache
srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

'
srv init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 5437
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.376678
slot update_slots: id 3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.753357
slot update_slots: id 3 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4921, batch.n_tokens = 825, progress = 0.905095
slot update_slots: id 3 | task 0 | n_tokens = 4921, memory_seq_rm [4921, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 5433, batch.n_tokens = 512, progress = 0.999264
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 32 (pos_min = 4920, pos_max = 4920, n_tokens = 4921, size = 149.626 MiB)
slot update_slots: id 3 | task 0 | n_tokens = 5433, memory_seq_rm [5433, end)
slot init_sampler: id 3 | task 0 | init sampler, took 0.64 ms, tokens: text = 5437, total = 5437
slot update_slots: id 3 | task 0 | prompt processing done, n_tokens = 5437, batch.n_tokens = 4
slot update_slots: id 3 | task 0 | created context checkpoint 2 of 32 (pos_min = 5432, pos_max = 5432, n_tokens = 5433, size = 149.626 MiB)
srv stop: cancel task, id_task = 0
←[0mslot release: id 3 | task 0 | stop processing: n_tokens = 5506, truncated = 0
srv update_slots: all slots are idle
srv params_from_: Chat format: peg-native
slot get_availabl: id 3 | task -1 | selected slot by LCP similarity, sim_best = 0.231 (> 0.100 thold), f_keep = 0.001
srv get_availabl: updating prompt cache
srv prompt_save: - saving prompt with length 5506, total state size = 237.779 MiB
srv load: - looking for better prompt, base f_keep = 0.001, sim = 0.231
srv update: - cache state: 1 prompts, 537.031 MiB (limits: 8192.000 MiB, 131072 tokens, 131072 est)
srv update: - prompt 0000029CA7A78BC0: 5506 tokens, checkpoints: 2, 537.031 MiB
srv get_availabl: prompt cache update took 74.25 ms
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 76 | processing task, is_child = 0
slot update_slots: id 3 | task 76 | new prompt, n_ctx_slot = 131072, n_keep = 0, task.n_tokens = 13
slot update_slots: id 3 | task 76 | n_past = 3, slot.prompt.tokens.size() = 5506, seq_id = 3, pos_min = 5505, n_swa = 0
←[0mslot update_slots: id 3 | task 76 | Checking checkpoint with [5432, 5432] against 3...
slot update_slots: id 3 | task 76 | Checking checkpoint with [4920, 4920] against 3...
slot update_slots: id 3 | task 76 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id 3 | task 76 | erased invalidated context checkpoint (pos_min = 4920, pos_max = 4920, n_tokens = 4921, n_swa = 0, pos_next = 0, size = 149.626 MiB)
slot update_slots: id 3 | task 76 | erased invalidated context checkpoint (pos_min = 5432, pos_max = 5432, n_tokens = 5433, n_swa = 0, pos_next = 0, size = 149.626 MiB)
slot update_slots: id 3 | task 76 | n_tokens = 0, memory_seq_rm [0, end)
srv log_server_r: done request: POST /v1/chat/completions 127.0.0.1 200
slot update_slots: id 3 | task 76 | prompt processing progress, n_tokens = 9, batch.n_tokens = 9, progress = 0.692308
slot update_slots: id 3 | task 76 | n_tokens = 9, memory_seq_rm [9, end)
slot init_sampler: id 3 | task 76 | init sampler, took 0.01 ms, tokens: text = 13, total = 13
slot update_slots: id 3 | task 76 | prompt processing done, n_tokens = 13, batch.n_tokens = 4
srv stop: cancel task, id_task = 76
slot release: id 3 | task 76 | stop processing: n_tokens = 72, truncated = 0
srv update_slots: all slots are idle
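
Possibly relevant: the "TurboQuant head_dim signals" line above reports computed=213 while key_length=256. Re-doing that arithmetic with the values from print_info (just a sanity check at an interactive cmd prompt, not a fix):

rem values from print_info above: n_embd = 5120, n_head = 24
set /a 5120 / 24
rem prints 213 (integer division), while qwen35.attention.key_length in the GGUF is 256

That mismatch is what makes the loader take the [P1✓ P5✗] "using P1" path on this model.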
