Considering getting a 2nd DGX Spark but need to know how well Qwen 397B will work #22399
Closed
jerkstorecaller started this conversation in General
Replies: 2 comments 5 replies
-
I am getting ~14 t/s with two DGX Sparks using RPC and RDMA:
build: 0f1bb60 (8946)
[video: Qwen3.5-397B-A17B-UD-IQ4_XS-RDMA.mp4]
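For context, a two-node llama.cpp RPC setup is typically launched along these lines (a sketch, not the poster's exact commands: the hostname, port, and model path are placeholders, and it assumes llama.cpp was built with the RPC backend enabled, e.g. `-DGGML_RPC=ON`):

```shell
# On the remote Spark: expose its backend to the network via rpc-server
rpc-server --host 0.0.0.0 --port 50052

# On the local Spark: point llama-cli at the remote worker with --rpc;
# model layers are then split between the local device and the RPC backend
llama-cli -m Qwen3.5-397B-A17B-UD-IQ4_XS.gguf \
  --rpc 192.168.1.2:50052 \
  -ngl 99 -p "Hello"
```

Exact flag names can vary between llama.cpp versions, so check `rpc-server --help` on your build.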
-
Thank you! 13 tok/sec oughta be enough for everybody. Will things like speculative decoding with a draft model work in RPC mode, or nah? It gave me a big speed boost on Gemma 4 31B.
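Whether this works over RPC is exactly the open question here, but for reference, speculative decoding in llama.cpp is normally enabled by pairing the main model with a small draft model. A hedged sketch (draft model filename and the `--rpc` endpoint are placeholders; flag names per recent llama.cpp builds):

```shell
# -md/--model-draft selects the small draft model;
# --draft-max caps how many tokens are speculated per step
llama-server -m Qwen3.5-397B-A17B-UD-IQ4_XS.gguf \
  --rpc 192.168.1.2:50052 \
  -md qwen-draft-0.5b.gguf \
  --draft-max 16 -ngl 99
```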
-
I've been a single-device llama.cpp user for a long time and have never looked into RPC.
I'm considering buying a second DGX Spark. A DGX Spark has 130GB of unified memory, which leaves around 125GB for model weights after OS and application overhead.
Two Sparks can be clustered over a 200Gbps ConnectX-7 link. On vLLM, this allows a shared pool of memory to load a single large model, although from what I've read, vLLM's memory overhead is much larger than llama.cpp's, so it's not quite 125GB x 2.
Can someone tell me if Qwen 397B would work well in practice with llama.cpp's RPC?
Qwen 3.5 397B A17B is 190GB for the Unsloth IQ4_XS quant, which seems great, leaving plenty of room for KV cache. The other 4-bit quants are far too big; Q4_K_M is 244GB, for example, leaving nothing for KV cache.
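The memory budget above works out as follows, using the thread's own numbers (~125GB usable per node after overhead, and the quoted quant sizes):

```shell
# Back-of-envelope memory budget across two Sparks
usable_per_node_gb=125
pool_gb=$(( usable_per_node_gb * 2 ))           # 250 GB pooled
echo "IQ4_XS headroom: $(( pool_gb - 190 )) GB"  # 190 GB of weights
echo "Q4_K_M headroom: $(( pool_gb - 244 )) GB"  # 244 GB of weights
```

So IQ4_XS leaves roughly 60GB for KV cache and buffers, while Q4_K_M leaves only about 6GB, which is why only the IQ4_XS quant looks practical here.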