Considering getting a 2nd DGX Spark but need to know how well Qwen 397B will work #22399
Closed
jerkstorecaller started this conversation in General
Replies: 2 comments 5 replies
-
I am getting ~14 t/s with two DGX Sparks using RPC and RDMA:
build: 0f1bb60 (8946)
[video: Qwen3.5-397B-A17B-UD-IQ4_XS-RDMA.mp4]
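For context, a two-node llama.cpp RPC setup is typically launched along these lines (a sketch, not the poster's exact commands: the hostname, port, and model path are placeholders, and it assumes llama.cpp was built with the RPC backend enabled, e.g. `-DGGML_RPC=ON`):

```shell
# On the remote Spark: expose its backend to the network via rpc-server
rpc-server --host 0.0.0.0 --port 50052

# On the local Spark: point llama-cli at the remote worker with --rpc;
# model layers are then split between the local device and the RPC backend
llama-cli -m Qwen3.5-397B-A17B-UD-IQ4_XS.gguf \
  --rpc 192.168.1.2:50052 \
  -ngl 99 -p "Hello"
```

Exact flag names can vary between llama.cpp versions, so check `rpc-server --help` on your build.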
-
Thank you! 13 tok/sec oughta be enough for everybody. Will things like speculative decoding with a draft model work in RPC mode, or nah? It gave me a big speed boost on Gemma 4 31B.
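Whether this works over RPC is exactly the open question here, but for reference, speculative decoding in llama.cpp is normally enabled by pairing the main model with a small draft model. A hedged sketch (draft model filename and the `--rpc` endpoint are placeholders; flag names per recent llama.cpp builds):

```shell
# -md/--model-draft selects the small draft model;
# --draft-max caps how many tokens are speculated per step
llama-server -m Qwen3.5-397B-A17B-UD-IQ4_XS.gguf \
  --rpc 192.168.1.2:50052 \
  -md qwen-draft-0.5b.gguf \
  --draft-max 16 -ngl 99
```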
-
I've been a single-device llama.cpp user for a long time and have never looked into RPC.
I'm considering buying a second DGX Spark. A DGX Spark has 130GB of unified memory, which leaves around 125GB for model weights after OS and application overhead.
Two Sparks can be clustered over a 200Gbps ConnectX-7 link. On vLLM, this allows a shared pool of memory to load a single large model, although from what I've read, vLLM's memory overhead is much larger than llama.cpp's, so it's not quite 125GB x 2.
Can someone tell me if Qwen 397B would work well in practice with llama.cpp's RPC?
Qwen 3.5 397B A17B is 190GB for the Unsloth IQ4_XS quant, which seems great, leaving plenty of room for KV cache. The other 4-bit quants are far too big; Q4_K_M is 244GB, for example, leaving nothing for KV cache.
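The memory budget above works out as follows, using the thread's own numbers (~125GB usable per node after overhead, and the quoted quant sizes):

```shell
# Back-of-envelope memory budget across two Sparks
usable_per_node_gb=125
pool_gb=$(( usable_per_node_gb * 2 ))           # 250 GB pooled
echo "IQ4_XS headroom: $(( pool_gb - 190 )) GB"  # 190 GB of weights
echo "Q4_K_M headroom: $(( pool_gb - 244 )) GB"  # 244 GB of weights
```

So IQ4_XS leaves roughly 60GB for KV cache and buffers, while Q4_K_M leaves only about 6GB, which is why only the IQ4_XS quant looks practical here.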