| Model | HuggingFace ID (example) | Feature Flag | Notes |
|---|---|---|---|
| LLaMA 3.x | meta-llama/Llama-3.2-1B-Instruct |
llama (default) |
Also covers SmolLM2 and DeepSeek-R1 distilled variants |
| SmolLM2 | evilsocket/SmolLM2-1.7B-Instruct |
llama (default) |
LLaMA architecture, 135M-1.7B |
| Qwen2 / Qwen2.5 | evilsocket/Qwen2.5-Coder-1.5B-Instruct |
qwen2 (default) |
|
| Qwen3 (dense) | evilsocket/Qwen3-0.6B |
qwen3 (default) |
GQA + QK-norm, thinking mode via /think |
| Qwen3 MoE | Qwen/Qwen3-30B-A3B |
qwen3_moe (default) |
Sparse MoE FFN, 128 experts / top-8 per token |
| Qwen3.5 | evilsocket/Qwen3.5-0.8B |
qwen3_5 (default) |
Hybrid GDN linear + full attention |
| Qwen3.5 MoE | Qwen/Qwen3.5-35B-A3B-GPTQ-Int4 |
qwen3_5_moe (default) |
Hybrid GDN+full-attn + sparse MoE FFN, 256 experts / top-8; GPTQ-Int4 |
| Phi-4-mini | evilsocket/Phi-4-mini-instruct |
phi4 (default) |
3.8B, partial RoPE, 200K vocab |
| Phi-4 | microsoft/phi-4 |
phi4 (default) |
14B, same family as Phi-4-mini |
| Mistral | evilsocket/Mistral-7B-Instruct-v0.3 |
mistral (default) |
Standard GQA, optional sliding window |
| Gemma 3 | google/gemma-3-1b-it |
gemma3 (default) |
Interleaved local/global attention, GELU-tanh MLP |
| Falcon3 | evilsocket/Falcon3-1B-Instruct |
falcon3 (default) |
Standard GQA, Apache 2.0 |
| OLMo 2 | allenai/OLMo-2-1124-7B |
olmo2 (default) |
Post-norm, QK-norm, fully open weights+data |
| EXAONE 4.0 | evilsocket/EXAONE-4.0-1.2B |
exaone4 (default) |
3:1 local/global hybrid, QK-norm |
| DeepSeek-R1 (distilled) | deepseek-ai/DeepSeek-R1-Distill-Llama-8B |
llama (default) |
LLaMA or Qwen2.5 base |
| Model | HuggingFace ID | Architecture | VRAM | Feature Flag |
|---|---|---|---|---|
| Stable Diffusion 1.5 | (HF cache) | SD | ~4 GB | - |
| Stable Diffusion 2.1 | (HF cache) | SD | ~5 GB | - |
| Stable Diffusion XL | (HF cache) | SD | ~7 GB | - |
| SDXL Turbo | (HF cache) | SD | ~7 GB | - |
| FLUX.2-klein-4B | black-forest-labs/FLUX.2-klein-4B |
FLUX | ~8 GB | flux (default) |
| FLUX.1-dev (FP8) | evilsocket/flux1-dev |
FLUX | ~12 GB | flux (default) |
See Image Generation for usage.
| Model | HuggingFace ID | VRAM | Architecture | Feature Flag |
|---|---|---|---|---|
| LuxTTS | evilsocket/luxtts (original) |
<1 GB | Zipformer + flow matching | luxtts (default) |
| VibeVoice-1.5B | evilsocket/VibeVoice-1.5B |
~7 GB | Qwen2.5 LM + diffusion | vibevoice (default) |
| VibeVoice-Realtime-0.5B | evilsocket/VibeVoice-Realtime-0.5B |
~3 GB | Qwen2.5 LM + diffusion | vibevoice (default) |
See Voice Generation for usage.
Text model architecture is auto-detected from config.json in the model directory. You can also set it explicitly:
cake run /path/to/model --text-model-arch auto|llama|qwen2|qwen3|qwen3-moe|qwen3-5|qwen3-5moe|phi4|mistral|gemma3|falcon3|ol-mo2|exaone4Image and voice model types are selected with --model-type:
cake run evilsocket/flux1-dev --model-type image-model --image-model-arch sd|flux|flux1
cake run evilsocket/VibeVoice-1.5B --model-type audio-model # VibeVoice (auto-detects 1.5B vs 0.5B)LuxTTS is detected automatically from its config.json architecture string and uses the text model dispatch path (enabling distributed inference):
cake run evilsocket/luxtts --prompt "Hello world" --audio-output output.wavSmolLM2 uses the LLaMA architecture (model_type: "llama" in config.json) and loads automatically via the llama feature. DeepSeek-R1 distilled variants built on LLaMA 3 or Qwen2.5 also work with no extra code.
Qwen3 dense models (0.6B-32B) extend the Qwen2.5 design with QK-norm on query and key projections. They support a dual-mode thinking toggle: prefix your prompt with /think or /no_think to enable or disable the reasoning chain.
Qwen3.5 is a hybrid linear/full attention model using Gated DeltaNet (GDN). It uses recurrent (linear) attention for 18 of 24 layers and standard softmax attention for the remaining 6, in a repeating 3:1 pattern. It is a "thinking" model — responses begin with a <think>...</think> reasoning block before the answer.
Both models share the phi4 feature and are loaded via the same code path. They use pre-fused QKV and gate+up projections, a 200K-token vocabulary, and partial RoPE (partial_rotary_factor: 0.25).
Gemma 3 uses an interleaved local/global attention pattern: every 6th layer is a global attention layer with full context and RoPE, while the rest use sliding-window (local) attention without RoPE. The MLP uses GELU-tanh activation (not SiLU), and embeddings are scaled by sqrt(hidden_size). All norms use GemmaRMSNorm (weights initialized to zero, forward = (1+weight) * norm(x)).
The Gemma 3 IT chat template has no separate system role — the system prompt is prepended to the first user turn.
Note: The 1B model does not benefit from a system prompt. Use --system-prompt "" for best results with Gemma 3 1B.
Mistral models use standard GQA and optionally sliding-window attention (4096-token window on Mistral Small). They load via the mistral feature.
Falcon3 models use standard GQA + SwiGLU, similar to LLaMA 3. They use ChatML-style tokenization. Released under Apache 2.0.
OLMo 2 uses post-norm (RMSNorm applied after the residual add, not before) and QK-norm. It is fully open: weights, training data, and code are all public. The 7B model requires a cluster to run (does not fit on a single 16 GB GPU).
EXAONE 4.0 uses a 3:1 local/global hybrid attention pattern where global layers use full context without RoPE, similar to Gemma 3. It includes QK-norm and is strong on multilingual and reasoning tasks.
Qwen3 MoE (30B-A3B and 235B-A22B) uses the same attention block as dense Qwen3 (GQA + QK-norm) but replaces the dense FFN with a Sparse Mixture-of-Experts layer. Each layer has 128 experts; the router selects the top-8 per token using softmax -> top-K -> renormalize. These models are ideal cluster targets: the 30B model activates only 3B parameters per token, while the 235B model activates 22B — both require multiple nodes to hold all expert weights.