Supported Models

Text Models

Model	HuggingFace ID (example)	Feature Flag	Notes
LLaMA 3.x	`meta-llama/Llama-3.2-1B-Instruct`	`llama` (default)	Also covers SmolLM2 and DeepSeek-R1 distilled variants
SmolLM2	`evilsocket/SmolLM2-1.7B-Instruct`	`llama` (default)	LLaMA architecture, 135M-1.7B
Qwen2 / Qwen2.5	`evilsocket/Qwen2.5-Coder-1.5B-Instruct`	`qwen2` (default)
Qwen3 (dense)	`evilsocket/Qwen3-0.6B`	`qwen3` (default)	GQA + QK-norm, thinking mode via `/think`
Qwen3 MoE	`Qwen/Qwen3-30B-A3B`	`qwen3_moe` (default)	Sparse MoE FFN, 128 experts / top-8 per token
Qwen3.5	`evilsocket/Qwen3.5-0.8B`	`qwen3_5` (default)	Hybrid GDN linear + full attention
Qwen3.5 MoE	`Qwen/Qwen3.5-35B-A3B-GPTQ-Int4`	`qwen3_5_moe` (default)	Hybrid GDN+full-attn + sparse MoE FFN, 256 experts / top-8; GPTQ-Int4
Phi-4-mini	`evilsocket/Phi-4-mini-instruct`	`phi4` (default)	3.8B, partial RoPE, 200K vocab
Phi-4	`microsoft/phi-4`	`phi4` (default)	14B, same family as Phi-4-mini
Mistral	`evilsocket/Mistral-7B-Instruct-v0.3`	`mistral` (default)	Standard GQA, optional sliding window
Gemma 3	`google/gemma-3-1b-it`	`gemma3` (default)	Interleaved local/global attention, GELU-tanh MLP
Falcon3	`evilsocket/Falcon3-1B-Instruct`	`falcon3` (default)	Standard GQA, Apache 2.0
OLMo 2	`allenai/OLMo-2-1124-7B`	`olmo2` (default)	Post-norm, QK-norm, fully open weights+data
EXAONE 4.0	`evilsocket/EXAONE-4.0-1.2B`	`exaone4` (default)	3:1 local/global hybrid, QK-norm
DeepSeek-R1 (distilled)	`deepseek-ai/DeepSeek-R1-Distill-Llama-8B`	`llama` (default)	LLaMA or Qwen2.5 base

Image Models

Model	HuggingFace ID	Architecture	VRAM	Feature Flag
Stable Diffusion 1.5	(HF cache)	SD	~4 GB	-
Stable Diffusion 2.1	(HF cache)	SD	~5 GB	-
Stable Diffusion XL	(HF cache)	SD	~7 GB	-
SDXL Turbo	(HF cache)	SD	~7 GB	-
FLUX.2-klein-4B	`black-forest-labs/FLUX.2-klein-4B`	FLUX	~8 GB	`flux` (default)
FLUX.1-dev (FP8)	`evilsocket/flux1-dev`	FLUX	~12 GB	`flux` (default)

See Image Generation for usage.

Voice Models (TTS)

Model	HuggingFace ID	VRAM	Architecture	Feature Flag
LuxTTS	`evilsocket/luxtts` (original)	<1 GB	Zipformer + flow matching	`luxtts` (default)
VibeVoice-1.5B	`evilsocket/VibeVoice-1.5B`	~7 GB	Qwen2.5 LM + diffusion	`vibevoice` (default)
VibeVoice-Realtime-0.5B	`evilsocket/VibeVoice-Realtime-0.5B`	~3 GB	Qwen2.5 LM + diffusion	`vibevoice` (default)

See Voice Generation for usage.

Architecture Detection

Text model architecture is auto-detected from config.json in the model directory. You can also set it explicitly:

cake run /path/to/model --text-model-arch auto|llama|qwen2|qwen3|qwen3-moe|qwen3-5|qwen3-5moe|phi4|mistral|gemma3|falcon3|ol-mo2|exaone4
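
As a rough illustration of the auto-detection step, the sketch below reads the `model_type` field from a model directory's config.json and maps it to one of the `--text-model-arch` values. The `ARCH_BY_MODEL_TYPE` table and both helper functions are hypothetical and cover only a few families; the fields and rules cake actually inspects may differ.

```python
import json
from pathlib import Path

# Hypothetical lookup table: config.json "model_type" -> --text-model-arch value.
# Only a few families are listed; this is not cake's real detection table.
ARCH_BY_MODEL_TYPE = {
    "llama": "llama",
    "qwen2": "qwen2",
    "qwen3": "qwen3",
    "mistral": "mistral",
}


def arch_from_config(config):
    """Return an architecture name for a parsed config dict, or None if unknown."""
    return ARCH_BY_MODEL_TYPE.get(config.get("model_type"))


def detect_text_arch(model_dir):
    """Read <model_dir>/config.json and return the matching architecture name."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return arch_from_config(config)
```

Returning `None` for an unknown `model_type` leaves room for the caller to fall back to an explicit `--text-model-arch` value.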

Image and voice model types are selected with --model-type:

cake run evilsocket/flux1-dev --model-type image-model --image-model-arch sd|flux|flux1
cake run evilsocket/VibeVoice-1.5B --model-type audio-model  # VibeVoice (auto-detects 1.5B vs 0.5B)

LuxTTS is detected automatically from its config.json architecture string and uses the text model dispatch path (enabling distributed inference):

cake run evilsocket/luxtts --prompt "Hello world" --audio-output output.wav

Model Notes

LLaMA 3.x / SmolLM2 / DeepSeek-R1 Distilled

SmolLM2 uses the LLaMA architecture (model_type: "llama" in config.json) and loads automatically via the llama feature. DeepSeek-R1 distilled variants built on LLaMA 3 or Qwen2.5 also work with no extra code.

Qwen3 (dense)

Qwen3 dense models (0.6B-32B) extend the Qwen2.5 design with QK-norm on query and key projections. They support a dual-mode thinking toggle: prefix your prompt with /think or /no_think to enable or disable the reasoning chain.
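
The thinking toggle can be wrapped in a small helper. This is a minimal sketch: `with_thinking` is a hypothetical name, and the single space between the switch and the prompt text is an assumption.

```python
def with_thinking(prompt, enabled):
    """Prefix a prompt with the Qwen3 switch that turns the reasoning chain on or off."""
    switch = "/think" if enabled else "/no_think"
    # Assumption: one space separates the switch from the prompt text.
    return f"{switch} {prompt}"


print(with_thinking("Why is the sky blue?", enabled=False))
```

The resulting string is what would be passed as the prompt, for example through `--prompt`.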

Qwen3.5

Qwen3.5 is a hybrid linear/full attention model using Gated DeltaNet (GDN). It uses recurrent (linear) attention for 18 of 24 layers and standard softmax attention for the remaining 6, in a repeating 3:1 pattern. It is a "thinking" model: responses begin with a <think>...</think> reasoning block before the answer.
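
The 3:1 layout can be sketched as a per-layer schedule. The helper name is hypothetical, and placing the full-attention layer last in each group of four is an assumption; the text above only fixes the ratio and the totals.

```python
def gdn_layer_schedule(num_layers=24, group_size=4):
    """Return one label per layer: 'linear' (GDN) or 'full' (softmax attention)."""
    # Assumption: the full-attention layer is the last one in each group of four.
    return [
        "full" if (i + 1) % group_size == 0 else "linear"
        for i in range(num_layers)
    ]


schedule = gdn_layer_schedule()
print(schedule.count("linear"), schedule.count("full"))  # 18 6
```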

Phi-4-mini / Phi-4

Both models share the phi4 feature and are loaded via the same code path. They use pre-fused QKV and gate+up projections, a 200K-token vocabulary, and partial RoPE (partial_rotary_factor: 0.25).
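
To make partial RoPE concrete, the sketch below rotates only the leading fraction of a head vector and passes the remaining dimensions through unchanged. The half-split pairing, the `base` of 10000, and the function name are assumptions for illustration, not details taken from the implementation.

```python
import math


def partial_rope(vec, position, partial_rotary_factor=0.25, base=10000.0):
    """Apply rotary embedding to the first part of vec; leave the rest untouched."""
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    rotary_dim -= rotary_dim % 2  # rotations act on pairs of dimensions
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        # Assumed pairing: dimension i is rotated together with dimension i + half.
        theta = position * base ** (-2.0 * i / rotary_dim)
        a, b = vec[i], vec[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

With a factor of 0.25 and a hypothetical head size of 128, only the first 32 dimensions carry position information.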

Gemma 3

Gemma 3 uses an interleaved local/global attention pattern: every 6th layer is a global attention layer with full context, while the rest use sliding-window (local) attention. Both layer types apply RoPE, with a larger base frequency on the global layers than on the local ones. The MLP uses GELU-tanh activation (not SiLU), and embeddings are scaled by sqrt(hidden_size). All norms use GemmaRMSNorm (weights initialized to zero, forward = (1+weight) * norm(x)).
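
The Gemma-specific pieces named above are small enough to sketch directly. The function names, the `eps` value, and the 0-based indexing of the global layers are assumptions made for illustration.

```python
import math


def gemma_rms_norm(x, weight, eps=1e-6):
    """GemmaRMSNorm: (1 + weight) * x / rms(x). The eps value is an assumption."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [(1.0 + w) * v * inv_rms for v, w in zip(x, weight)]


def scale_embedding(embedding):
    """Scale a token embedding by sqrt(hidden_size)."""
    scale = math.sqrt(len(embedding))
    return [v * scale for v in embedding]


def is_global_layer(layer_idx, period=6):
    """Assumption: with 0-based indices, layers 5, 11, 17, ... are the global ones."""
    return (layer_idx + 1) % period == 0
```

Because the norm weights start at zero, a freshly initialized GemmaRMSNorm behaves like a plain RMS normalization.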

The Gemma 3 IT chat template has no separate system role — the system prompt is prepended to the first user turn.

Note: The 1B model does not benefit from a system prompt. Use --system-prompt "" for best results with Gemma 3 1B.
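
The folding of the system prompt into the first user turn can be sketched as a transformation over a message list. The message shape (`role` / `content` dictionaries), the helper name, and the blank-line separator are assumptions; the actual chat template may join the two texts differently.

```python
def fold_system_prompt(messages):
    """Merge a leading system message into the first user message."""
    if not messages or messages[0]["role"] != "system":
        return [dict(m) for m in messages]
    system_text = messages[0]["content"]
    rest = [dict(m) for m in messages[1:]]
    if not system_text:
        return rest  # an empty system prompt adds nothing
    for message in rest:
        if message["role"] == "user":
            # Assumption: a blank line separates the system text from the user text.
            message["content"] = f"{system_text}\n\n{message['content']}"
            break
    return rest
```

Passing an empty system prompt, as suggested for the 1B model, leaves the user turn unchanged.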

Mistral

Mistral models use standard GQA and optionally sliding-window attention (4096-token window on Mistral Small). They load via the mistral feature.
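
Sliding-window attention amounts to a causal mask that also drops keys too far in the past. A minimal sketch, assuming the window counts the current token; the helper name is hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query position i may attend to key position j."""
    # Assumption: the window includes the current token, so position i sees
    # keys i - window + 1 .. i and never looks ahead.
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


for row in sliding_window_mask(seq_len=5, window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

When `window` is at least `seq_len`, this reduces to an ordinary causal mask.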

Falcon3

Falcon3 models use standard GQA + SwiGLU, similar to LLaMA 3. They use ChatML-style tokenization. Released under Apache 2.0.
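
For readers unfamiliar with the two terms, the sketch below shows how grouped-query attention maps query heads onto shared key/value heads, and the element-wise gate used inside a SwiGLU feed-forward block. The function names and the head counts in the example are hypothetical, not Falcon3's actual configuration.

```python
import math


def kv_head_for(query_head, num_query_heads, num_kv_heads):
    """Index of the key/value head shared by the given query head."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


def swiglu_gate(gate, up):
    """Element-wise SiLU(gate) * up, the activation inside a SwiGLU FFN."""
    return [g / (1.0 + math.exp(-g)) * u for g, u in zip(gate, up)]


# Example with 8 query heads sharing 4 key/value heads (illustrative numbers).
print([kv_head_for(h, 8, 4) for h in range(8)])  # [0, 0, 1, 1, 2, 2, 3, 3]
```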

OLMo 2

OLMo 2 uses post-norm (RMSNorm applied to each sublayer's output before the residual add, rather than to its input) and QK-norm. It is fully open: weights, training data, and code are all public. The 7B model does not fit on a single 16 GB GPU, so it has to be split across a cluster.

EXAONE 4.0

EXAONE 4.0 uses a 3:1 local/global hybrid attention pattern, similar to Gemma 3's interleaving, in which the global layers use full context without RoPE. It includes QK-norm and is strong on multilingual and reasoning tasks.

Qwen3 MoE

Qwen3 MoE (30B-A3B and 235B-A22B) uses the same attention block as dense Qwen3 (GQA + QK-norm) but replaces the dense FFN with a Sparse Mixture-of-Experts layer. Each layer has 128 experts; the router selects the top 8 per token by applying a softmax over the expert logits, keeping the top-K, and renormalizing the kept weights. These models are ideal cluster targets: the 30B model activates only 3B parameters per token and the 235B model only 22B, yet both require multiple nodes to hold all expert weights.
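
The routing rule (softmax, then top-K, then renormalize) can be sketched in a few lines of plain Python. The function name is hypothetical, and the sketch handles a single token's router logits rather than a batched tensor.

```python
import math


def route_token(router_logits, top_k=8):
    """Return [(expert_index, weight), ...] for the experts chosen for one token."""
    # 1. Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Keep the top_k experts by probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # 3. Renormalize so the kept weights sum to 1.
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]
```

The layer output for the token is then the weighted sum of the selected experts' FFN outputs, so only 8 of the 128 experts run per token.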
