Can custom grammars compose with tool calls? #22408
Replies: 3 comments 4 replies
---
To your questions:

1. **Is the restriction intentional?** It's a soft guard protecting a real architectural constraint. The sampler has one grammar slot. When tools are enabled, the tool-call grammar occupies that slot, so a user grammar has nowhere to go.
2. **Is there a composition model?** Not currently, but the architecture supports one. The lazy tool grammar is completely dormant during the pre-trigger phase, so a user grammar could in principle constrain that phase and hand off once the tool-call trigger fires.
3. **Should clients disable grammar when tools are present?** For now, yes. Unless llama-server adds pre-trigger grammar support, the two can't coexist on the same request.

**Working patch**

TL;DR: llama-cpp-pre-trigger-grammar-78433f6.patch

I hit the same wall building a tool-calling companion AI. Without grammar constraints, Qwen3.6 spirals in reasoning: 3K-25K tokens brainstorming variants before producing output. The patch adds a pre-trigger grammar path. Server-side, when grammar and tools are both present, the user grammar is applied during the pre-trigger phase:

```python
response = client.chat.completions.create(
    model="qwen3.6",
    messages=[...],
    tools=[...],
    extra_body={
        "grammar": 'root ::= "PLAN: " line "\\nSEND\\n"\nline ::= [^\\n]+'
    }
)
```

**Results**

Tested with Qwen3.6-35B-A3B on macOS Metal: 280+ test cases, tool calls fire correctly mid-conversation, grammar + tools coexist without error. Structurally 99%+ success rate.

Caveat: we reverted in production. The compressed reasoning (60 tokens vs 500+ free-form) increased hallucination: the model commits faster but with lower accuracy. The structural metrics looked great but conversation quality degraded. The core tension: constrained reasoning prevents spiraling but gives the model less room to self-correct. We haven't found the right grammar design to balance this yet.
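The GBNF above constrains the whole completion to a single `PLAN:` line followed by `SEND`. For a client-side sanity check (this helper is illustrative, not part of the patch), an equivalent regex can verify that a response actually has the shape the grammar enforces:

```python
import re

# Regex equivalent of:  root ::= "PLAN: " line "\nSEND\n"   line ::= [^\n]+
# Used only to spot-check responses client-side; the server-side grammar
# is what actually constrains sampling.
PLAN_SHAPE = re.compile(r"PLAN: [^\n]+\nSEND\n")

def matches_plan_grammar(text: str) -> bool:
    """Return True if `text` is exactly one PLAN line followed by SEND."""
    return PLAN_SHAPE.fullmatch(text) is not None

print(matches_plan_grammar("PLAN: check the weather in Paris\nSEND\n"))  # True
print(matches_plan_grammar("Let me think...\nPLAN: x\nSEND\n"))          # False
```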
---
> llama-server currently rejects requests that include both OpenAI-style tools and a custom grammar.

This is a real pain point. We hit this exact issue when trying to constrain agent responses to structured output formats while still allowing tool calls for dynamic information retrieval.

The use case: we wanted an agent to always respond in a specific JSON schema (for downstream processing), but also call tools like web search and API lookup when needed. The grammar constraint was perfect for the response format, but the moment you add tools, it is HTTP 400 city.

Our workaround was ugly but functional:

1. First pass: send with tools only, let the model decide if it needs tools
2. If it returns tool_calls: execute them, then do a second pass
3. If it returns text: re-prompt with grammar constraint only

Basically a two-phase approach that doubles your latency and token usage. Not ideal.

The composition model you described (grammar for normal text, internal tool-call constraint for tools) makes total sense architecturally. It is essentially what the structured output + function calling combo does in the OpenAI API: they handle it server-side by treating tool calls as a separate output channel.

For llama.cpp, it might be worth looking at how the tool-call parsing interacts with the grammar sampler. The issue is probably in the sampler: when a grammar is active, the grammar sampler constrains the logits, but the tool-call logic has its own token constraints. They are fighting over the same logits space.

A potential implementation path: make the grammar sampler tool-call-aware. When the model generates a tool_call token sequence, temporarily suspend grammar enforcement and let the tool-call sampler take over. Resume grammar enforcement after the tool call is complete.

Would love to see this on the roadmap. Structured reasoning traces + tool calling is a killer combo for agent reliability.

More on our agent tool-calling adventures: https://miaoquai.com/stories/ai-agent-ops-nightmare.html
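The two-phase workaround above can be sketched roughly like this. It is a sketch only: `client` is assumed to be an OpenAI-compatible client pointed at llama-server, and `run_tool` is whatever dispatcher executes your tools.

```python
# Sketch of the two-phase workaround: pass 1 with tools only, pass 2 with
# grammar only (llama-server rejects the combination in one request).
def two_phase_chat(client, model, messages, tools, answer_grammar, run_tool):
    # Pass 1: tools enabled, no grammar.
    first = client.chat.completions.create(
        model=model, messages=messages, tools=tools)
    choice = first.choices[0]
    if choice.finish_reason == "tool_calls":
        # Execute each tool call, append the results, then fall through
        # to the grammar-constrained pass.
        messages = messages + [choice.message]
        for call in choice.message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_tool(call.function.name,
                                    call.function.arguments),
            })
    # Pass 2: grammar only, no tools -> structured final answer.
    final = client.chat.completions.create(
        model=model, messages=messages,
        extra_body={"grammar": answer_grammar})
    return final.choices[0].message.content
```

As noted above, this roughly doubles latency and token usage, since the final answer is always produced by a second request.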
---
Yes, this is intentional. Even if composition were allowed, a user-defined grammar would need to be written very carefully to avoid matching content intended for the tool call grammar. Honestly, I don't think the complexity is worth it. It would probably just lead to more issues from user error.
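To illustrate the footgun: a permissive user grammar happily matches the text that opens a tool call, so the two constraints become ambiguous at the handoff point. A regex stand-in for such a grammar (the `<tool_call>` tag is just one common chat template's trigger, used here purely for illustration):

```python
import re

# Stand-in for a permissive GBNF like:  root ::= [^\n]+ "\n"
# It accepts any single line -- including the line that opens a tool call.
PERMISSIVE = re.compile(r"[^\n]+\n")

tool_call_text = '<tool_call>{"name": "get_weather"}\n'

# The "user" grammar matches the tool-call opening just as readily as a
# normal answer, which is the ambiguity a careless grammar introduces.
print(bool(PERMISSIVE.fullmatch(tool_call_text)))       # True
print(bool(PERMISSIVE.fullmatch("A normal answer\n")))  # True
```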
-
`llama-server` currently rejects requests that include both OpenAI-style `tools` and a custom `grammar`. Is this an intentional long-term limitation, or is there a recommended way to compose tool calling with user-supplied GBNF grammars?
Some context:
I’m experimenting with grammar-constrained reasoning traces: constrain the model’s normal answer format with GBNF, but still allow tool calls when tools are available.
Sharing the script I used:
https://github.com/andthattoo/structured-cot/blob/tool-grammar-repro/llama_tool_grammar_repro.py
Native `llama-server` built from current `llama.cpp` with CUDA. Model is Qwen3.6-27B-GGUF.
**Tools only work:**

```shell
python3 llama_tool_grammar_repro.py \
  --case tools_only \
  --max-tokens 1024
```

```
verdict: tool_call parsed
finish_reason: tool_calls
tool_calls: [{"type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"Paris\"}"}}]
```

But tools + grammar is rejected:

```shell
python3 llama_tool_grammar_repro.py \
  --case tools_plus_answer_grammar
```

```
HTTP 400: {"error":{"code":400,"message":"Cannot use custom grammar constraints with tools.","type":"invalid_request_error"}}
```

**Question**
Is the intended client behavior to disable the custom grammar whenever `tools` is present?
Or is there a planned composition model where custom grammar constrains normal assistant text, while llama.cpp’s internal tool-call constraint handles tool calls?
I’m asking so client libraries can handle this correctly instead of guessing whether to retry without grammar.
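In the meantime, one way for a client library to handle this deterministically rather than guessing is to catch the specific 400 and retry without the grammar. A minimal sketch, assuming an OpenAI-compatible client; the error-message check matches the 400 shown above and may need adjusting across llama.cpp versions:

```python
# Sketch: retry without `grammar` when llama-server rejects the
# tools + grammar combination with the 400 shown above.
def chat_with_optional_grammar(client, **kwargs):
    try:
        return client.chat.completions.create(**kwargs)
    except Exception as err:  # openai.BadRequestError in practice
        rejected = "Cannot use custom grammar constraints with tools" in str(err)
        if rejected and "grammar" in kwargs.get("extra_body", {}):
            # Drop only the grammar key and retry; keep any other
            # extra_body settings intact.
            retry = dict(kwargs)
            retry["extra_body"] = {k: v for k, v in kwargs["extra_body"].items()
                                   if k != "grammar"}
            return client.chat.completions.create(**retry)
        raise
```

The trade-off is that the fallback silently loses the grammar constraint, so callers should be able to tell (e.g. via logging) which path produced the response.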