Can custom grammars compose with tool calls? #22408
Replies: 3 comments 4 replies
---
To your questions:

1. **Is the restriction intentional?** It's a soft guard protecting a real architectural constraint. The sampler has one grammar slot. When tools are enabled, the tool-call grammar occupies that slot, so a user grammar has nowhere to go.
2. **Is there a composition model?** Not currently, but the architecture supports one. The lazy tool grammar is completely dormant during the pre-trigger phase, so a user grammar could in principle constrain that phase and hand off once the tool-call trigger fires.
3. **Should clients disable grammar when tools are present?** For now, yes. Unless llama-server adds pre-trigger grammar support, the two can't coexist on the same request.

**Working patch**

TL;DR: llama-cpp-pre-trigger-grammar-78433f6.patch

I hit the same wall building a tool-calling companion AI. Without grammar constraints, Qwen3.6 spirals in reasoning: 3K-25K tokens brainstorming variants before producing output. The patch adds a pre-trigger grammar path. Server-side, when grammar and tools are both present, the user grammar is applied during the pre-trigger phase:

```python
response = client.chat.completions.create(
    model="qwen3.6",
    messages=[...],
    tools=[...],
    extra_body={
        "grammar": 'root ::= "PLAN: " line "\\nSEND\\n"\nline ::= [^\\n]+'
    }
)
```

**Results**

Tested with Qwen3.6-35B-A3B on macOS Metal: 280+ test cases, tool calls fire correctly mid-conversation, grammar + tools coexist without error. Structurally 99%+ success rate.

Caveat: we reverted in production. The compressed reasoning (60 tokens vs 500+ free-form) increased hallucination: the model commits faster but with lower accuracy. The structural metrics looked great but conversation quality degraded. The core tension: constrained reasoning prevents spiraling but gives the model less room to self-correct. We haven't found the right grammar design to balance this yet.
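The GBNF above constrains the whole completion to a single `PLAN:` line followed by `SEND`. For a client-side sanity check (this helper is illustrative, not part of the patch), an equivalent regex can verify that a response actually has the shape the grammar enforces:

```python
import re

# Regex equivalent of:  root ::= "PLAN: " line "\nSEND\n"   line ::= [^\n]+
# Used only to spot-check responses client-side; the server-side grammar
# is what actually constrains sampling.
PLAN_SHAPE = re.compile(r"PLAN: [^\n]+\nSEND\n")

def matches_plan_grammar(text: str) -> bool:
    """Return True if `text` is exactly one PLAN line followed by SEND."""
    return PLAN_SHAPE.fullmatch(text) is not None

print(matches_plan_grammar("PLAN: check the weather in Paris\nSEND\n"))  # True
print(matches_plan_grammar("Let me think...\nPLAN: x\nSEND\n"))          # False
```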
---
> llama-server currently rejects requests that include both OpenAI-style tools and a custom grammar.

This is a real pain point. We hit this exact issue when trying to constrain agent responses to structured output formats while still allowing tool calls for dynamic information retrieval.

The use case: we wanted an agent to always respond in a specific JSON schema (for downstream processing), but also call tools like web search and API lookup when needed. The grammar constraint was perfect for the response format, but the moment you add tools, it is HTTP 400 city.

Our workaround was ugly but functional:

1. First pass: send with tools only, let the model decide if it needs tools
2. If it returns tool_calls: execute them, then do a second pass
3. If it returns text: re-prompt with grammar constraint only

Basically a two-phase approach that doubles your latency and token usage. Not ideal.

The composition model you described (grammar for normal text, internal tool-call constraint for tools) makes total sense architecturally. It is essentially what the structured output + function calling combo does in the OpenAI API: they handle it server-side by treating tool calls as a separate output channel.

For llama.cpp, it might be worth looking at how the tool-call parsing interacts with the grammar sampler. The issue is probably in the sampler: when a grammar is active, the grammar sampler constrains the logits, but the tool-call logic has its own token constraints. They are fighting over the same logits space.

A potential implementation path: make the grammar sampler tool-call-aware. When the model generates a tool_call token sequence, temporarily suspend grammar enforcement and let the tool-call sampler take over. Resume grammar enforcement after the tool call is complete.

Would love to see this on the roadmap. Structured reasoning traces + tool calling is a killer combo for agent reliability.

More on our agent tool-calling adventures: https://miaoquai.com/stories/ai-agent-ops-nightmare.html
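The two-phase workaround above can be sketched roughly like this. It is a sketch only: `client` is assumed to be an OpenAI-compatible client pointed at llama-server, and `run_tool` is whatever dispatcher executes your tools.

```python
# Sketch of the two-phase workaround: pass 1 with tools only, pass 2 with
# grammar only (llama-server rejects the combination in one request).
def two_phase_chat(client, model, messages, tools, answer_grammar, run_tool):
    # Pass 1: tools enabled, no grammar.
    first = client.chat.completions.create(
        model=model, messages=messages, tools=tools)
    choice = first.choices[0]
    if choice.finish_reason == "tool_calls":
        # Execute each tool call, append the results, then fall through
        # to the grammar-constrained pass.
        messages = messages + [choice.message]
        for call in choice.message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": run_tool(call.function.name,
                                    call.function.arguments),
            })
    # Pass 2: grammar only, no tools -> structured final answer.
    final = client.chat.completions.create(
        model=model, messages=messages,
        extra_body={"grammar": answer_grammar})
    return final.choices[0].message.content
```

As noted above, this roughly doubles latency and token usage, since the final answer is always produced by a second request.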
---
Yes, this is intentional. Even if composition were allowed, a user-defined grammar would need to be written very carefully to avoid matching content intended for the tool call grammar. Honestly, I don't think the complexity is worth it. It would probably just lead to more issues from user error.
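To illustrate the footgun: a permissive user grammar happily matches the text that opens a tool call, so the two constraints become ambiguous at the handoff point. A regex stand-in for such a grammar (the `<tool_call>` tag is just one common chat template's trigger, used here purely for illustration):

```python
import re

# Stand-in for a permissive GBNF like:  root ::= [^\n]+ "\n"
# It accepts any single line -- including the line that opens a tool call.
PERMISSIVE = re.compile(r"[^\n]+\n")

tool_call_text = '<tool_call>{"name": "get_weather"}\n'

# The "user" grammar matches the tool-call opening just as readily as a
# normal answer, which is the ambiguity a careless grammar introduces.
print(bool(PERMISSIVE.fullmatch(tool_call_text)))       # True
print(bool(PERMISSIVE.fullmatch("A normal answer\n")))  # True
```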
-
`llama-server` currently rejects requests that include both OpenAI-style `tools` and a custom `grammar`. Is this an intentional long-term limitation, or is there a recommended way to compose tool calling with user-supplied GBNF grammars?
Some context:
I’m experimenting with grammar-constrained reasoning traces: constrain the model’s normal answer format with GBNF, but still allow tool calls when tools are available.
Sharing the script I used:
https://github.com/andthattoo/structured-cot/blob/tool-grammar-repro/llama_tool_grammar_repro.py
Native `llama-server` built from current `llama.cpp` with CUDA. Model is Qwen3.6-27B-GGUF.
**Tools only work:**

```shell
python3 llama_tool_grammar_repro.py \
  --case tools_only \
  --max-tokens 1024
```

```
verdict: tool_call parsed
finish_reason: tool_calls
tool_calls: [{"type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"Paris\"}"}}]
```

But tools + grammar is rejected:

```shell
python3 llama_tool_grammar_repro.py \
  --case tools_plus_answer_grammar
```

```
HTTP 400: {"error":{"code":400,"message":"Cannot use custom grammar constraints with tools.","type":"invalid_request_error"}}
```

**Question**
Is the intended client behavior to disable the custom grammar whenever `tools` is present?
Or is there a planned composition model where custom grammar constrains normal assistant text, while llama.cpp’s internal tool-call constraint handles tool calls?
I’m asking so client libraries can handle this correctly instead of guessing whether to retry without grammar.
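In the meantime, one way for a client library to handle this deterministically rather than guessing is to catch the specific 400 and retry without the grammar. A minimal sketch, assuming an OpenAI-compatible client; the error-message check matches the 400 shown above and may need adjusting across llama.cpp versions:

```python
# Sketch: retry without `grammar` when llama-server rejects the
# tools + grammar combination with the 400 shown above.
def chat_with_optional_grammar(client, **kwargs):
    try:
        return client.chat.completions.create(**kwargs)
    except Exception as err:  # openai.BadRequestError in practice
        rejected = "Cannot use custom grammar constraints with tools" in str(err)
        if rejected and "grammar" in kwargs.get("extra_body", {}):
            # Drop only the grammar key and retry; keep any other
            # extra_body settings intact.
            retry = dict(kwargs)
            retry["extra_body"] = {k: v for k, v in kwargs["extra_body"].items()
                                   if k != "grammar"}
            return client.chat.completions.create(**retry)
        raise
```

The trade-off is that the fallback silently loses the grammar constraint, so callers should be able to tell (e.g. via logging) which path produced the response.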