Industrial-Grade Pytest for AI Agents.
Define test scenarios in YAML. Benchmark any agent — LangGraph, CrewAI, AutoGen, or custom. Get premium reports with pass/fail, tokens, latency, cost, and failure analysis.
In 2026, 52% of organizations still don't run automated evaluations on their multi-step agent workflows. Existing tools are either ecosystem-locked (LangSmith) or too academic (THUDM/AgentBench).
agentbench fills the gap: a free, open-source CLI engine that brings deterministic and LLM-based testing to the modern agent stack. Think of it as pytest meets k6 for autonomous AI.
```bash
pip install agentbench
```

```yaml
name: "basic-research"
tasks:
  - id: "compare-frameworks"
    input: "Compare LangGraph and CrewAI for production systems in 2026."
    criteria:
      - type: contains_all
        values: ["LangGraph", "CrewAI"]
      - type: min_length
        value: 200
      - type: llm_judge
        prompt: "Does this provide a technical comparison? Score 0-10."
        threshold: 7
    limits:
      max_tokens: 50000
      max_latency_seconds: 60
```

```bash
agentbench run --scenario scenarios/research.yaml --agent my_module:MyAgentAdapter --format html
```

Our reporter generates a premium, glassmorphism-styled HTML dashboard for every run.
- Dynamic Charts: Visualize pass/fail trends and latency spikes.
- Deep Observability: Click into any task to see raw inputs, outputs, and failing criteria.
- Cost Metrics: Real-time token counting and cost estimation.
> Note: View a live example of the report aesthetics in the documentation.
```mermaid
graph TD
    A[Scenario Loader] --> B[Parallel Runner]
    B --> C[Agent Adapter]
    C --> D[LangGraph / CrewAI / AutoGen]
    B --> E[Evaluation Engine]
    E --> F[Deterministic Evaluators]
    E --> G[LLM-Judge / Semantic Check]
    B --> H[Reporters]
    H --> I[Rich CLI Table]
    H --> J[Glassmorphism HTML]
    H --> K[JSON Metadata]
```
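The deterministic evaluators branch of the engine can be approximated with plain predicate functions. This is a hypothetical sketch, not the library's actual API; the criterion names mirror the YAML example above (`contains_all`, `min_length`):

```python
# Hypothetical sketch of deterministic evaluators; the real agentbench
# internals may differ. Each evaluator maps (output, criterion) -> bool.

def eval_contains_all(output: str, values: list[str]) -> bool:
    """Pass only if every required substring appears in the output."""
    return all(v in output for v in values)

def eval_min_length(output: str, value: int) -> bool:
    """Pass only if the output is at least `value` characters long."""
    return len(output) >= value

# Dispatch table keyed by the `type` field from the scenario YAML.
EVALUATORS = {
    "contains_all": lambda out, c: eval_contains_all(out, c["values"]),
    "min_length": lambda out, c: eval_min_length(out, c["value"]),
}

def evaluate(output: str, criteria: list[dict]) -> bool:
    """A task passes only when every deterministic criterion passes."""
    return all(EVALUATORS[c["type"]](out := output, c) for c in criteria)
```

The all-or-nothing semantics here matches the Pass/Fail metric described later: one failing criterion fails the whole task.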
- ⚡ Parallel Task Execution: Benchmark large scenarios 10x faster with managed `asyncio` concurrency.
- 🛡️ Built-in Scenario Packs: Standardized benchmarks for `tool-use`, `research`, and `error-recovery`.
- 👁️ Structured Observability: High-fidelity logging with `structlog` for easy ingestion into Datadog/Splunk.
- 🔌 Framework Agnostic: A simple `AgentAdapter` interface allows you to test any agent in seconds.
- 🐳 DevOps Ready: Includes an optimized `Dockerfile` (using `uv`) and a comprehensive `Makefile`.
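The exact shape of the `AgentAdapter` interface is not shown here, so the sketch below is an assumption: a single abstract method that takes a task input and returns the agent's final text answer.

```python
# Hypothetical AgentAdapter sketch -- the real interface shipped by
# agentbench may use different method names or a richer result type.
from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    """Bridge between the benchmark runner and any agent framework."""

    @abstractmethod
    def run(self, task_input: str) -> str:
        """Execute the agent on one task and return its final answer."""

class MyAgentAdapter(AgentAdapter):
    """Wraps a custom agent; here a stub that returns a canned answer."""

    def run(self, task_input: str) -> str:
        # In a real adapter this would invoke LangGraph/CrewAI/AutoGen.
        return f"Comparing frameworks for: {task_input}"
```

Under this assumption, the CLI would load the adapter via `--agent my_module:MyAgentAdapter`, as in the quick-start command.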
| Metric | Accuracy | How It's Measured |
|---|---|---|
| Pass/Fail | 100% | All criteria must be satisfied (deterministic + LLM) |
| Tokens | 100% | Precise counting via `tiktoken` |
| Latency | High | Monotonic clock timing from call to return |
| Cost | Est. | Calculated from token count × model rates |
| Consistency | High | Pass rate across multiple runs (optional) |
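The Cost row above (token count × model rates) reduces to simple arithmetic. The per-million-token prices in this sketch are placeholder assumptions, not real provider rates:

```python
# Illustrative cost estimation: token counts times per-token rates.
# Prices are placeholders; substitute your provider's actual rates.
RATES_PER_1M = {  # USD per 1M tokens: (input, output) -- assumed values
    "example-model": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost for a single run."""
    in_rate, out_rate = RATES_PER_1M[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 40k input tokens at $3/1M plus 10k output tokens at $15/1M:
cost = estimate_cost("example-model", 40_000, 10_000)  # 0.12 + 0.15 = 0.27
```

Because the arithmetic is exact but the rate table must be kept current, the table labels this metric an estimate rather than 100% accurate.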
We welcome contributions from the community! Please read our CONTRIBUTING.md to get started.
High-impact areas:
- New evaluators (e.g., trajectory efficiency, tool-calling accuracy)
- Framework adapters (pre-built adapters for popular SDKs)
- Reporters (Markdown, PDF, or Grafana dashboards)
MIT — Test everything. Trust nothing.