
agentbench

Industrial-Grade Pytest for AI Agents.

Define test scenarios in YAML. Benchmark any agent — LangGraph, CrewAI, AutoGen, or custom. Get premium reports with pass/fail, tokens, latency, cost, and failure analysis.


🏗️ The Enterprise Challenge

In 2026, 52% of organizations still don't run automated evaluations on their multi-step agent workflows. Existing tools are either ecosystem-locked (LangSmith) or too academic (THUDM/AgentBench).

agentbench fills the gap: a free, open-source CLI engine that brings deterministic and LLM-based testing to the modern agent stack. Think of it as pytest meets k6 for autonomous AI.


⚡ Quick Start

1. Install via uv or pip

```shell
pip install agentbench
# or, with uv:
uv pip install agentbench
```

2. Define a Scenario (research.yaml)

```yaml
name: "basic-research"
tasks:
  - id: "compare-frameworks"
    input: "Compare LangGraph and CrewAI for production systems in 2026."
    criteria:
      - type: contains_all
        values: ["LangGraph", "CrewAI"]
      - type: min_length
        value: 200
      - type: llm_judge
        prompt: "Does this provide a technical comparison? Score 0-10."
        threshold: 7
    limits:
      max_tokens: 50000
      max_latency_seconds: 60
```
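The `contains_all` and `min_length` criteria above are deterministic checks, so they can be reasoned about directly. As a rough illustration only (these function shapes are assumptions, not agentbench's internals), such evaluators boil down to:

```python
# Hypothetical deterministic evaluators mirroring the YAML criteria above.
# Illustrative shapes only -- not agentbench's actual implementation.

def contains_all(output: str, values: list[str]) -> bool:
    """Pass only if every expected substring appears in the output."""
    return all(v in output for v in values)

def min_length(output: str, value: int) -> bool:
    """Pass only if the output is at least `value` characters long."""
    return len(output) >= value

report = "LangGraph offers fine-grained state machines, while CrewAI ..."
print(contains_all(report, ["LangGraph", "CrewAI"]))  # True
print(min_length(report, 200))                        # False: sample is short
```

The `llm_judge` criterion, by contrast, sends the output to a grading model and compares the returned score against `threshold`.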

3. Run with Your Agent

```shell
agentbench run --scenario scenarios/research.yaml --agent my_module:MyAgentAdapter --format html
```
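`my_module:MyAgentAdapter` points at your adapter class. The exact `AgentAdapter` interface is defined by agentbench; purely as a sketch (the class and method names below are assumptions, not the documented API), an adapter wraps any agent behind a single entry point:

```python
# Hypothetical adapter sketch -- names and signature are assumptions,
# not agentbench's documented AgentAdapter interface.

class MyAgentAdapter:
    """Wraps any agent framework behind one uniform call."""

    def run(self, task_input: str) -> str:
        # Invoke LangGraph / CrewAI / AutoGen / your own stack here.
        # For illustration, return a canned response.
        return f"Comparison for: {task_input}"

adapter = MyAgentAdapter()
print(adapter.run("Compare LangGraph and CrewAI"))
```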

🎨 Professional Visualization

Our reporter generates a premium, glassmorphism-styled HTML dashboard for every run.

  • Dynamic Charts: Visualize pass/fail trends and latency spikes.
  • Deep Observability: Click into any task to see raw inputs, outputs, and failing criteria.
  • Cost Metrics: Real-time token counting and cost estimation.

> **Note**
> View a live example of the report aesthetics in the documentation.


🧩 Architecture

```mermaid
graph TD
    A[Scenario Loader] --> B[Parallel Runner]
    B --> C[Agent Adapter]
    C --> D[LangGraph / CrewAI / AutoGen]
    B --> E[Evaluation Engine]
    E --> F[Deterministic Evaluators]
    E --> G[LLM-Judge / Semantic Check]
    B --> H[Reporters]
    H --> I[Rich CLI Table]
    H --> J[Glassmorphism HTML]
    H --> K[JSON Metadata]
```

🚀 Key Features (FAANG Grade)

  • ⚡ Parallel Task Execution: Benchmark large scenarios 10x faster with managed asyncio concurrency.
  • 🛡️ Built-in Scenario Packs: Standardized benchmarks for tool-use, research, and error-recovery.
  • 👁️ Structured Observability: High-fidelity logging with structlog for easy ingestion into Datadog/Splunk.
  • 🔌 Framework Agnostic: A simple AgentAdapter interface allows you to test any agent in seconds.
  • 🐳 DevOps Ready: Includes an optimized Dockerfile (using uv) and a comprehensive Makefile.
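The parallel-execution idea can be sketched with plain asyncio: bound concurrency with a semaphore and gather results. This is a simplified illustration in the spirit of the runner described above, not the engine's actual code:

```python
import asyncio

# Simplified sketch of bounded parallel task execution -- illustrative
# only, not agentbench's real runner.

async def run_task(task_id: str, sem: asyncio.Semaphore) -> str:
    async with sem:                # cap concurrent agent calls
        await asyncio.sleep(0.01)  # stand-in for an agent invocation
        return f"{task_id}: passed"

async def run_scenario(task_ids: list[str], max_concurrency: int = 4) -> list[str]:
    sem = asyncio.Semaphore(max_concurrency)
    return await asyncio.gather(*(run_task(t, sem) for t in task_ids))

results = asyncio.run(run_scenario(["t1", "t2", "t3"]))
print(results)  # ['t1: passed', 't2: passed', 't3: passed']
```

Because each simulated agent call awaits concurrently instead of sequentially, wall-clock time grows with the slowest task rather than the sum of all tasks.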

📊 Core Metrics Measured

| Metric      | Accuracy | How It's Measured                                      |
|-------------|----------|--------------------------------------------------------|
| Pass/Fail   | 100%     | All criteria must be satisfied (deterministic + LLM)   |
| Tokens      | 100%     | Precise counting via tiktoken                          |
| Latency     | High     | Monotonic wall-clock time from call to return          |
| Cost        | Est.     | Calculated from token count × model rates              |
| Consistency | High     | Pass rate across multiple runs (optional)              |
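Cost estimation is simple arithmetic over the token counts: tokens × per-token model rates. A minimal sketch (the rates below are illustrative placeholders, not real model pricing):

```python
# Sketch of cost estimation: tokens x per-1K-token rates.
# Rates are illustrative placeholders, not real model pricing.

RATES_PER_1K = {"input": 0.005, "output": 0.015}  # USD per 1K tokens (assumed)

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one run from its prompt and completion tokens."""
    return (input_tokens / 1000) * RATES_PER_1K["input"] \
         + (output_tokens / 1000) * RATES_PER_1K["output"]

# 40,000 prompt tokens + 2,000 completion tokens:
print(round(estimate_cost(40_000, 2_000), 3))  # 0.23
```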

🤝 Contributing

We welcome contributions from the community! Please read our CONTRIBUTING.md to get started.

High-impact areas:

  • New evaluators (e.g., trajectory efficiency, tool-calling accuracy)
  • Framework adapters (pre-built adapters for popular SDKs)
  • Reporters (Markdown, PDF, or Grafana dashboards)

📜 License

MIT — Test everything. Trust nothing.


Built with ❤️ by Ismail Sajid (Re-architected for FAANG by Antigravity AI)
