[ICCV 2025] AdsQA: Towards Advertisement Video Understanding. arXiv: https://arxiv.org/abs/2509.08621
Community benchmark database for running LLMs on Apple Silicon Macs
Benchmark LLMs on real professional tasks, not academic puzzles. YAML-driven experiment pipeline + live React dashboard for GDPVal Gold Subset (220 tasks across 11 industries).
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
An agent evaluation framework with native multi-turn feedback iteration.
CapBencher toolkit: Give your LLM benchmark a built-in alarm for leakage and gaming
Testing how well LLMs can solve jigsaw puzzles
Open-source multi-agent AI debate arena: pit Claude, GPT, Gemini, Ollama & HuggingFace models against each other with frozen-context fairness, evidence-first judging, 20+ personas, code review, and PDF/Markdown reports. CLI + Web UI.
Comprehensive benchmark of OpenRouter free-tier LLMs for practical applications. Evaluates models for coding, Thai language, and general use.
🚀 A modern, production-ready refactor of the LoCoMo long-term memory benchmark.
Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
LiveSecBench is a professional, dynamic, multi-dimensional evaluation benchmark for the safety of large language models. Through a scientific, systematic, and continuously evolving evaluation framework, it aims to objectively assess and measure the safety performance of large models, drive LLM technology toward safer, more reliable, and more responsible development, and provide a key safety yardstick for industrial deployment and academic research.
Research question: can the "office intelligence" of LLMs be measured? This is an attempt: 100 scenarios, 10 criteria, a Russian corporate context.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
Benchmark LLMs' Spatial Reasoning with Head-to-Head Bananagrams
Community-driven behavioral reliability benchmark for LLMs. 88 probes across 24 categories, deterministic TrustScore, hardware-stratified community rankings, performance prediction. Every test contributes to the community dataset.
Benchmarking LLM decision-making in structured, adversarial environments using game-based evaluation.
Claude Code skill that pits Claude, ChatGPT, and Gemini against each other, then lets them cross-judge each other blind