Add SWE-Bench Pro benchmark support#699

Open
neubig wants to merge 5 commits into main from add-swebench-pro

Conversation

Contributor

@neubig neubig commented Apr 25, 2026

Summary

  • add a new benchmarks/swebenchpro benchmark package that reuses the existing SWE-Bench inference flow via subclass hooks (sketched after this list) and benchmark-specific build/eval wrappers
  • add SWE-Bench Pro documentation, CLI entrypoints, and a reusable build-swebenchpro-images workflow with PR labels for 50/200/full image builds
  • add focused unit coverage for SWE-Bench Pro image selection and evaluation-format/report generation helpers
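
The reuse pattern is roughly the following sketch. The class names and method bodies are illustrative assumptions; only the hook names and the image-tag and /app-vs-/testbed differences are taken from this PR and its reviews.

```python
class SWEBenchRunner:
    """Shared inference flow; subclasses override narrow hook methods."""

    def get_official_docker_image(self, instance: dict) -> str:
        # Default SWE-Bench behavior (approximate): derive the image
        # name from the instance id.
        return f"sweb.eval.x86_64.{instance['instance_id']}"

    def get_source_repo_path(self) -> str:
        # SWE-Bench checkouts live under /testbed.
        return "/testbed"

    def run_instance(self, instance: dict) -> None:
        image = self.get_official_docker_image(instance)
        repo_path = self.get_source_repo_path()
        print(f"would run the agent in {image} against {repo_path}")


class SWEBenchProRunner(SWEBenchRunner):
    """SWE-Bench Pro overrides only the benchmark-specific hooks
    (extract_custom_tag and should_wrap_instance are elided here)."""

    def get_official_docker_image(self, instance: dict) -> str:
        # Pro instances carry their Docker image reference in the dataset row.
        return instance["docker_image"]

    def get_source_repo_path(self) -> str:
        # Pro images keep the repository at /app instead of /testbed.
        return "/app"
```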

Validation

  • uv run pytest tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py
  • uv run pre-commit run --files README.md pyproject.toml .github/workflows/build-swebenchpro-images.yml benchmarks/swebench/build_base_images.py benchmarks/swebench/run_infer.py benchmarks/swebenchpro/__init__.py benchmarks/swebenchpro/constants.py benchmarks/swebenchpro/config.py benchmarks/swebenchpro/build_images.py benchmarks/swebenchpro/run_infer.py benchmarks/swebenchpro/eval_infer.py benchmarks/swebenchpro/README.md benchmarks/swebenchpro/prompts/default.j2 tests/test_swebenchpro.py
  • uv run swebenchpro-infer --help
  • uv run swebenchpro-eval --help

Notes

  • The new image-build workflow is callable from other repos (including openhands/software-agent-sdk) via workflow_call.
  • The official SWE-Bench Pro harness is invoked through a thin wrapper that downloads a pinned checkout on first use (see the sketch below).
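
A minimal sketch of such a download-on-first-use wrapper, assuming a GitHub tarball layout; the repository URL, commit SHA, and cache layout below are placeholders, not the values pinned in this PR:

```python
import io
import tarfile
import tempfile
import urllib.request
from pathlib import Path

HARNESS_REPO = "https://github.com/example-org/swe-bench-pro-harness"  # placeholder
HARNESS_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # placeholder pin


def ensure_harness(cache_dir: Path) -> Path:
    """Download the official harness at a pinned commit once and cache it."""
    checkout = cache_dir / f"harness-{HARNESS_COMMIT}"
    if checkout.exists():
        return checkout
    cache_dir.mkdir(parents=True, exist_ok=True)
    url = f"{HARNESS_REPO}/archive/{HARNESS_COMMIT}.tar.gz"
    with urllib.request.urlopen(url) as resp:
        archive = io.BytesIO(resp.read())
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmp:
        with tarfile.open(fileobj=archive, mode="r:gz") as tar:
            tar.extractall(tmp)
        # GitHub archives unpack to a single "<repo>-<sha>" directory.
        entries = [p for p in Path(tmp).iterdir() if p.is_dir()]
        if len(entries) != 1:
            raise RuntimeError(f"expected one extracted directory, got {len(entries)}")
        entries[0].rename(checkout)
    return checkout
```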

This PR description was created by an AI agent (OpenHands) on behalf of the user.

Follow-up (2026-04-29)

  • fixed the failing shared metrics regression in 024891f5 by teaching the test harness to patch the module that defines each benchmark's inherited evaluation implementation
  • added SWE-Bench Pro-specific instance/metadata fixtures to the shared metrics tests
  • re-ran PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_metrics.py tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py -q on the updated head (65 passed)
  • re-ran /workspace/project/benchmarks/.venv/bin/pre-commit run --files tests/test_metrics.py before pushing
  • addressed the follow-up automated review concern in 69979f72 by making SWE-Bench Pro conversion fail fast on malformed input rows (sketched after this list) and by validating that the official harness checkout contains swe_bench_pro_eval.py before evaluation
  • added targeted tests for malformed conversion input and missing harness script failures
  • re-ran PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_swebenchpro.py tests/test_metrics.py tests/test_swebench_eval_infer.py -q on the updated head (30 passed)
  • re-ran /workspace/project/benchmarks/.venv/bin/pre-commit run --files benchmarks/swebenchpro/eval_infer.py tests/test_swebenchpro.py tests/test_metrics.py before pushing
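
A minimal sketch of the fail-fast conversion, assuming JSONL inference output; the signature and the field names read from each row (test_result/git_patch) are assumptions, though convert_to_swebenchpro_format is the function name discussed in review:

```python
import json
from pathlib import Path


def convert_to_swebenchpro_format(infer_output: Path, patches_out: Path) -> None:
    """Convert inference output (JSONL) into the harness patch format,
    aborting on the first malformed row instead of silently skipping it."""
    rows = []
    for lineno, line in enumerate(infer_output.read_text().splitlines(), start=1):
        try:
            record = json.loads(line)
            rows.append(
                {
                    "instance_id": record["instance_id"],
                    "model_patch": record["test_result"]["git_patch"],  # assumed keys
                }
            )
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Fail fast: a bad row must abort the conversion so evaluation
            # never runs against an incomplete patch file.
            raise ValueError(f"malformed inference row at line {lineno}") from exc
    patches_out.write_text("".join(json.dumps(r) + "\n" for r in rows))
```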

Reuse the existing SWE-Bench inference flow via subclass hooks, add benchmark-specific build and evaluation wrappers, and add a CI workflow for building SWE-Bench Pro images.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is sound but needs stronger validation evidence.

[IMPORTANT]

  • [PR Description] Missing Evidence: The validation section shows --help commands and test runs, but no concrete proof of successful end-to-end inference or evaluation. Per repository standards for CLI/backend changes, please add an Evidence section showing:
    • Actual swebenchpro-infer command with real parameters and output (even a single instance with --n-limit 1)
    • Actual swebenchpro-eval command and resulting report
    • Link to agent conversation if this work came from an agent run

[RISK ASSESSMENT]

  • Overall PR: 🟡 MEDIUM

New feature addition with good architecture and test coverage, but lacks end-to-end validation evidence. Risk factors: complex workflow logic, external harness dependency, and error handling that could mask data issues. Not HIGH because existing functionality is preserved via backward-compatible changes and the subclassing approach provides clean separation.

VERDICT: ✅ Worth merging after addressing evidence requirement

KEY INSIGHT: The subclassing approach with hook methods (get_official_docker_image, extract_custom_tag, should_wrap_instance, get_source_repo_path) elegantly reuses existing SWE-Bench infrastructure while maintaining clean separation for Pro-specific behavior.

Trim overly long SWE-Bench Pro Docker Hub tags into stable, registry-safe custom tags so phased image builds succeed on long instance identifiers.

Co-authored-by: openhands <openhands@all-hands.dev>
Account for the phased image prefix and target suffix so SWE-Bench Pro custom tags stay below Docker tag limits during final agent image assembly.

Co-authored-by: openhands <openhands@all-hands.dev>
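
The trimming idea behind these two commits can be sketched as follows. Docker's 128-character tag limit is real; the reserved prefix/suffix budget, the digest length, and the name safe_custom_tag are assumptions:

```python
import hashlib

DOCKER_TAG_MAX = 128  # Docker rejects image tags longer than 128 characters
RESERVED = len("phased-") + len("-target")  # assumed prefix/suffix budget


def safe_custom_tag(instance_id: str) -> str:
    """Trim long instance identifiers into stable, registry-safe tags."""
    budget = DOCKER_TAG_MAX - RESERVED
    if len(instance_id) <= budget:
        return instance_id
    # Keep a readable head and append a short digest so trimmed tags remain
    # unique and deterministic across builds.
    digest = hashlib.sha256(instance_id.encode()).hexdigest()[:8]
    return f"{instance_id[: budget - len(digest) - 1]}-{digest}"
```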
@neubig neubig requested a review from all-hands-bot April 29, 2026 03:50
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is solid but has one critical error handling issue that must be addressed.

The subclassing pattern to extend SWE-Bench infrastructure for SWE-Bench Pro is well-designed. The hook methods (get_official_docker_image, extract_custom_tag, etc.) provide clean extension points without duplicating logic.

@all-hands-bot
Collaborator

Review Summary

[CRITICAL ISSUES]

  • [benchmarks/swebenchpro/eval_infer.py, Lines 62-85] Error Handling: Silent error handling creates incomplete patch files. Parse errors are logged but the affected instances are skipped without failing, so evaluation could silently omit them. Must fail fast, or raise after conversion completes.

[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebenchpro/eval_infer.py, Lines 93-102] Validation: Archive extraction validates the directory count but doesn't verify that swe_bench_pro_eval.py exists, leading to unclear errors later.
  • [PR Description] Missing Evidence: The validation section shows test runs and --help commands, but no concrete proof of successful end-to-end inference or evaluation. Per repository standards for CLI/backend changes, please add an Evidence section showing:
    • Actual swebenchpro-infer command with real parameters and output (even on a small subset like --n-limit 1)
    • Actual swebenchpro-eval command with output showing successful evaluation
    • This proves the integration works end-to-end, not just that tests pass

[DESIGN STRENGTHS]

  • Clean Extension Pattern: The subclassing approach reuses SWE-Bench infrastructure effectively without code duplication
  • Well-Defined Hooks: Methods like get_official_docker_image, extract_custom_tag, should_wrap_instance, and get_source_repo_path provide clear extension points
  • Test Coverage: Focused unit tests cover the SWE-Bench Pro-specific logic (image selection, format conversion, report generation)
  • Metrics Fix: The _get_evaluation_module_name helper correctly handles inherited evaluation methods

[RISK ASSESSMENT]

  • Overall PR: 🟡 MEDIUM

This PR adds a new benchmark evaluation path that processes user-generated patches and runs evaluation in Docker containers. The risk is elevated due to:

  1. Data Integrity: Silent error handling in conversion could produce incomplete results (critical issue above)
  2. External Dependencies: Downloads and executes pinned harness code from GitHub
  3. Docker Execution: Runs evaluation containers with network access by default
  4. New Code Paths: First implementation of SWE-Bench Pro support with limited production validation

The architecture is sound and follows existing patterns, mitigating some risk. The critical error handling issue must be fixed before merge.

Recommendation: Fix the critical error handling issue. Evidence would strengthen confidence in the implementation but is not a blocker given the strong test coverage.


VERDICT:

Worth merging after fix: Core architecture is excellent, clean extension pattern. Must fix the silent error handling before merge. Evidence would be nice to have but not blocking.

KEY INSIGHT:

The subclassing pattern successfully isolates SWE-Bench Pro differences (Docker image tags from dataset, /app vs /testbed paths) while reusing the complex phased build pipeline—exactly the right abstraction level.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@neubig neubig requested a review from all-hands-bot April 29, 2026 04:07
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. Previous critical issues have been addressed.

[IMPROVEMENT OPPORTUNITIES]
See inline comment on string concatenation style.

[RISK ASSESSMENT]
🟡 MEDIUM

This PR adds new benchmark infrastructure that:

  • Downloads external code from a pinned GitHub commit (validated post-extraction)
  • Triggers CI image builds via workflow labels (resource usage)
  • Extends existing SWE-Bench via clean subclass hooks (no breaking changes)
  • Has comprehensive test coverage including malformed input handling

Risk is medium due to external dependencies and CI resource implications, but well-controlled through validation, pinned commits, and the purely additive nature of changes.

VERDICT:
Worth merging - Core logic is sound, previous critical issues resolved, minor style improvement suggested

KEY INSIGHT:
The subclass hook pattern (get_official_docker_image, extract_custom_tag, get_source_repo_path) provides clean extension points that maximize code reuse while supporting benchmark-specific requirements.


@neubig neubig requested a review from all-hands-bot April 29, 2026 12:35
Collaborator

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clean architecture with excellent reuse via subclassing.

[IMPROVEMENTS DELIVERED]
The follow-up commits successfully addressed the critical issues from previous reviews:

  • Error Handling: convert_to_swebenchpro_format now fails fast when encountering malformed input (lines 79-83), preventing silent data loss
  • Harness Validation: _validate_harness_dir ensures swe_bench_pro_eval.py exists before evaluation (lines 112-119), with validation called both on cache hits and after extraction; see the sketch below
  • Test Coverage: Added tests for malformed input handling and missing harness script scenarios
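
A minimal sketch of that validation, assuming the function takes a single checkout path (the function and script names come from the review above; the error message is illustrative):

```python
from pathlib import Path


def _validate_harness_dir(harness_dir: Path) -> None:
    """Fail early if the pinned checkout lacks the harness entry point."""
    script = harness_dir / "swe_bench_pro_eval.py"
    if not script.is_file():
        raise FileNotFoundError(
            f"Harness checkout at {harness_dir} is missing swe_bench_pro_eval.py; "
            "delete the cached checkout and re-download the pinned commit."
        )
```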

[KEY INSIGHT]
The Template Method pattern implementation is exemplary - four hook methods (get_official_docker_image, extract_custom_tag, should_wrap_instance, get_source_repo_path) allow SWE-Bench Pro to customize behavior without duplicating the inference/evaluation pipeline. This is exactly how to extend evaluation infrastructure.

[RISK ASSESSMENT]
🟡 MEDIUM

This PR adds significant benchmark infrastructure with the following risk factors:

Elevated Risks:

  • External Code Download: Downloads official harness from GitHub (mitigated by pinned commit SHA + post-extraction validation)
  • CI Resource Usage: Workflow triggers Docker image builds that can consume significant compute (mitigated by label-based triggering with 50/200/full limits)
  • New Infrastructure: Adds new benchmark package with custom build/eval pipeline (mitigated by comprehensive unit tests including error cases)

Mitigating Factors:

  • Clean extension of battle-tested SWE-Bench infrastructure via subclassing
  • Comprehensive test coverage (8 tests covering happy path, edge cases, and failure modes)
  • Validation of external dependencies before use
  • Documentation includes usage instructions and architectural notes
  • No changes to existing benchmarks (zero regression risk)

Recommendation: Safe to merge. The architecture is solid, previous critical issues have been resolved, and test coverage provides confidence in correctness.

[VERDICT]
Worth merging: Core logic is sound, critical issues resolved, comprehensive testing validates behavior.
