Add SWE-Bench Pro benchmark support#699

Open
neubig wants to merge 5 commits into main from add-swebench-pro

Conversation

Contributor

@neubig neubig commented Apr 25, 2026

Summary

  • add a new benchmarks/swebenchpro benchmark package that reuses the existing SWE-Bench inference flow via subclass hooks (sketched after this list) and benchmark-specific build/eval wrappers
  • add SWE-Bench Pro documentation, CLI entrypoints, and a reusable build-swebenchpro-images workflow with PR labels for 50/200/full image builds
  • add focused unit coverage for SWE-Bench Pro image selection and evaluation-format/report generation helpers
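
The reuse pattern is roughly the following sketch. The class names and method bodies are illustrative assumptions; only the hook names and the image-tag and /app-vs-/testbed differences are taken from this PR and its reviews.

```python
class SWEBenchRunner:
    """Shared inference flow; subclasses override narrow hook methods."""

    def get_official_docker_image(self, instance: dict) -> str:
        # Default SWE-Bench behavior (approximate): derive the image
        # name from the instance id.
        return f"sweb.eval.x86_64.{instance['instance_id']}"

    def get_source_repo_path(self) -> str:
        # SWE-Bench checkouts live under /testbed.
        return "/testbed"

    def run_instance(self, instance: dict) -> None:
        image = self.get_official_docker_image(instance)
        repo_path = self.get_source_repo_path()
        print(f"would run the agent in {image} against {repo_path}")


class SWEBenchProRunner(SWEBenchRunner):
    """SWE-Bench Pro overrides only the benchmark-specific hooks
    (extract_custom_tag and should_wrap_instance are elided here)."""

    def get_official_docker_image(self, instance: dict) -> str:
        # Pro instances carry their Docker image reference in the dataset row.
        return instance["docker_image"]

    def get_source_repo_path(self) -> str:
        # Pro images keep the repository at /app instead of /testbed.
        return "/app"
```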

Validation

  • uv run pytest tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py
  • uv run pre-commit run --files README.md pyproject.toml .github/workflows/build-swebenchpro-images.yml benchmarks/swebench/build_base_images.py benchmarks/swebench/run_infer.py benchmarks/swebenchpro/__init__.py benchmarks/swebenchpro/constants.py benchmarks/swebenchpro/config.py benchmarks/swebenchpro/build_images.py benchmarks/swebenchpro/run_infer.py benchmarks/swebenchpro/eval_infer.py benchmarks/swebenchpro/README.md benchmarks/swebenchpro/prompts/default.j2 tests/test_swebenchpro.py
  • uv run swebenchpro-infer --help
  • uv run swebenchpro-eval --help

Notes

  • The new image-build workflow is callable from other repos (including openhands/software-agent-sdk) via workflow_call.
  • The official SWE-Bench Pro harness is invoked through a thin wrapper that downloads a pinned checkout on first use (see the sketch below).
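
A minimal sketch of such a download-on-first-use wrapper, assuming a GitHub tarball layout; the repository URL, commit SHA, and cache layout below are placeholders, not the values pinned in this PR:

```python
import io
import tarfile
import tempfile
import urllib.request
from pathlib import Path

HARNESS_REPO = "https://github.com/example-org/swe-bench-pro-harness"  # placeholder
HARNESS_COMMIT = "0123456789abcdef0123456789abcdef01234567"  # placeholder pin


def ensure_harness(cache_dir: Path) -> Path:
    """Download the official harness at a pinned commit once and cache it."""
    checkout = cache_dir / f"harness-{HARNESS_COMMIT}"
    if checkout.exists():
        return checkout
    cache_dir.mkdir(parents=True, exist_ok=True)
    url = f"{HARNESS_REPO}/archive/{HARNESS_COMMIT}.tar.gz"
    with urllib.request.urlopen(url) as resp:
        archive = io.BytesIO(resp.read())
    with tempfile.TemporaryDirectory(dir=cache_dir) as tmp:
        with tarfile.open(fileobj=archive, mode="r:gz") as tar:
            tar.extractall(tmp)
        # GitHub archives unpack to a single "<repo>-<sha>" directory.
        entries = [p for p in Path(tmp).iterdir() if p.is_dir()]
        if len(entries) != 1:
            raise RuntimeError(f"expected one extracted directory, got {len(entries)}")
        entries[0].rename(checkout)
    return checkout
```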

This PR description was created by an AI agent (OpenHands) on behalf of the user.

Follow-up (2026-04-29)

  • fixed the failing shared metrics regression in 024891f5 by teaching the test harness to patch the module that defines each benchmark's inherited evaluation implementation
  • added SWE-Bench Pro-specific instance/metadata fixtures to the shared metrics tests
  • re-ran PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_metrics.py tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py -q on the updated head (65 passed)
  • re-ran /workspace/project/benchmarks/.venv/bin/pre-commit run --files tests/test_metrics.py before pushing
  • addressed the follow-up automated review concern in 69979f72 by making SWE-Bench Pro conversion fail fast on malformed input rows (sketched after this list) and by validating that the official harness checkout contains swe_bench_pro_eval.py before evaluation
  • added targeted tests for malformed conversion input and missing harness script failures
  • re-ran PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_swebenchpro.py tests/test_metrics.py tests/test_swebench_eval_infer.py -q on the updated head (30 passed)
  • re-ran /workspace/project/benchmarks/.venv/bin/pre-commit run --files benchmarks/swebenchpro/eval_infer.py tests/test_swebenchpro.py tests/test_metrics.py before pushing
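
A minimal sketch of the fail-fast conversion, assuming JSONL inference output; the signature and the field names read from each row (test_result/git_patch) are assumptions, though convert_to_swebenchpro_format is the function name discussed in review:

```python
import json
from pathlib import Path


def convert_to_swebenchpro_format(infer_output: Path, patches_out: Path) -> None:
    """Convert inference output (JSONL) into the harness patch format,
    aborting on the first malformed row instead of silently skipping it."""
    rows = []
    for lineno, line in enumerate(infer_output.read_text().splitlines(), start=1):
        try:
            record = json.loads(line)
            rows.append(
                {
                    "instance_id": record["instance_id"],
                    "model_patch": record["test_result"]["git_patch"],  # assumed keys
                }
            )
        except (json.JSONDecodeError, KeyError, TypeError) as exc:
            # Fail fast: a bad row must abort the conversion so evaluation
            # never runs against an incomplete patch file.
            raise ValueError(f"malformed inference row at line {lineno}") from exc
    patches_out.write_text("".join(json.dumps(r) + "\n" for r in rows))
```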

Reuse the existing SWE-Bench inference flow via subclass hooks, add benchmark-specific build and evaluation wrappers, and add a CI workflow for building SWE-Bench Pro images.

Co-authored-by: openhands <openhands@all-hands.dev>
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is sound but needs stronger validation evidence.

[IMPORTANT]

  • [PR Description] Missing Evidence: The validation section shows --help commands and test runs, but no concrete proof of successful end-to-end inference or evaluation. Per repository standards for CLI/backend changes, please add an Evidence section showing:
    • Actual swebenchpro-infer command with real parameters and output (even a single instance with --n-limit 1)
    • Actual swebenchpro-eval command and resulting report
    • Link to agent conversation if this work came from an agent run

[RISK ASSESSMENT]

  • Overall PR: 🟡 MEDIUM

New feature addition with good architecture and test coverage, but lacks end-to-end validation evidence. Risk factors: complex workflow logic, external harness dependency, and error handling that could mask data issues. Not HIGH because existing functionality is preserved via backward-compatible changes and the subclassing approach provides clean separation.

VERDICT: ✅ Worth merging after addressing evidence requirement

KEY INSIGHT: The subclassing approach with hook methods (get_official_docker_image, extract_custom_tag, should_wrap_instance, get_source_repo_path) elegantly reuses existing SWE-Bench infrastructure while maintaining clean separation for Pro-specific behavior.

Trim overly long SWE-Bench Pro Docker Hub tags into stable, registry-safe custom tags so phased image builds succeed on long instance identifiers.

Co-authored-by: openhands <openhands@all-hands.dev>
Account for the phased image prefix and target suffix so SWE-Bench Pro custom tags stay below Docker tag limits during final agent image assembly.

Co-authored-by: openhands <openhands@all-hands.dev>
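
The trimming idea behind these two commits can be sketched as follows. Docker's 128-character tag limit is real; the reserved prefix/suffix budget, the digest length, and the name safe_custom_tag are assumptions:

```python
import hashlib

DOCKER_TAG_MAX = 128  # Docker rejects image tags longer than 128 characters
RESERVED = len("phased-") + len("-target")  # assumed prefix/suffix budget


def safe_custom_tag(instance_id: str) -> str:
    """Trim long instance identifiers into stable, registry-safe tags."""
    budget = DOCKER_TAG_MAX - RESERVED
    if len(instance_id) <= budget:
        return instance_id
    # Keep a readable head and append a short digest so trimmed tags remain
    # unique and deterministic across builds.
    digest = hashlib.sha256(instance_id.encode()).hexdigest()[:8]
    return f"{instance_id[: budget - len(digest) - 1]}-{digest}"
```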
@neubig neubig requested a review from all-hands-bot April 29, 2026 03:50
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is solid but has one critical error handling issue that must be addressed.

The subclassing pattern to extend SWE-Bench infrastructure for SWE-Bench Pro is well-designed. The hook methods (get_official_docker_image, extract_custom_tag, etc.) provide clean extension points without duplicating logic.

@all-hands-bot
Collaborator

Review Summary

[CRITICAL ISSUES]

  • [benchmarks/swebenchpro/eval_infer.py, Lines 62-85] Error Handling: Silent error handling creates incomplete patch files. Parse errors are logged but the affected instances are skipped without failing, so evaluation could silently omit them. Must fail fast, or raise after conversion completes.

[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebenchpro/eval_infer.py, Lines 93-102] Validation: Archive extraction validates the directory count but doesn't verify that swe_bench_pro_eval.py exists, leading to unclear errors later.
  • [PR Description] Missing Evidence: The validation section shows test runs and --help commands, but no concrete proof of successful end-to-end inference or evaluation. Per repository standards for CLI/backend changes, please add an Evidence section showing:
    • Actual swebenchpro-infer command with real parameters and output (even on a small subset like --n-limit 1)
    • Actual swebenchpro-eval command with output showing successful evaluation
    • This proves the integration works end-to-end, not just that tests pass

[DESIGN STRENGTHS]

  • Clean Extension Pattern: The subclassing approach reuses SWE-Bench infrastructure effectively without code duplication
  • Well-Defined Hooks: Methods like get_official_docker_image, extract_custom_tag, should_wrap_instance, and get_source_repo_path provide clear extension points
  • Test Coverage: Focused unit tests cover the SWE-Bench Pro-specific logic (image selection, format conversion, report generation)
  • Metrics Fix: The _get_evaluation_module_name helper correctly handles inherited evaluation methods

[RISK ASSESSMENT]

  • Overall PR: 🟡 MEDIUM

This PR adds a new benchmark evaluation path that processes user-generated patches and runs evaluation in Docker containers. The risk is elevated due to:

  1. Data Integrity: Silent error handling in conversion could produce incomplete results (critical issue above)
  2. External Dependencies: Downloads and executes pinned harness code from GitHub
  3. Docker Execution: Runs evaluation containers with network access by default
  4. New Code Paths: First implementation of SWE-Bench Pro support with limited production validation

The architecture is sound and follows existing patterns, mitigating some risk. The critical error handling issue must be fixed before merge.

Recommendation: Fix the critical error handling issue. Evidence would strengthen confidence in the implementation but is not a blocker given the strong test coverage.


VERDICT:

Worth merging after fix: Core architecture is excellent, clean extension pattern. Must fix the silent error handling before merge. Evidence would be nice to have but not blocking.

KEY INSIGHT:

The subclassing pattern successfully isolates SWE-Bench Pro differences (Docker image tags from dataset, /app vs /testbed paths) while reusing the complex phased build pipeline—exactly the right abstraction level.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing (e.g., "Security concerns about X do not apply here because Y"). See the customization docs for the required frontmatter format.
  2. Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
  3. When your PR is merged, the guideline file goes through normal code review by repository maintainers.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

@neubig neubig requested a review from all-hands-bot April 29, 2026 04:07
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Clean architecture with good reuse via subclassing. Previous critical issues have been addressed.

[IMPROVEMENT OPPORTUNITIES]
See inline comment on string concatenation style.

[RISK ASSESSMENT]
🟡 MEDIUM

This PR adds new benchmark infrastructure that:

  • Downloads external code from a pinned GitHub commit (validated post-extraction)
  • Triggers CI image builds via workflow labels (resource usage)
  • Extends existing SWE-Bench via clean subclass hooks (no breaking changes)
  • Has comprehensive test coverage including malformed input handling

Risk is medium due to external dependencies and CI resource implications, but well-controlled through validation, pinned commits, and the purely additive nature of changes.

VERDICT:
Worth merging - Core logic is sound, previous critical issues resolved, minor style improvement suggested

KEY INSIGHT:
The subclass hook pattern (get_official_docker_image, extract_custom_tag, get_source_repo_path) provides clean extension points that maximize code reuse while supporting benchmark-specific requirements.


@neubig neubig requested a review from all-hands-bot April 29, 2026 12:35
Collaborator

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clean architecture with excellent reuse via subclassing.

[IMPROVEMENTS DELIVERED]
The follow-up commits successfully addressed the critical issues from previous reviews:

  • Error Handling: convert_to_swebenchpro_format now fails fast when encountering malformed input (lines 79-83), preventing silent data loss
  • Harness Validation: _validate_harness_dir ensures swe_bench_pro_eval.py exists before evaluation (lines 112-119), with validation called both on cache hits and after extraction; see the sketch below
  • Test Coverage: Added tests for malformed input handling and missing harness script scenarios
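
A minimal sketch of that validation, assuming the function takes a single checkout path (the function and script names come from the review above; the error message is illustrative):

```python
from pathlib import Path


def _validate_harness_dir(harness_dir: Path) -> None:
    """Fail early if the pinned checkout lacks the harness entry point."""
    script = harness_dir / "swe_bench_pro_eval.py"
    if not script.is_file():
        raise FileNotFoundError(
            f"Harness checkout at {harness_dir} is missing swe_bench_pro_eval.py; "
            "delete the cached checkout and re-download the pinned commit."
        )
```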

[KEY INSIGHT]
The Template Method pattern implementation is exemplary - four hook methods (get_official_docker_image, extract_custom_tag, should_wrap_instance, get_source_repo_path) allow SWE-Bench Pro to customize behavior without duplicating the inference/evaluation pipeline. This is exactly how to extend evaluation infrastructure.

[RISK ASSESSMENT]
🟡 MEDIUM

This PR adds significant benchmark infrastructure with the following risk factors:

Elevated Risks:

  • External Code Download: Downloads official harness from GitHub (mitigated by pinned commit SHA + post-extraction validation)
  • CI Resource Usage: Workflow triggers Docker image builds that can consume significant compute (mitigated by label-based triggering with 50/200/full limits)
  • New Infrastructure: Adds new benchmark package with custom build/eval pipeline (mitigated by comprehensive unit tests including error cases)

Mitigating Factors:

  • Clean extension of battle-tested SWE-Bench infrastructure via subclassing
  • Comprehensive test coverage (8 tests covering happy path, edge cases, and failure modes)
  • Validation of external dependencies before use
  • Documentation includes usage instructions and architectural notes
  • No changes to existing benchmarks (zero regression risk)

Recommendation: Safe to merge. The architecture is solid, previous critical issues have been resolved, and test coverage provides confidence in correctness.

[VERDICT]
Worth merging: Core logic is sound, critical issues resolved, comprehensive testing validates behavior.
