Conversation
Reuse the existing SWE-Bench inference flow via subclass hooks, add benchmark-specific build and evaluation wrappers, and add a CI workflow for building SWE-Bench Pro images. Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is sound but needs stronger validation evidence.
[IMPORTANT]
- [PR Description] Missing Evidence: The validation section shows `--help` commands and test runs, but no concrete proof of successful end-to-end inference or evaluation. Per repository standards for CLI/backend changes, please add an Evidence section showing:
  - Actual `swebenchpro-infer` command with real parameters and output (even a single instance with `--n-limit 1`)
  - Actual `swebenchpro-eval` command and resulting report
  - Link to agent conversation if this work came from an agent run
[RISK ASSESSMENT]
- Overall PR: 🟡 MEDIUM
New feature addition with good architecture and test coverage, but lacks end-to-end validation evidence. Risk factors: complex workflow logic, external harness dependency, and error handling that could mask data issues. Not HIGH because existing functionality is preserved via backward-compatible changes and the subclassing approach provides clean separation.
VERDICT: ✅ Worth merging after addressing evidence requirement
KEY INSIGHT: The subclassing approach with hook methods (`get_official_docker_image`, `extract_custom_tag`, `should_wrap_instance`, `get_source_repo_path`) elegantly reuses existing SWE-Bench infrastructure while maintaining clean separation for Pro-specific behavior.
Trim overly long SWE-Bench Pro Docker Hub tags into stable, registry-safe custom tags so phased image builds succeed on long instance identifiers. Co-authored-by: openhands <openhands@all-hands.dev>
Account for the phased image prefix and target suffix so SWE-Bench Pro custom tags stay below Docker tag limits during final agent image assembly. Co-authored-by: openhands <openhands@all-hands.dev>
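The tag-trimming logic these two commits describe can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: `PHASED_PREFIX`, `TARGET_SUFFIX`, and the exact headroom arithmetic are assumptions; the 128-character ceiling is Docker's documented maximum tag length.

```python
import hashlib
import re

DOCKER_TAG_MAX_LEN = 128          # Docker's documented tag length limit
PHASED_PREFIX = "phased-"         # hypothetical prefix added during phased builds
TARGET_SUFFIX = "-agent"          # hypothetical suffix for the final agent image
_HEADROOM = len(PHASED_PREFIX) + len(TARGET_SUFFIX)


def extract_custom_tag(instance_id: str) -> str:
    """Derive a stable, registry-safe tag from a possibly very long instance id."""
    # Docker tags may only contain [A-Za-z0-9_.-]; normalize everything else.
    safe = re.sub(r"[^A-Za-z0-9_.-]+", "-", instance_id).strip("-")
    budget = DOCKER_TAG_MAX_LEN - _HEADROOM
    if len(safe) <= budget:
        return safe
    # Too long: keep a readable head and append a short digest so distinct
    # instance ids never collide after truncation.
    digest = hashlib.sha256(instance_id.encode()).hexdigest()[:12]
    return f"{safe[: budget - len(digest) - 1]}-{digest}"
```

The digest suffix is what makes the trimmed tag "stable": the same instance id always maps to the same tag, and two long ids that share a prefix still get distinct tags.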
Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
🟡 Acceptable - Clean architecture with good reuse via subclassing. The implementation is solid but has one critical error handling issue that must be addressed.
The subclassing pattern to extend SWE-Bench infrastructure for SWE-Bench Pro is well-designed. The hook methods (`get_official_docker_image`, `extract_custom_tag`, etc.) provide clean extension points without duplicating logic.
Review Summary
[CRITICAL ISSUES]
[IMPROVEMENT OPPORTUNITIES]
[DESIGN STRENGTHS]
[RISK ASSESSMENT]
This PR adds a new benchmark evaluation path that processes user-generated patches and runs evaluation in Docker containers. The risk is elevated due to:
The architecture is sound and follows existing patterns, mitigating some risk. The critical error handling issue must be fixed before merge.
Recommendation: Fix the critical error handling issue. Evidence would strengthen confidence in the implementation but is not a blocker given the strong test coverage.
VERDICT: ✅ Worth merging after fix: Core architecture is excellent, clean extension pattern. Must fix the silent error handling before merge. Evidence would be nice to have but not blocking.
KEY INSIGHT: The subclassing pattern successfully isolates SWE-Bench Pro differences (Docker image tags from dataset, …
Co-authored-by: openhands <openhands@all-hands.dev>
all-hands-bot
left a comment
🟡 Acceptable - Clean architecture with good reuse via subclassing. Previous critical issues have been addressed.
[IMPROVEMENT OPPORTUNITIES]
See inline comment on string concatenation style.
[RISK ASSESSMENT]
This PR adds new benchmark infrastructure that:
- Downloads external code from a pinned GitHub commit (validated post-extraction; see the sketch below)
- Triggers CI image builds via workflow labels (resource usage)
- Extends existing SWE-Bench via clean subclass hooks (no breaking changes)
- Has comprehensive test coverage including malformed input handling
Risk is medium due to external dependencies and CI resource implications, but well-controlled through validation, pinned commits, and the purely additive nature of changes.
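For context, the pinned-commit download with post-extraction validation flagged in the list above might look roughly like this. The repository slug, commit SHA, and function name below are placeholders, not the PR's actual values.

```python
import io
import tarfile
import urllib.request
from pathlib import Path

# Placeholders: the PR pins its own repository and commit SHA.
HARNESS_REPO = "example-org/swe-bench-pro-harness"
HARNESS_COMMIT = "0123456789abcdef0123456789abcdef01234567"
EVAL_SCRIPT = "swe_bench_pro_eval.py"


def download_harness(dest: Path) -> Path:
    """Fetch the harness tarball at a pinned commit and validate it after extraction."""
    url = f"https://github.com/{HARNESS_REPO}/archive/{HARNESS_COMMIT}.tar.gz"
    with urllib.request.urlopen(url) as resp:
        data = resp.read()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(dest)
    # GitHub tarballs unpack into a single "<repo>-<sha>" directory; dest is
    # assumed to be a fresh, empty cache directory.
    extracted = next(dest.iterdir())
    # Fail fast if the eval entry point is missing rather than discovering
    # the problem mid-evaluation.
    if not (extracted / EVAL_SCRIPT).is_file():
        raise FileNotFoundError(f"{EVAL_SCRIPT} not found in harness checkout")
    return extracted
```

Pinning the SHA fixes the supply-chain surface to one auditable revision, and the existence check catches both a changed upstream layout and a corrupted cache.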
VERDICT:
✅ Worth merging - Core logic is sound, previous critical issues resolved, minor style improvement suggested
KEY INSIGHT:
The subclass hook pattern (`get_official_docker_image`, `extract_custom_tag`, `get_source_repo_path`) provides clean extension points that maximize code reuse while supporting benchmark-specific requirements.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a `.agents/skills/custom-codereview-guide.md` file to your branch (or edit it if one already exists) with the `/codereview` trigger and the context the reviewer is missing.
- Re-request a review - the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
- When your PR is merged, the guideline file goes through normal code review by repository maintainers.
Resolve with AI? Install the iterate skill in your agent and run `/iterate` to automatically drive this PR through CI, review, and QA until it's merge-ready.
all-hands-bot
left a comment
🟢 Good taste - Clean architecture with excellent reuse via subclassing.
[IMPROVEMENTS DELIVERED]
The follow-up commits successfully addressed the critical issues from previous reviews:
- ✅ Error Handling: `convert_to_swebenchpro_format` now fails fast when encountering malformed input (lines 79-83), preventing silent data loss (see the sketch after this list)
- ✅ Harness Validation: `_validate_harness_dir` ensures `swe_bench_pro_eval.py` exists before evaluation (lines 112-119), with validation called both on cache hits and after extraction
- ✅ Test Coverage: Added tests for malformed input handling and missing harness script scenarios
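A minimal sketch of the fail-fast conversion described in the first bullet, assuming a simple row schema; the required field names (`instance_id`, `model_patch`) are illustrative, not taken from the PR.

```python
def convert_to_swebenchpro_format(rows: list[dict]) -> list[dict]:
    """Fail fast on malformed rows instead of silently skipping them."""
    required = ("instance_id", "model_patch")  # assumed field names
    converted = []
    for i, row in enumerate(rows):
        missing = [key for key in required if not row.get(key)]
        if missing:
            # Raising here surfaces bad data immediately; skipping would
            # silently shrink the evaluation set.
            raise ValueError(f"row {i} is malformed: missing {missing}")
        converted.append(
            {"instance_id": row["instance_id"], "patch": row["model_patch"]}
        )
    return converted
```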
[KEY INSIGHT]
The Template Method pattern implementation is exemplary - four hook methods (`get_official_docker_image`, `extract_custom_tag`, `should_wrap_instance`, `get_source_repo_path`) allow SWE-Bench Pro to customize behavior without duplicating the inference/evaluation pipeline. This is exactly how to extend evaluation infrastructure.
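As a rough illustration of why that pattern composes well (the class and method bodies below are invented, not the PR's code):

```python
class SWEBenchRunner:
    """Base pipeline; subclasses override only the narrow hook methods."""

    def get_official_docker_image(self, instance: dict) -> str:
        # Classic SWE-Bench derives the image name from a naming convention.
        return f"swebench/{instance['instance_id']}:latest"

    def extract_custom_tag(self, instance_id: str) -> str:
        return instance_id

    def should_wrap_instance(self, instance: dict) -> bool:
        return True

    def get_source_repo_path(self) -> str:
        return "/testbed"

    def run(self, instance: dict) -> None:
        # The shared pipeline calls the hooks; it never branches on benchmark.
        image = self.get_official_docker_image(instance)
        tag = self.extract_custom_tag(instance["instance_id"])
        print(f"running {instance['instance_id']} on {image} tagged {tag}")


class SWEBenchProRunner(SWEBenchRunner):
    def get_official_docker_image(self, instance: dict) -> str:
        # Pro reads the image from a dataset field instead of a convention.
        return instance["docker_image"]
```

The shared `run` pipeline never needs a `if benchmark == "pro"` branch; each benchmark difference lives in exactly one overridden hook.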
[RISK ASSESSMENT]
This PR adds significant benchmark infrastructure with the following risk factors:
Elevated Risks:
- External Code Download: Downloads official harness from GitHub (mitigated by pinned commit SHA + post-extraction validation)
- CI Resource Usage: Workflow triggers Docker image builds that can consume significant compute (mitigated by label-based triggering with 50/200/full limits)
- New Infrastructure: Adds new benchmark package with custom build/eval pipeline (mitigated by comprehensive unit tests including error cases)
Mitigating Factors:
- Clean extension of battle-tested SWE-Bench infrastructure via subclassing
- Comprehensive test coverage (8 tests covering happy path, edge cases, and failure modes)
- Validation of external dependencies before use
- Documentation includes usage instructions and architectural notes
- No changes to existing benchmarks (zero regression risk)
Recommendation: Safe to merge. The architecture is solid, previous critical issues have been resolved, and test coverage provides confidence in correctness.
[VERDICT]
✅ Worth merging: Core logic is sound, critical issues resolved, comprehensive testing validates behavior.
Summary
- `benchmarks/swebenchpro` benchmark package that reuses the existing SWE-Bench inference flow via subclass hooks and benchmark-specific build/eval wrappers
- `build-swebenchpro-images` workflow with PR labels for 50/200/full image builds
Validation
- `uv run pytest tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py`
- `uv run pre-commit run --files README.md pyproject.toml .github/workflows/build-swebenchpro-images.yml benchmarks/swebench/build_base_images.py benchmarks/swebench/run_infer.py benchmarks/swebenchpro/__init__.py benchmarks/swebenchpro/constants.py benchmarks/swebenchpro/config.py benchmarks/swebenchpro/build_images.py benchmarks/swebenchpro/run_infer.py benchmarks/swebenchpro/eval_infer.py benchmarks/swebenchpro/README.md benchmarks/swebenchpro/prompts/default.j2 tests/test_swebenchpro.py`
- `uv run swebenchpro-infer --help`
- `uv run swebenchpro-eval --help`
Notes
- … (`openhands/software-agent-sdk`) via `workflow_call`.
This PR description was created by an AI agent (OpenHands) on behalf of the user.
Follow-up (2026-04-29)
- `024891f5` by teaching the test harness to patch the module that defines each benchmark's inherited evaluation implementation
  - `PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_metrics.py tests/test_swebench_eval_infer.py tests/test_swebenchpro.py tests/test_phased_build.py tests/test_prompt_path.py -q` on the updated head (65 passed)
  - `/workspace/project/benchmarks/.venv/bin/pre-commit run --files tests/test_metrics.py` before pushing
- `69979f72` by making SWE-Bench Pro conversion fail fast on malformed input rows and by validating that the official harness checkout contains `swe_bench_pro_eval.py` before evaluation
  - `PYTHONPATH=/workspace/project/prs/pr699 /workspace/project/benchmarks/.venv/bin/python -m pytest tests/test_swebenchpro.py tests/test_metrics.py tests/test_swebench_eval_infer.py -q` on the updated head (30 passed)
  - `/workspace/project/benchmarks/.venv/bin/pre-commit run --files benchmarks/swebenchpro/eval_infer.py tests/test_swebenchpro.py tests/test_metrics.py` before pushing
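The `024891f5` fix relies on a standard Python patching rule: a monkeypatch must target the module where a name is looked up, and for an inherited method that is the module that defines it, not the subclass's module. A hedged sketch of a helper such a test harness could use (class and method names in the usage comment are assumptions):

```python
import importlib


def defining_module(cls: type, method_name: str):
    """Walk the MRO to find the module that actually defines a method, so
    module-level collaborators can be patched where the lookup happens."""
    for base in cls.__mro__:
        if method_name in vars(base):
            return importlib.import_module(base.__module__)
    raise AttributeError(f"{cls.__name__} has no method {method_name!r}")


# Usage in a test, assuming pytest's monkeypatch fixture and invented names:
#   module = defining_module(SWEBenchProRunner, "run_evaluation")
#   monkeypatch.setattr(module, "run_docker_eval", fake_eval)
```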