fix #675: Modify commit0 initial message to prevent cheating by juanmichelini · Pull Request #676 · OpenHands/benchmarks

juanmichelini · 2026-04-20T17:32:42Z

Summary

Modify the commit0 initial prompt template (benchmarks/commit0/prompts/default.j2) to indicate that implementations should be created from scratch without copying or looking at similar code online.

Change

Added an anti-cheating instruction to the default.j2 template:

Implement the solution from scratch. Do NOT use git clone, pip install, npm install, or
any other method to obtain the target package/library from external sources. The
code must be written entirely by you without copying or looking at similar code online.

This prevents agents from using git clone, pip install, npm install, or other methods to obtain the target package from external sources when tasked to implement something.
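As an illustration of how compliance with this instruction could be checked after a run, here is a minimal sketch that scans a list of shell commands for the fetch commands named in the prompt. The pattern list, the function name `find_fetch_commands`, and the sample history are all hypothetical; nothing here is taken from the benchmarks repository.

```python
import re

# Hypothetical patterns for commands that fetch code from external sources.
# The list mirrors the commands named in the prompt; it is not exhaustive.
FETCH_PATTERNS = [
    re.compile(r"\bgit\s+clone\b"),
    re.compile(r"\bpip3?\s+install\b"),
    re.compile(r"\bnpm\s+install\b"),
]


def find_fetch_commands(commands):
    """Return the commands that match any of the fetch patterns."""
    return [
        cmd for cmd in commands
        if any(pattern.search(cmd) for pattern in FETCH_PATTERNS)
    ]


# Example shell history (made up for illustration).
history = [
    "ls -la",
    "pip install babel",
    "python -m pytest tests/",
    "git clone https://example.com/some/repo.git",
]
flagged = find_fetch_commands(history)
print(flagged)
```

A check like this is deliberately coarse: it would also flag legitimate commands such as installing a test dependency, so a real audit would likely need to compare the installed or cloned name against the target package.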

Fixes

Fixes #675

Modify the default.j2 prompt template to indicate that implementations should be created from scratch without copying or looking at similar code online. This prevents agents from using git clone, pip install, npm install, or other methods to obtain the target package from external sources. Fixes #675

VascoSch92 · 2026-04-20T17:58:42Z

@OpenHands Look at the CI. Why are the tests failing? Don't modify the code, just report your findings here.

openhands-ai · 2026-04-20T17:59:01Z

I'm on it! VascoSch92 can track my progress at all-hands.dev

VascoSch92 · 2026-04-20T17:59:27Z

CI Failure Analysis

The CI is failing due to an unrelated test failure in tests/test_multimodal_phased_build.py::TestMultimodalParser::test_defaults.

Root Cause

The test test_defaults in tests/test_multimodal_phased_build.py (line 290) expects args.select to be None by default:

assert args.select is None

However, the actual value is:

'/home/runner/work/benchmarks/benchmarks/benchmarks/swebenchmultimodal/resolved_instances.txt'

This indicates that a recent change to the benchmarks/swebenchmultimodal/build_images.py module updated the default value for the --select argument to point to the resolved_instances.txt file, but the corresponding test was not updated to reflect this new default behavior.
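The mismatch described above can be reproduced in isolation with a small argparse sketch. The `make_parser` helper and the relative path in `DEFAULT_SELECT` are hypothetical stand-ins; the actual parser in build_images.py may be structured differently.

```python
import argparse
from pathlib import Path

# Hypothetical stand-in for the new default described above.
DEFAULT_SELECT = str(
    Path("benchmarks") / "swebenchmultimodal" / "resolved_instances.txt"
)


def make_parser(select_default=None):
    """Build a parser with a --select option and a configurable default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--select", default=select_default)
    return parser


# Old behaviour: no default, so `args.select is None` holds.
old_args = make_parser().parse_args([])
assert old_args.select is None

# New behaviour: the default is a path, so the same check no longer holds.
new_args = make_parser(DEFAULT_SELECT).parse_args([])
print(new_args.select is None)  # False
print(new_args.select.endswith("resolved_instances.txt"))  # True
```

Under this assumption, the test would need to assert the new default path instead of `None` for CI to pass again.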

Relevance to This PR

This failure is NOT caused by the changes in this PR (which only modifies benchmarks/commit0/prompts/default.j2). The test failure is a pre-existing issue in the codebase related to SWE-bench Multimodal image building, not commit0 prompts.

Summary

✅ Pre-commit checks passed
❌ Run tests failed (1 failing test)
The failing test is unrelated to this PR's changes

This analysis was created by an AI assistant (OpenHands) on behalf of the user.

openhands-ai · 2026-04-20T17:59:44Z

Summary

I analyzed the CI failure for PR #676 as requested and reported my findings without making any code changes.

Answer to the Question: Why are the tests failing?

The CI is failing due to an unrelated test failure in tests/test_multimodal_phased_build.py::TestMultimodalParser::test_defaults.

Root Cause: The test expects args.select to be None by default, but a recent change to benchmarks/swebenchmultimodal/build_images.py updated the default value for the --select argument to point to resolved_instances.txt. The test was not updated to reflect this change.

Key Finding: This failure is NOT caused by this PR's changes (which only modify benchmarks/commit0/prompts/default.j2). It's a pre-existing issue in the main branch related to SWE-bench Multimodal image building.

CI Status

✅ Pre-commit checks: passed
❌ Run tests: 1 failing test (unrelated to PR)
⏸️ PR Review: skipped

Actions Taken

✅ Investigated CI failure using gh pr checks and gh run view --log-failed
✅ Identified the specific failing test and root cause
✅ Posted analysis as a comment on PR #676 (fix #675: Modify commit0 initial message to prevent cheating)
✅ Did NOT modify any code (as instructed)

juanmichelini · 2026-04-21T00:05:07Z

@VascoSch92
This prompt did better, but it is not perfect. Tested with opus 4-6:

Run | Instances with cheating
Previous (22098671747) | 7/16 (babel, chardet, cookiecutter, imapclient, jinja, parsel, voluptuous)
New (24683071050) | 1/16 (babel only). The agent ran pip install babel to get data, not code, but still.

Score change: down 2 instances, from 9/16 to 7/16.

Rerunning with new prompt

juanmichelini · 2026-04-21T14:19:13Z

@VascoSch92 new prompt got 0 cheating

https://results.eval.all-hands.dev/commit0/litellm_proxy-anthropic-claude-opus-4-6/24696947511/results.tar.gz

and the score is not too bad either: 8/16 vs. the cheating one that had 9/16
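To put the reported counts side by side, the sketch below converts them to percentages. It assumes that the 7/16 and 1/16 figures are counts of instances with cheating, that the 7/16 score belongs to run 24683071050, and that the run ID in the results URL (24696947511) identifies the rerun; the labels are illustrative only.

```python
from fractions import Fraction

TOTAL = 16  # instances per run, as reported in the comments

# (label, instances with cheating, score), as read from the comments.
# The pairing of scores with run IDs is an assumption.
runs = [
    ("previous prompt (22098671747)", 7, 9),
    ("new prompt (24683071050)", 1, 7),
    ("rerun (24696947511)", 0, 8),
]


def pct(count, total=TOTAL):
    """Return count/total as a percentage."""
    return float(Fraction(count, total) * 100)


for label, cheating, score in runs:
    print(f"{label}: cheating {pct(cheating):.2f}%, score {pct(score):.2f}%")
```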

all-hands-bot

🟢 Good taste - Clean, direct solution that addresses issue #675.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This change only adds instructional text to a prompt template without modifying any code logic. It's backward compatible and cannot break existing functionality. The worst-case scenario is that agents ignore the instructions, which is the current state.

VERDICT:
✅ Worth merging: Clear anti-cheating instructions that improve benchmark integrity.

KEY INSIGHT:
Simple, pragmatic solution—adds explicit instructions without over-engineering the problem.

openhands-ai Bot mentioned this pull request Apr 20, 2026

Modify commit0 initial message to prevent cheating #675

Closed

VascoSch92 approved these changes Apr 20, 2026

Update default.j2

627bec4

juanmichelini marked this pull request as ready for review April 21, 2026 14:19

juanmichelini merged commit bf02886 into main Apr 21, 2026
2 of 3 checks passed

all-hands-bot approved these changes Apr 21, 2026

juanmichelini mentioned this pull request Apr 24, 2026

DO_NOT_MERGE_FOR_TESTING_ONLY - Simulate eval_infer error #680

Closed

juanmichelini merged 2 commits into main from openhands/fix-issue-675-commit0-anti-cheating
