Skip to content

fix #675: Modify commit0 initial message to prevent cheating#676

Merged
juanmichelini merged 2 commits intomainfrom
openhands/fix-issue-675-commit0-anti-cheating
Apr 21, 2026
Merged

fix #675: Modify commit0 initial message to prevent cheating#676
juanmichelini merged 2 commits intomainfrom
openhands/fix-issue-675-commit0-anti-cheating

Conversation

@juanmichelini
Copy link
Copy Markdown
Collaborator

Summary

Modify the commit0 initial prompt template (benchmarks/commit0/prompts/default.j2) to indicate that implementations should be created from scratch without copying or looking at similar code online.

Change

Added an anti-cheating instruction to the default.j2 template:

Implement the solution from scratch. Do NOT use git clone, pip install, npm install, or
any other method to obtain the target package/library from external sources. The
code must be written entirely by you without copying or looking at similar code online.

This prevents agents from using git clone, pip install, npm install, or other methods to obtain the target package from external sources when tasked to implement something.

Fixes

Fixes #675

@juanmichelini can click here to continue refining the PR

Modify the default.j2 prompt template to indicate that implementations should
be created from scratch without copying or looking at similar code online.
This prevents agents from using git clone, pip install, npm install, or other
methods to obtain the target package from external sources.

Fixes #675
@VascoSch92
Copy link
Copy Markdown
Collaborator

@OpenHands Look at the CI. Why the tests are failing? Don't modify the code jsut report here your finding.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 20, 2026

I'm on it! VascoSch92 can track my progress at all-hands.dev

Copy link
Copy Markdown
Collaborator

CI Failure Analysis

The CI is failing due to an unrelated test failure in tests/test_multimodal_phased_build.py::TestMultimodalParser::test_defaults.

Root Cause

The test test_defaults in tests/test_multimodal_phased_build.py (line 290) expects args.select to be None by default:

assert args.select is None

However, the actual value is:

'/home/runner/work/benchmarks/benchmarks/benchmarks/swebenchmultimodal/resolved_instances.txt'

This indicates that a recent change to the benchmarks/swebenchmultimodal/build_images.py module updated the default value for the --select argument to point to the resolved_instances.txt file, but the corresponding test was not updated to reflect this new default behavior.

Relevance to This PR

This failure is NOT caused by the changes in this PR (which only modifies benchmarks/commit0/prompts/default.j2). The test failure is a pre-existing issue in the codebase related to SWE-bench Multimodal image building, not commit0 prompts.

Summary

  • ✅ Pre-commit checks passed
  • ❌ Run tests failed (1 failing test)
  • The failing test is unrelated to this PR's changes

This analysis was created by an AI assistant (OpenHands) on behalf of the user.

@openhands-ai
Copy link
Copy Markdown

openhands-ai Bot commented Apr 20, 2026

Summary

I analyzed the CI failure for PR #676 as requested and reported my findings without making any code changes.

Answer to the Question: Why are the tests failing?

The CI is failing due to an unrelated test failure in tests/test_multimodal_phased_build.py::TestMultimodalParser::test_defaults.

Root Cause: The test expects args.select to be None by default, but a recent change to benchmarks/swebenchmultimodal/build_images.py updated the default value for the --select argument to point to resolved_instances.txt. The test was not updated to reflect this change.

Key Finding: This failure is NOT caused by this PR's changes (which only modify benchmarks/commit0/prompts/default.j2). It's a pre-existing issue in the main branch related to SWE-bench Multimodal image building.

CI Status

  • ✅ Pre-commit checks: passed
  • ❌ Run tests: 1 failing test (unrelated to PR)
  • ⏸️ PR Review: skipped

Actions Taken

@juanmichelini
Copy link
Copy Markdown
Collaborator Author

juanmichelini commented Apr 21, 2026

@VascoSch92
this prompt did better, but not perfect, tested it with opus 4-6:

Previous (22098671747) | 7/16 (babel, chardet, cookiecutter, imapclient, jinja, parsel, voluptuous)
New (24683071050) | 1/16 (babel only) Agent did pip install babel to get data, not code, but still.

Score change: 2 instances from 9/16 to 7/16.

Rerunning with new prompt

@juanmichelini
Copy link
Copy Markdown
Collaborator Author

@VascoSch92 new prompt got 0 cheating

https://results.eval.all-hands.dev/commit0/litellm_proxy-anthropic-claude-opus-4-6/24696947511/results.tar.gz

and score is not to bad either 8/16 vs the cheating one that had 9/16

@juanmichelini juanmichelini marked this pull request as ready for review April 21, 2026 14:19
@juanmichelini juanmichelini merged commit bf02886 into main Apr 21, 2026
2 of 3 checks passed
Copy link
Copy Markdown
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Good taste - Clean, direct solution that addresses issue #675.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This change only adds instructional text to a prompt template without modifying any code logic. It's backward compatible and cannot break existing functionality. The worst-case scenario is that agents ignore the instructions, which is the current state.

VERDICT:
Worth merging: Clear anti-cheating instructions that improve benchmark integrity.

KEY INSIGHT:
Simple, pragmatic solution—adds explicit instructions without over-engineering the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Modify commit0 initial message to prevent cheating

4 participants