
fix(commit0): fix ACP packages and switch to ubuntu-latest-8core runner #693

Merged

simonrosenberg merged 13 commits into main from fix/commit0-pin-acp-package-versions on Apr 23, 2026

Conversation


@simonrosenberg simonrosenberg commented Apr 23, 2026

Root causes

Three separate issues caused commit0 image builds to fail consistently starting 2026-04-23.

1. Wrong / deprecated claude-agent-acp package

The commit0 Dockerfile was installing @zed-industries/claude-agent-acp while the SDK Dockerfile uses @agentclientprotocol/claude-agent-acp — the canonical package that @zed-industries/claude-agent-acp was renamed to. This was confirmed by the deprecation warning seen in build logs:

npm warn deprecated @zed-industries/claude-agent-acp@0.23.1: This package has been renamed to @agentclientprotocol/claude-agent-acp. Please migrate to continue receiving updates.

Using the deprecated package meant commit0 ACP images were 7 minor versions behind the SDK (0.23.1 vs 0.30.0) and would never receive future updates, risking protocol incompatibility with the SDK.

2. Unpinned @google/gemini-cli picked up breaking 0.39.0

@google/gemini-cli 0.39.0 was published on 2026-04-23 (same day as the failures). It introduced bundled ripgrep binaries via Node.js SEA, making the package 89 MB unpacked. With no version pin, every cold build silently picked up the new version.

3. Runner too small for cold builds

The ubuntu-24.04 runner (2-core, ~14 GiB free disk) was too constrained for building commit0 images when no cached images existed in the registry. The ubuntu-latest-8core runner (8-core, 31 GiB RAM, 237 GiB free disk) used by swtbench handles this without issue.

Fix

Dockerfile.agent-layer-commit0:

  • Switch @zed-industries/claude-agent-acp → @agentclientprotocol/claude-agent-acp@0.30.0 (matches SDK)
  • Pin @zed-industries/codex-acp@0.11.1 (unchanged package, pinned for stability)
  • Pin @google/gemini-cli@0.38.0 (last known-good version before 0.39.0)
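Taken together, the pins above would look roughly like this in the Dockerfile (a sketch only — the real Dockerfile.agent-layer-commit0 may split the installs across layers or use different npm flags):

```dockerfile
# Sketch: pin all three ACP CLIs so cold builds can't silently pick up
# a new release (versions taken from the PR description).
RUN npm install -g \
    @agentclientprotocol/claude-agent-acp@0.30.0 \
    @zed-industries/codex-acp@0.11.1 \
    @google/gemini-cli@0.38.0
```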

build-commit0-images.yml:

  • Switch runner from ubuntu-24.04 to ubuntu-latest-8core
  • Add preflight docker buildx prune + docker builder prune + docker system prune to clear accumulated cache from previous runs on the sticky runner
  • Keep MAX_WORKERS: '4' (original value, works correctly on the larger runner)
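The preflight cleanup could be sketched as a workflow step like the following (step name and exact flags are assumptions; `--keep-storage 30g` matches the value mentioned later in the commit history):

```yaml
# Sketch of a preflight cleanup step for the sticky runner; names are
# illustrative, not copied from the actual workflow file.
- name: Prune leftover build cache
  run: |
    docker buildx prune -af --keep-storage 30g || true
    docker builder prune -af --keep-storage 30g || true
    docker system prune -f || true
```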

Validation

Build run 24861799673 completed successfully in ~10 minutes with all 16 lite-split images built and pushed.

AI disclosure

This PR was prepared by Claude Sonnet 4.6 on behalf of @simonrosenberg.

@google/gemini-cli 0.39.0 was published on 2026-04-23 (the same day
commit0 image builds started failing). Builds previously relied on
unpinned npm install of all three ACP CLIs, so any breaking release
would silently become the new default.

Pin to the last known-good versions:
- @zed-industries/claude-agent-acp@0.23.1  (published 2026-03-26)
- @zed-industries/codex-acp@0.11.1         (published 2026-03-31)
- @google/gemini-cli@0.38.0               (published 2026-04-12)

The Dockerfile content hash in agent_layer_content_hash() will change,
so existing registry images are automatically invalidated and rebuilt
on the next eval run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
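The invalidation mechanism can be illustrated with a minimal sketch; the real agent_layer_content_hash may hash more inputs and use a different digest length:

```python
import hashlib

def agent_layer_content_hash(dockerfile_text: str) -> str:
    """Short, stable tag suffix derived from the Dockerfile contents.

    Any edit to the Dockerfile (such as pinning a package version)
    changes the hash, so previously pushed registry images no longer
    match and are rebuilt on the next eval run.
    """
    return hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()[:12]

unpinned = "RUN npm install -g @google/gemini-cli\n"
pinned = "RUN npm install -g @google/gemini-cli@0.38.0\n"
```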
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Simple, focused fix that directly addresses the root cause.

The version pinning approach is correct. One gap: PR description lacks concrete evidence that the build succeeds with these versions.

Comment thread benchmarks/utils/Dockerfile.agent-layer-commit0
@all-hands-bot
Collaborator

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟢 LOW

This is a straightforward dependency pinning change with minimal risk:

Low-risk factors:

  • No logic changes, only version constraints added
  • Pins to known-stable versions (documented with publication dates)
  • Automatically invalidates cache via content hash, forcing clean rebuild
  • Solves a real production issue (broken builds)
  • Single-line change per package, easy to verify and rollback

⚠️ Minor consideration:

  • Missing build evidence in PR description (no logs showing successful build with pinned versions)
  • However, risk of merging is low since this can't make things worse than the current broken state

Recommendation: Safe to merge. The fix is correct and minimal. Consider adding build success evidence for documentation purposes, but not blocking.

Debug Agent and others added 12 commits April 23, 2026 11:08
The build has been failing at ~8 min with no logs (all post-build steps
show empty conclusion, including if:always() archive). This means the
runner is killed before any Python-level output can flush.

Three changes to surface the actual error:

1. _assemble_commit0_image: replace run_docker_build_layer (which
   buffers all docker output via capture_output=True) with a direct
   subprocess.run(cmd) call. ProcessPoolExecutor workers inherit fd 1/2
   from the parent, so without capture_output the docker build streams
   directly to the GH Actions log in real-time — visible even if the
   runner is subsequently killed.

2. Disk space logging via os.write(2, ...) before/after each image build
   and in the main assembly loop. os.write bypasses capture_output's
   Python-level redirect so it always reaches the GH Actions log.

3. Workflow: add a pre-flight "disk and Docker status" step (df -h,
   free -h, docker system df) before the build starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
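The distinction between a Python-level redirect and the OS-level file descriptor can be demonstrated in isolation (a sketch, unrelated to the actual build code):

```python
import contextlib
import io
import os
import sys

buf = io.StringIO()
with contextlib.redirect_stderr(buf):
    # print() goes through sys.stderr, which redirect_stderr swaps out.
    print("python-level", file=sys.stderr)
    # os.write(2, ...) targets file descriptor 2 directly, so a
    # Python-level redirect cannot intercept it.
    os.write(2, b"fd-level\n")
```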
ubuntu-24.04 runners have ~14 GiB free disk. Building 16 commit0 images
cold (each with a large /agent-server layer) fills the BuildKit content
store during the export/push phase and kills the runner — same root cause
as swtbench/swebench (fixed in PR #690), but on a smaller runner.

Two prune points:
1. Pre-assembly: docker buildx prune -af before starting the image loop,
   clearing cache from the builder-image build phase.
2. Post-push: after each successful image push, run docker rmi + system
   prune + builder prune --keep-storage 8g to prevent cumulative disk
   exhaustion across 16 sequential+concurrent builds.

The npm version pinning (previous commit) was also necessary — it fixed
the earlier 8-minute failure — but the disk cleanup is needed to get
all 16 images through the export phase on a 14 GiB runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous approach (per-image prune from worker processes + pre-assembly
buildx prune) had two problems:
1. docker buildx prune -af before assembly cleared the builder image from
   BuildKit cache, forcing all 4 workers to re-pull ~2 GiB simultaneously
   → immediate disk spike → runner killed at ~9 min.
2. docker builder prune from concurrent workers races against sibling
   builds that still need the cache being pruned.

Fix: process images in batches of max_workers. All workers in a batch
finish before the next starts. The main process then prunes (docker system
prune -f + docker builder prune --keep-storage 8g) safely, with no active
builds competing for the cache. Shared layers (builder, Node.js, npm,
/agent-server) stay within the 8 GiB keep-storage budget and are reused
across batches; only the per-image base layer is re-pulled each batch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
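The batch-then-prune pattern described above can be sketched as follows. The real code uses ProcessPoolExecutor and docker commands; threads and stand-in functions keep this sketch self-contained and runnable:

```python
from concurrent.futures import ThreadPoolExecutor

prune_calls = []

def build_image(name: str) -> str:
    """Stand-in for one docker build + push."""
    return f"built:{name}"

def prune() -> None:
    """Stand-in for docker system prune + builder prune --keep-storage 8g."""
    prune_calls.append(len(prune_calls))

def assemble_in_batches(images, max_workers=4):
    results = []
    for start in range(0, len(images), max_workers):
        batch = images[start:start + max_workers]
        # Every worker in the batch finishes before we leave the `with`
        # block, so the prune below never races an in-flight build.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results.extend(pool.map(build_image, batch))
        prune()
    return results
```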
ubuntu-24.04 runners have ~14 GiB free disk. With max_workers=4,
a batch of 4 concurrent builds peaks at ~12 GiB (4 base images ×
~1.5 GiB uncompressed + ~5 GiB shared layers) before the batch
completes and the between-batch prune can run — leaving no headroom
and killing the runner.

With max_workers=1, peak disk per image is ~6 GiB (1 base image +
shared layers), well within the 14 GiB limit. Shared layers (builder,
Node.js, npm, /agent-server) stay cached at ≤8 GiB between images;
only the per-image base layer (~300 MB compressed) is re-pulled each
time. 16 lite-split images complete in ~60-70 min, well within the
600-min timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ubuntu-24.04 (2-core, 7 GiB RAM, ~14 GiB free disk) is too small for
building commit0 images: the builder image pull alone (~2-3 GiB compressed,
larger uncompressed) consumes most of the available disk/RAM before the
first image even finishes, killing the runner.

Switch to ubuntu-latest-8core, the same runner swtbench already uses,
which has sufficient disk and RAM for multi-image builds. Restore
max_workers=4 since the runner size was the constraint, not concurrency.
The between-batch pruning keeps cumulative disk usage bounded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e runner

Previous failed runs accumulated BuildKit cache on the sticky runner,
leaving it with insufficient disk even before the build starts (3m31s
failure). Add a preflight prune (--keep-storage 30g) matching the
swebench workflow, which clears leftover data from prior runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch _assemble_commit0_image from docker buildx build --push to the
same pattern swebench uses: docker build (loads into local daemon) →
docker push → docker rmi → docker system prune.

docker buildx --push accumulates data in BuildKit's content store which
is hard to clean up during concurrent builds and caused runner OOM/disk
kills. The local daemon approach frees disk immediately after each push
via docker rmi + system prune, keeping disk usage flat across all images.

Also remove all the batching/pruning complexity added during debugging —
it's no longer needed since cleanup is handled per-image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…690)

DOCKER_BUILDKIT=1 makes docker build use the daemon's embedded BuildKit,
which accumulates in /var/lib/docker/buildkit/. Without pruning it, the
cache grows unboundedly across images.

Add docker builder prune -af --keep-storage 30g after each successful
push, matching exactly what swebench's assemble_agent_image does in
PR #690. Also add docker builder prune to the preflight step to clear
the embedded BuildKit cache from previous runs (in addition to the
existing buildx container prune).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The original buildx --push code works fine on a larger runner.
The complexity added during debugging is not needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 concurrent buildx builds overwhelm the single BuildKit container.
The original value of 4 worked; keep it until the larger runner is
confirmed stable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The commit0 Dockerfile was using the deprecated @zed-industries/
claude-agent-acp package, while the SDK Dockerfile uses the canonical
@agentclientprotocol/claude-agent-acp. This means commit0 ACP images
had an incompatible claude-agent-acp (0.23.1, 7 versions behind 0.30.0)
that will never receive updates.

Switch to the same package and version the SDK uses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg changed the title from "fix(commit0): pin ACP npm package versions in assembly Dockerfile" to "fix(commit0): fix ACP packages and switch to ubuntu-latest-8core runner" on Apr 23, 2026
Collaborator

@juanmichelini juanmichelini left a comment


LGTM, thanks!

@simonrosenberg simonrosenberg merged commit 523d7f3 into main Apr 23, 2026
3 checks passed
@simonrosenberg simonrosenberg deleted the fix/commit0-pin-acp-package-versions branch April 23, 2026 22:50