fix(commit0): fix ACP packages and switch to ubuntu-latest-8core runner #693
Merged
simonrosenberg merged 13 commits into main on Apr 23, 2026
Conversation
@google/gemini-cli 0.39.0 was published on 2026-04-23 (the same day commit0 image builds started failing). Builds previously relied on unpinned npm install of all three ACP CLIs, so any breaking release would silently become the new default. Pin to the last known-good versions:

- @zed-industries/claude-agent-acp@0.23.1 (published 2026-03-26)
- @zed-industries/codex-acp@0.11.1 (published 2026-03-31)
- @google/gemini-cli@0.38.0 (published 2026-04-12)

The Dockerfile content hash in agent_layer_content_hash() will change, so existing registry images are automatically invalidated and rebuilt on the next eval run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
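To make the invalidation mechanism concrete, here is a minimal sketch of how a content hash over the Dockerfile can drive image-tag invalidation. The function name agent_layer_content_hash comes from the commit message; the SHA-256 digest, tag format, and file path are assumptions for illustration, not the repository's actual implementation.

```python
import hashlib
from pathlib import Path

def agent_layer_content_hash(dockerfile: Path) -> str:
    """Digest of the Dockerfile text: any edit, such as pinning the ACP CLI
    versions, changes the bytes and therefore the hash."""
    return hashlib.sha256(dockerfile.read_bytes()).hexdigest()[:16]

if __name__ == "__main__":
    # Hypothetical path and tag scheme: because the pinned versions are now
    # part of the file content, the derived tag no longer matches any image
    # in the registry, so the next eval run rebuilds instead of reusing it.
    tag = agent_layer_content_hash(Path("Dockerfile.agent-layer-commit0"))
    print(f"commit0-agent-layer:{tag}")
```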
all-hands-bot (Collaborator) left a comment
🟡 Acceptable - Simple, focused fix that directly addresses the root cause.
The version pinning approach is correct. One gap: PR description lacks concrete evidence that the build succeeds with these versions.
[RISK ASSESSMENT] This is a straightforward dependency pinning change with minimal risk.

Recommendation: Safe to merge. The fix is correct and minimal. Consider adding build success evidence for documentation purposes, but not blocking.
The build has been failing at ~8 min with no logs (all post-build steps show empty conclusion, including the if: always() archive step). This means the runner is killed before any Python-level output can flush. Three changes to surface the actual error:

1. _assemble_commit0_image: replace run_docker_build_layer (which buffers all docker output via capture_output=True) with a direct subprocess.run(cmd) call. ProcessPoolExecutor workers inherit fd 1/2 from the parent, so without capture_output the docker build streams directly to the GH Actions log in real time, visible even if the runner is subsequently killed.
2. Disk space logging via os.write(2, ...) before/after each image build and in the main assembly loop. os.write bypasses capture_output's Python-level redirect, so it always reaches the GH Actions log.
3. Workflow: add a pre-flight "disk and Docker status" step (df -h, free -h, docker system df) before the build starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
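A minimal sketch of the two code-level changes, assuming a simplified signature for _assemble_commit0_image; the label strings and the GiB formatting are illustrative, not the actual code.

```python
import os
import shutil
import subprocess

def _log_disk(label: str) -> None:
    # os.write on fd 2 bypasses any Python-level redirection (such as a
    # parent call's capture_output), so the message always reaches the
    # GitHub Actions log even if stdout/stderr objects are never flushed.
    free_gib = shutil.disk_usage("/").free / 2**30
    os.write(2, f"[disk] {label}: {free_gib:.1f} GiB free\n".encode())

def _assemble_commit0_image(tag: str, context_dir: str) -> None:
    _log_disk(f"before building {tag}")
    # No capture_output here: ProcessPoolExecutor workers inherit fd 1/2
    # from the parent, so docker's progress output streams straight into
    # the Actions log in real time, even if the runner is later killed.
    subprocess.run(["docker", "build", "-t", tag, context_dir], check=True)
    _log_disk(f"after building {tag}")
```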
ubuntu-24.04 runners have ~14 GiB free disk. Building 16 commit0 images cold (each with a large /agent-server layer) fills the BuildKit content store during the export/push phase and kills the runner. This is the same root cause as swtbench/swebench (fixed in PR #690), but on a smaller runner. Two prune points:

1. Pre-assembly: docker buildx prune -af before starting the image loop, clearing cache from the builder-image build phase.
2. Post-push: after each successful image push, run docker rmi + docker system prune + docker builder prune --keep-storage 8g to prevent cumulative disk exhaustion across 16 sequential and concurrent builds.

The npm version pinning (previous commit) was also necessary, since it fixed the earlier 8-minute failure, but the disk cleanup is needed to get all 16 images through the export phase on a 14 GiB runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
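A sketch of the two prune points, assuming the cleanup is invoked via subprocess from the assembly code; the command flags follow the commit message, while the function names and the check=False error handling are assumptions.

```python
import subprocess

def prune_before_assembly() -> None:
    # Clear BuildKit cache left behind by the builder-image build phase
    # before the 16 commit0 image builds begin.
    subprocess.run(["docker", "buildx", "prune", "-af"], check=False)

def cleanup_after_push(tag: str) -> None:
    # Reclaim disk right after a successful push so usage does not pile up
    # across the sequential and concurrent builds.
    subprocess.run(["docker", "rmi", tag], check=False)
    subprocess.run(["docker", "system", "prune", "-f"], check=False)
    subprocess.run(
        ["docker", "builder", "prune", "-f", "--keep-storage", "8g"],
        check=False,
    )
```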
The previous approach (per-image prune from worker processes + pre-assembly buildx prune) had two problems:

1. docker buildx prune -af before assembly cleared the builder image from BuildKit cache, forcing all 4 workers to re-pull ~2 GiB simultaneously → immediate disk spike → runner killed at ~9 min.
2. docker builder prune from concurrent workers races against sibling builds that still need the cache being pruned.

Fix: process images in batches of max_workers. All workers in a batch finish before the next starts. The main process then prunes (docker system prune -f + docker builder prune --keep-storage 8g) safely, with no active builds competing for the cache. Shared layers (builder, Node.js, npm, /agent-server) stay within the 8 GiB keep-storage budget and are reused across batches; only the per-image base layer is re-pulled each batch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
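A sketch of the batch-then-prune scheduling described above. The batching loop and the between-batch prune run in the main process; build_one stands in for the real per-image worker, and the image-list handling is illustrative.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

def build_one(tag: str) -> None:
    # Placeholder for the real per-image worker (_assemble_commit0_image).
    subprocess.run(["docker", "build", "-t", tag, "."], check=True)

def prune_between_batches() -> None:
    # Runs only after an entire batch has finished, so no in-flight build
    # is competing for the cache being pruned.
    subprocess.run(["docker", "system", "prune", "-f"], check=False)
    subprocess.run(
        ["docker", "builder", "prune", "-f", "--keep-storage", "8g"],
        check=False,
    )

def assemble_in_batches(images: list[str], max_workers: int = 4) -> None:
    for start in range(0, len(images), max_workers):
        batch = images[start:start + max_workers]
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            # list() forces completion of the whole batch and re-raises any
            # worker exception before the main process prunes.
            list(pool.map(build_one, batch))
        prune_between_batches()
```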
ubuntu-24.04 runners have ~14 GiB free disk. With max_workers=4, a batch of 4 concurrent builds peaks at ~12 GiB (4 base images × ~1.5 GiB uncompressed + ~5 GiB shared layers) before the batch completes and the between-batch prune can run — leaving no headroom and killing the runner. With max_workers=1, peak disk per image is ~6 GiB (1 base image + shared layers), well within the 14 GiB limit. Shared layers (builder, Node.js, npm, /agent-server) stay cached at ≤8 GiB between images; only the per-image base layer (~300 MB compressed) is re-pulled each time. 16 lite-split images complete in ~60-70 min, well within the 600-min timeout. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ubuntu-24.04 (2-core, 7 GiB RAM, ~14 GiB free disk) is too small for building commit0 images: the builder image pull alone (~2-3 GiB compressed, larger uncompressed) consumes most of the available disk/RAM before the first image even finishes, killing the runner. Switch to ubuntu-latest-8core, the same runner swtbench already uses, which has sufficient disk and RAM for multi-image builds. Restore max_workers=4 since the runner size was the constraint, not concurrency. The between-batch pruning keeps cumulative disk usage bounded. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e runner

Previous failed runs accumulated BuildKit cache on the sticky runner, leaving it with insufficient disk even before the build starts (3m31s failure). Add a preflight prune (--keep-storage 30g) matching the swebench workflow, which clears leftover data from prior runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch _assemble_commit0_image from docker buildx build --push to the same pattern swebench uses: docker build (loads into local daemon) → docker push → docker rmi → docker system prune. docker buildx --push accumulates data in BuildKit's content store which is hard to clean up during concurrent builds and caused runner OOM/disk kills. The local daemon approach frees disk immediately after each push via docker rmi + system prune, keeping disk usage flat across all images. Also remove all the batching/pruning complexity added during debugging — it's no longer needed since cleanup is handled per-image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
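A sketch of the swebench-style local-daemon pattern that replaces docker buildx build --push; the function signature and tag handling are simplified assumptions.

```python
import subprocess

def _assemble_commit0_image(tag: str, context_dir: str) -> None:
    # Build into the local daemon (not buildx --push), then push and
    # immediately free the disk the image layers occupied.
    subprocess.run(["docker", "build", "-t", tag, context_dir], check=True)
    subprocess.run(["docker", "push", tag], check=True)
    subprocess.run(["docker", "rmi", tag], check=False)
    subprocess.run(["docker", "system", "prune", "-f"], check=False)
```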
…690)

DOCKER_BUILDKIT=1 makes docker build use the daemon's embedded BuildKit, which accumulates in /var/lib/docker/buildkit/. Without pruning it, the cache grows unboundedly across images. Add docker builder prune -af --keep-storage 30g after each successful push, matching exactly what swebench's assemble_agent_image does in PR #690. Also add docker builder prune to the preflight step to clear the embedded BuildKit cache from previous runs (in addition to the existing buildx container prune).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The original buildx --push code works fine on a larger runner. The complexity added during debugging is not needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 concurrent buildx builds overwhelm the single BuildKit container. The original value of 4 worked; keep it until the larger runner is confirmed stable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The commit0 Dockerfile was using the deprecated @zed-industries/claude-agent-acp package, while the SDK Dockerfile uses the canonical @agentclientprotocol/claude-agent-acp. This means commit0 ACP images had an incompatible claude-agent-acp (0.23.1, 7 versions behind 0.30.0) that would never receive updates. Switch to the same package and version the SDK uses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root causes
Two separate issues caused commit0 image builds to fail consistently from 2026-04-23.
1. Wrong / deprecated claude-agent-acp package
The commit0 Dockerfile was installing @zed-industries/claude-agent-acp while the SDK Dockerfile uses @agentclientprotocol/claude-agent-acp, the canonical package that @zed-industries/claude-agent-acp was renamed to. This was confirmed by the deprecation warning seen in the build logs. Using the deprecated package meant commit0 ACP images were 7 minor versions behind the SDK (0.23.1 vs 0.30.0) and would never receive future updates, risking protocol incompatibility with the SDK.
2. Unpinned @google/gemini-cli picked up breaking 0.39.0

@google/gemini-cli 0.39.0 was published on 2026-04-23 (same day as the failures). It introduced bundled ripgrep binaries via Node.js SEA, making the package 89 MB unpacked. With no version pin, every cold build silently picked up the new version.

3. Runner too small for cold builds
The ubuntu-24.04 runner (2-core, ~14 GiB free disk) was too constrained for building commit0 images when no cached images existed in the registry. The ubuntu-latest-8core runner (8-core, 31 GiB RAM, 237 GiB free disk) used by swtbench handles this without issue.

Fix
Dockerfile.agent-layer-commit0:

- @zed-industries/claude-agent-acp → @agentclientprotocol/claude-agent-acp@0.30.0 (matches SDK)
- @zed-industries/codex-acp@0.11.1 (unchanged package, pinned for stability)
- @google/gemini-cli@0.38.0 (last known-good version before 0.39.0)

build-commit0-images.yml:

- ubuntu-24.04 to ubuntu-latest-8core
- docker buildx prune + docker builder prune + docker system prune to clear accumulated cache from previous runs on the sticky runner
- MAX_WORKERS: '4' (original value, works correctly on the larger runner)

Validation
Build run 24861799673 completed successfully in ~10 minutes with all 16 lite-split images built and pushed.
AI disclosure
This PR was prepared by Claude Sonnet 4.6 on behalf of @simonrosenberg.