
fix(commit0): fix ACP packages and switch to ubuntu-latest-8core runner #693

Merged

simonrosenberg merged 13 commits into main from fix/commit0-pin-acp-package-versions on Apr 23, 2026

Conversation


@simonrosenberg simonrosenberg commented Apr 23, 2026

Root causes

Three separate issues caused commit0 image builds to fail consistently starting 2026-04-23.

1. Wrong / deprecated claude-agent-acp package

The commit0 Dockerfile was installing @zed-industries/claude-agent-acp while the SDK Dockerfile uses @agentclientprotocol/claude-agent-acp — the canonical package that @zed-industries/claude-agent-acp was renamed to. This was confirmed by the deprecation warning seen in build logs:

npm warn deprecated @zed-industries/claude-agent-acp@0.23.1: This package has been renamed to @agentclientprotocol/claude-agent-acp. Please migrate to continue receiving updates.

Using the deprecated package meant commit0 ACP images were 7 minor versions behind the SDK (0.23.1 vs 0.30.0) and would never receive future updates, risking protocol incompatibility with the SDK.

2. Unpinned @google/gemini-cli picked up breaking 0.39.0

@google/gemini-cli 0.39.0 was published on 2026-04-23 (same day as the failures). It introduced bundled ripgrep binaries via Node.js SEA, making the package 89 MB unpacked. With no version pin, every cold build silently picked up the new version.

3. Runner too small for cold builds

The ubuntu-24.04 runner (2-core, ~14 GiB free disk) was too constrained for building commit0 images when no cached images existed in the registry. The ubuntu-latest-8core runner (8-core, 31 GiB RAM, 237 GiB free disk) used by swtbench handles this without issue.

Fix

Dockerfile.agent-layer-commit0:

  • Switch @zed-industries/claude-agent-acp → @agentclientprotocol/claude-agent-acp@0.30.0 (matches SDK)
  • Pin @zed-industries/codex-acp@0.11.1 (unchanged package, pinned for stability)
  • Pin @google/gemini-cli@0.38.0 (last known-good version before 0.39.0)
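Taken together, the pins above would look roughly like this in the Dockerfile (a sketch only — the real Dockerfile.agent-layer-commit0 may split the installs across layers or use different npm flags):

```dockerfile
# Sketch: pin all three ACP CLIs so cold builds can't silently pick up
# a new release (versions taken from the PR description).
RUN npm install -g \
    @agentclientprotocol/claude-agent-acp@0.30.0 \
    @zed-industries/codex-acp@0.11.1 \
    @google/gemini-cli@0.38.0
```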

build-commit0-images.yml:

  • Switch runner from ubuntu-24.04 to ubuntu-latest-8core
  • Add preflight docker buildx prune + docker builder prune + docker system prune to clear accumulated cache from previous runs on the sticky runner
  • Keep MAX_WORKERS: '4' (original value, works correctly on the larger runner)
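The preflight cleanup could be sketched as a workflow step like the following (step name and exact flags are assumptions; `--keep-storage 30g` matches the value mentioned later in the commit history):

```yaml
# Sketch of a preflight cleanup step for the sticky runner; names are
# illustrative, not copied from the actual workflow file.
- name: Prune leftover build cache
  run: |
    docker buildx prune -af --keep-storage 30g || true
    docker builder prune -af --keep-storage 30g || true
    docker system prune -f || true
```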

Validation

Build run 24861799673 completed successfully in ~10 minutes with all 16 lite-split images built and pushed.

AI disclosure

This PR was prepared by Claude Sonnet 4.6 on behalf of @simonrosenberg.

@google/gemini-cli 0.39.0 was published on 2026-04-23 (the same day
commit0 image builds started failing). Builds previously relied on
unpinned npm install of all three ACP CLIs, so any breaking release
would silently become the new default.

Pin to the last known-good versions:
- @zed-industries/claude-agent-acp@0.23.1  (published 2026-03-26)
- @zed-industries/codex-acp@0.11.1         (published 2026-03-31)
- @google/gemini-cli@0.38.0               (published 2026-04-12)

The Dockerfile content hash in agent_layer_content_hash() will change,
so existing registry images are automatically invalidated and rebuilt
on the next eval run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
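The invalidation mechanism can be illustrated with a minimal sketch; the real agent_layer_content_hash may hash more inputs and use a different digest length:

```python
import hashlib

def agent_layer_content_hash(dockerfile_text: str) -> str:
    """Short, stable tag suffix derived from the Dockerfile contents.

    Any edit to the Dockerfile (such as pinning a package version)
    changes the hash, so previously pushed registry images no longer
    match and are rebuilt on the next eval run.
    """
    return hashlib.sha256(dockerfile_text.encode("utf-8")).hexdigest()[:12]

unpinned = "RUN npm install -g @google/gemini-cli\n"
pinned = "RUN npm install -g @google/gemini-cli@0.38.0\n"
```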
Collaborator

@all-hands-bot all-hands-bot left a comment


🟡 Acceptable - Simple, focused fix that directly addresses the root cause.

The version pinning approach is correct. One gap: PR description lacks concrete evidence that the build succeeds with these versions.

Comment thread benchmarks/utils/Dockerfile.agent-layer-commit0
@all-hands-bot
Collaborator

[RISK ASSESSMENT]

⚠️ Risk Assessment: 🟢 LOW

This is a straightforward dependency pinning change with minimal risk:

Low-risk factors:

  • No logic changes, only version constraints added
  • Pins to known-stable versions (documented with publication dates)
  • Automatically invalidates cache via content hash, forcing clean rebuild
  • Solves a real production issue (broken builds)
  • Single-line change per package, easy to verify and rollback

⚠️ Minor consideration:

  • Missing build evidence in PR description (no logs showing successful build with pinned versions)
  • However, risk of merging is low since this can't make things worse than the current broken state

Recommendation: Safe to merge. The fix is correct and minimal. Consider adding build success evidence for documentation purposes, but not blocking.

Debug Agent and others added 12 commits April 23, 2026 11:08
The build has been failing at ~8 min with no logs (all post-build steps
show empty conclusion, including if:always() archive). This means the
runner is killed before any Python-level output can flush.

Three changes to surface the actual error:

1. _assemble_commit0_image: replace run_docker_build_layer (which
   buffers all docker output via capture_output=True) with a direct
   subprocess.run(cmd) call. ProcessPoolExecutor workers inherit fd 1/2
   from the parent, so without capture_output the docker build streams
   directly to the GH Actions log in real-time — visible even if the
   runner is subsequently killed.

2. Disk space logging via os.write(2, ...) before/after each image build
   and in the main assembly loop. os.write bypasses capture_output's
   Python-level redirect so it always reaches the GH Actions log.

3. Workflow: add a pre-flight "disk and Docker status" step (df -h,
   free -h, docker system df) before the build starts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
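The distinction between a Python-level redirect and the OS-level file descriptor can be demonstrated in isolation (a sketch, unrelated to the actual build code):

```python
import contextlib
import io
import os
import sys

buf = io.StringIO()
with contextlib.redirect_stderr(buf):
    # print() goes through sys.stderr, which redirect_stderr swaps out.
    print("python-level", file=sys.stderr)
    # os.write(2, ...) targets file descriptor 2 directly, so a
    # Python-level redirect cannot intercept it.
    os.write(2, b"fd-level\n")
```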
ubuntu-24.04 runners have ~14 GiB free disk. Building 16 commit0 images
cold (each with a large /agent-server layer) fills the BuildKit content
store during the export/push phase and kills the runner — same root cause
as swtbench/swebench (fixed in PR #690), but on a smaller runner.

Two prune points:
1. Pre-assembly: docker buildx prune -af before starting the image loop,
   clearing cache from the builder-image build phase.
2. Post-push: after each successful image push, run docker rmi + system
   prune + builder prune --keep-storage 8g to prevent cumulative disk
   exhaustion across 16 sequential+concurrent builds.

The npm version pinning (previous commit) was also necessary — it fixed
the earlier 8-minute failure — but the disk cleanup is needed to get
all 16 images through the export phase on a 14 GiB runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous approach (per-image prune from worker processes + pre-assembly
buildx prune) had two problems:
1. docker buildx prune -af before assembly cleared the builder image from
   BuildKit cache, forcing all 4 workers to re-pull ~2 GiB simultaneously
   → immediate disk spike → runner killed at ~9 min.
2. docker builder prune from concurrent workers races against sibling
   builds that still need the cache being pruned.

Fix: process images in batches of max_workers. All workers in a batch
finish before the next starts. The main process then prunes (docker system
prune -f + docker builder prune --keep-storage 8g) safely, with no active
builds competing for the cache. Shared layers (builder, Node.js, npm,
/agent-server) stay within the 8 GiB keep-storage budget and are reused
across batches; only the per-image base layer is re-pulled each batch.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
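The batch-then-prune pattern described above can be sketched as follows. The real code uses ProcessPoolExecutor and docker commands; threads and stand-in functions keep this sketch self-contained and runnable:

```python
from concurrent.futures import ThreadPoolExecutor

prune_calls = []

def build_image(name: str) -> str:
    """Stand-in for one docker build + push."""
    return f"built:{name}"

def prune() -> None:
    """Stand-in for docker system prune + builder prune --keep-storage 8g."""
    prune_calls.append(len(prune_calls))

def assemble_in_batches(images, max_workers=4):
    results = []
    for start in range(0, len(images), max_workers):
        batch = images[start:start + max_workers]
        # Every worker in the batch finishes before we leave the `with`
        # block, so the prune below never races an in-flight build.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results.extend(pool.map(build_image, batch))
        prune()
    return results
```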
ubuntu-24.04 runners have ~14 GiB free disk. With max_workers=4,
a batch of 4 concurrent builds peaks at ~12 GiB (4 base images ×
~1.5 GiB uncompressed + ~5 GiB shared layers) before the batch
completes and the between-batch prune can run — leaving no headroom
and killing the runner.

With max_workers=1, peak disk per image is ~6 GiB (1 base image +
shared layers), well within the 14 GiB limit. Shared layers (builder,
Node.js, npm, /agent-server) stay cached at ≤8 GiB between images;
only the per-image base layer (~300 MB compressed) is re-pulled each
time. 16 lite-split images complete in ~60-70 min, well within the
600-min timeout.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ubuntu-24.04 (2-core, 7 GiB RAM, ~14 GiB free disk) is too small for
building commit0 images: the builder image pull alone (~2-3 GiB compressed,
larger uncompressed) consumes most of the available disk/RAM before the
first image even finishes, killing the runner.

Switch to ubuntu-latest-8core, the same runner swtbench already uses,
which has sufficient disk and RAM for multi-image builds. Restore
max_workers=4 since the runner size was the constraint, not concurrency.
The between-batch pruning keeps cumulative disk usage bounded.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e runner

Previous failed runs accumulated BuildKit cache on the sticky runner,
leaving it with insufficient disk even before the build starts (3m31s
failure). Add a preflight prune (--keep-storage 30g) matching the
swebench workflow, which clears leftover data from prior runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Switch _assemble_commit0_image from docker buildx build --push to the
same pattern swebench uses: docker build (loads into local daemon) →
docker push → docker rmi → docker system prune.

docker buildx --push accumulates data in BuildKit's content store which
is hard to clean up during concurrent builds and caused runner OOM/disk
kills. The local daemon approach frees disk immediately after each push
via docker rmi + system prune, keeping disk usage flat across all images.

Also remove all the batching/pruning complexity added during debugging —
it's no longer needed since cleanup is handled per-image.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…690)

DOCKER_BUILDKIT=1 makes docker build use the daemon's embedded BuildKit,
which accumulates in /var/lib/docker/buildkit/. Without pruning it, the
cache grows unboundedly across images.

Add docker builder prune -af --keep-storage 30g after each successful
push, matching exactly what swebench's assemble_agent_image does in
PR #690. Also add docker builder prune to the preflight step to clear
the embedded BuildKit cache from previous runs (in addition to the
existing buildx container prune).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The original buildx --push code works fine on a larger runner.
The complexity added during debugging is not needed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8 concurrent buildx builds overwhelm the single BuildKit container.
The original value of 4 worked; keep it until the larger runner is
confirmed stable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The commit0 Dockerfile was using the deprecated @zed-industries/
claude-agent-acp package, while the SDK Dockerfile uses the canonical
@agentclientprotocol/claude-agent-acp. This means commit0 ACP images
had an incompatible claude-agent-acp (0.23.1, 7 versions behind 0.30.0)
that will never receive updates.

Switch to the same package and version the SDK uses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg changed the title from "fix(commit0): pin ACP npm package versions in assembly Dockerfile" to "fix(commit0): fix ACP packages and switch to ubuntu-latest-8core runner" on Apr 23, 2026
Collaborator

@juanmichelini juanmichelini left a comment


LGTM, thanks!

@simonrosenberg simonrosenberg merged commit 523d7f3 into main Apr 23, 2026
3 checks passed
@simonrosenberg simonrosenberg deleted the fix/commit0-pin-acp-package-versions branch April 23, 2026 22:50