
resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/3]#6630

Open
glightfoot wants to merge 3 commits into moby:master from glightfoot:fix-deadlocks

Conversation


@glightfoot glightfoot commented Mar 30, 2026

Summary

This branch fixes a class of deadlocks in buildkitd that can occur when a client disconnects unexpectedly during registry auth callbacks, and refactors the authorizer middleware locking so unrelated auth flows are not serialized behind slow or hung requests.

We discovered this problem when buildkitd nodes would get stuck and need to be restarted. Goroutine dumps taken while a builder was stuck revealed the deadlock in the auth flow.

We have been running this change in our internal fork for a couple of weeks now, and it appears to have completely resolved the issue. The rest of the changes we made are included in #6631 and #6632.

Problem

When image resolution receives a 401 Unauthorized, BuildKit calls back to the client session (VerifyTokenAuthority) to obtain or verify credentials. If the client silently disappears (half-open TCP connection, blackholed network, abrupt termination), that callback can hang long enough to block resolver progress.

In the wedged state:

  • one auth request hangs waiting on a dead client
  • other requests block on resolver/authorizer lock paths
  • waiters pile up in flightcontrol
  • the daemon appears healthy, but affected builds stop progressing until a restart

Root Cause

Auth/session calls ran on paths that could hold shared synchronization primitives for too long under failure conditions, and detection of dead transports was not bounded tightly enough to catch silent disconnects.

Changes

Deadlock failsafes (add timeouts to break deadlocks)

  • Add bounded timeouts around auth/session-dependent operations so dead client callbacks cannot block indefinitely.
  • Add server-side gRPC keepalive parameters in buildkitd to proactively detect and close dead idle peers.

Authorizer lock refactor (refactor authorizer middleware locking)

  • Remove broad/global authorizer critical sections from Authorize and AddResponses.
  • Scope locking to short in-memory fetcher map operations (fetcherNS/fetcherState).
  • Add per-key deduped fetcher lookup using flightcontrol rather than serializing all hosts/sessions.
  • Update resolver pool GC to avoid holding fetcher locks while doing potentially slow session-manager calls (collect -> validate -> conditional delete).
  • Add regression coverage verifying that unrelated auth requests are not blocked by a slow bearer token fetch (TestAuthorizeDoesNotGloballyBlockOnSlowAuthFetch).

Fixes #6633

Greg Lightfoot added 2 commits March 30, 2026 10:20
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
@glightfoot glightfoot changed the title resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/2] Mar 30, 2026
@glightfoot glightfoot changed the title resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/2] resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/3] Mar 30, 2026
@glightfoot glightfoot marked this pull request as ready for review March 30, 2026 18:44
Member

@tonistiigi tonistiigi left a comment


I'm not convinced this is a minimal fix for the described issue.

grpc.StreamInterceptor(grpcerrors.StreamServerInterceptor),
grpc.MaxRecvMsgSize(defaults.DefaultMaxRecvMsgSize),
grpc.MaxSendMsgSize(defaults.DefaultMaxSendMsgSize),
// Keepalive configuration to detect and close dead client connections
Member


How is this related to the "auth deadlock" if that request is running in a different gRPC server (this is the control server, not the session)?

Author

glightfoot commented Mar 30, 2026

@tonistiigi unfortunately you are correct; this alone wasn't enough. This was the first fix we tried, but it resulted in context deadline exceeded errors on the clients (while reducing deadlocks). The other linked PR, which refactors the session manager, resulted in the complete resolution of our deadlock problems.

We were experiencing the deadlocks very frequently with the main build (tens of times per day). With all the associated PRs, we are now experiencing zero deadlocks.

None of us that worked on these fixes is familiar with the codebase at all, and a lot of this was guided by AI, so I am sure this could be improved. We are happy to help address any feedback and test anything in our environment, so just let me know how we can help.

Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
@tonistiigi
Member

None of us that worked on these fixes is familiar with the codebase at all

In that case, I'd suggest you work on providing runnable reproducers and/or integration tests that fail because of this issue, and leave the actual patches to others.

@glightfoot
Author

@tonistiigi I opened a draft PR with some failing tests for the auth deadlock #6638


Successfully merging this pull request may close these issues.

Deadlocks in auth to repositories
