resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/3]#6630
resolver/auth: prevent deadlocks on disconnected clients and narrow authorizer lock scope [1/3]#6630glightfoot wants to merge 3 commits intomoby:masterfrom
Conversation
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
tonistiigi
left a comment
There was a problem hiding this comment.
I'm not convinced this is a minimal fix for the described issue.
| grpc.StreamInterceptor(grpcerrors.StreamServerInterceptor), | ||
| grpc.MaxRecvMsgSize(defaults.DefaultMaxRecvMsgSize), | ||
| grpc.MaxSendMsgSize(defaults.DefaultMaxSendMsgSize), | ||
| // Keepalive configuration to detect and close dead client connections |
There was a problem hiding this comment.
How is this related to "auth deadlock" if that request is running in a different grpc server (this is control server, not session).
|
@tonistiigi unfortunately you are correct as this alone wasn't enough. This was the first fix we tried, but ended up with context deadline exceeded errors on the clients (but reduced deadlocks). The other linked PR that refactors the session manager resulted in the complete resolution of our deadlock problems. We were experiencing the deadlocks very frequently with the main build (10's of times per day). With all the associated PR's, we are now experiencing zero deadlocks. None of us that worked on these fixes is familiar with the codebase at all, and a lot of this was guided by AI, so I am sure this could be improved. We are happy to help address any feedback and test anything in our environment, so just let me know how we can help. |
Signed-off-by: Greg Lightfoot <greg.lightfoot@reddit.com>
In that case, I'd suggest you to work on providing runnable reproducers and/or integration tests that fail because of this issue and leave the actual patches to others. |
|
@tonistiigi I opened a draft PR with some failing tests for the auth deadlock #6638 |
Summary
This branch fixes a deadlock class in
buildkitdthat can occur when a client disconnects unexpectedly during registry auth callbacks, and refactors authorizer middleware locking so unrelated auth flows are not serialized behind slow/hung requests.We discovered this problem when buildkit nodes would get stuck and need to be restarted. We took goroutine dumps of the builder while stuck and discovered the deadlock in the auth flow.
We have been running this change in our internal fork for a couple weeks now and it seems to have completely resolved the issue. The rest of the changes we made are included in #6631 and #6632
Problem
When image resolution gets
401 Unauthorized, BuildKit calls back to the client session (VerifyTokenAuthority) to obtain/verify credentials. If the client silently disappears (half-open TCP, blackholed network, abrupt termination), that callback can hang long enough to block resolver progress.In the wedged state:
flightcontrolRoot Cause
Auth/session calls were in paths that could hold shared synchronization for too long under failure conditions, and dead transport detection was not bounded quickly enough for silent disconnects.
Changes
Deadlock failsafes (
add timeouts to break deadlocks)buildkitdto proactively detect and close dead idle peers.Authorizer lock refactor (
refactor authorizer middleware locking)AuthorizeandAddResponses.fetcherNS/fetcherState).flightcontrolrather than serializing all hosts/sessions.TestAuthorizeDoesNotGloballyBlockOnSlowAuthFetch)Fixes #6633