bridge: add guest-side reconnect loop for live migration#2698
Open
shreyanshjain7174 wants to merge 1 commit into microsoft:main from
Conversation
jterry75
reviewed
Apr 22, 2026
Force-pushed from dbc66f1 to 05c7170
During live migration the vsock connection between the host and the GCS breaks when the VM moves to the destination node. The GCS bridge drops and cannot recover, leaving the guest unable to communicate with the new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge after a connection loss. On each iteration a fresh Bridge and Mux are created while the Host state (containers, processes) persists across reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container wait goroutines spawned during CreateContainer can route exit notifications through the current bridge. When the bridge is down between reconnect iterations, notifications are dropped with a warning; the host-side shim re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes before responseChan becomes invalid, and responseChan is buffered to prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration setup (Node_1 -> Node_2). Migration completes at 100% and container exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
Force-pushed from 05c7170 to 5fafdf4
jterry75
reviewed
Apr 24, 2026
```go
// Publisher is a stable notification sink that survives bridge recreation
// during live migration.
Publisher *Publisher
```
jterry75
reviewed
Apr 24, 2026
```go
func (p *Publisher) Publish(n *prot.ContainerNotification) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.b == nil {
```
Contributor
Can you just use a queue here? On Publish, if the bridge is nil, append; otherwise drain the queue and publish this one. On SetBridge, drain.
Contributor
If we drop exit events, the shim will never learn that state, I don't think.
jterry75
requested changes
Apr 24, 2026
Fixes #2669
Problem
During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover: `ListenAndServe` returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does
Wraps the bridge serve call in a reconnect loop in `cmd/gcs/main.go`. When the vsock connection drops, the GCS re-dials the host and calls `ListenAndServe` again on the same Bridge. `ListenAndServe` already creates fresh channels (`responseChan`, `quitChan`) on each call, so the Bridge can be reused across reconnections without resetting any state.

The `Host` (containers, processes, cgroups) persists across reconnections since it lives outside the Bridge.

A `Publisher` is added so that container wait goroutines — spawned during `CreateContainer` and blocked on `c.Wait()` — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.

Design
No mutating RPCs (`CreateContainer`, `ExecProcess`, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is `waitOnProcessV2`, which is blocked on `select { case exitCode := <-exitCodeChan }` and doesn't touch `responseChan` until the process exits (through the Publisher). This means the Bridge can be safely reused across `ListenAndServe` calls without risk of handler goroutines racing on channel state.

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.
The defer ordering in `ListenAndServe` is fixed so `quitChan` closes before `responseChan` becomes invalid, and `responseChan` is buffered to prevent `PublishNotification` from blocking on a dead bridge.

Changes
- `cmd/gcs/main.go`
- `internal/guest/bridge/bridge.go`: `Publisher` field, `ShutdownRequested()`, fixed defer ordering, buffered `responseChan`, priority select guard in `PublishNotification`
- `internal/guest/bridge/bridge_v2.go`: `Publisher.Publish()`
- `internal/guest/bridge/publisher.go`
- `internal/guest/bridge/publisher_test.go`
Tested on a two-node Hyper-V live migration setup using the `TwoNodeInfra` test module:

- `Invoke-FullLmTestCycle -Verbose` — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container `lcow-test` migrated with pod sandbox intact.
- `crictl exec` — created an LCOW pod with our custom GCS (deployed via `rootfs.vhd`), started a container, exec'd `cat /tmp/test.txt` to verify bridge communication works after reconnect.
- `go build`, `go vet`, `gofmt` clean.