bridge: add guest-side reconnect loop for live migration#2698
Open
shreyanshjain7174 wants to merge 1 commit into microsoft:main from
Conversation
jterry75
reviewed
Apr 22, 2026
Force-pushed from dbc66f1 to 05c7170
During live migration the vsock connection between the host and the GCS breaks when the VM moves to the destination node. The GCS bridge drops and cannot recover, leaving the guest unable to communicate with the new host.

This adds a reconnect loop in cmd/gcs/main.go that re-dials the bridge after a connection loss. On each iteration a fresh Bridge and Mux are created while the Host state (containers, processes) persists across reconnections.

A Publisher abstraction is added to bridge/publisher.go so that container wait goroutines spawned during CreateContainer can route exit notifications through the current bridge. When the bridge is down between reconnect iterations, notifications are dropped with a warning; the host-side shim re-queries container state after reconnecting.

The defer ordering in ListenAndServe is fixed so that quitChan closes before responseChan becomes invalid, and responseChan is buffered to prevent PublishNotification from panicking on a dead bridge.

Tested with Invoke-FullLmTestCycle on a two-node Hyper-V live migration setup (Node_1 -> Node_2). Migration completes at 100% and container exec works on the destination node after migration.

Signed-off-by: Shreyansh Sancheti <shsancheti@microsoft.com>
Force-pushed from 05c7170 to 5fafdf4
jterry75
reviewed
Apr 24, 2026
```go
// Publisher is a stable notification sink that survives bridge recreation
// during live migration.
Publisher *Publisher
```
jterry75
reviewed
Apr 24, 2026
```go
func (p *Publisher) Publish(n *prot.ContainerNotification) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.b == nil {
```
Contributor
Can you just use a queue here? On Publish, if the bridge is nil, append; otherwise drain the queue and publish this one. On SetBridge, drain.
Contributor
If we drop exit events, the shim will never learn that state, I don't think.
jterry75
requested changes
Apr 24, 2026
Fixes #2669
Problem
During live migration the vsock connection between the host and the GCS (Guest Compute Service) breaks when the UVM moves to the destination node. The bridge inside the GCS drops and cannot recover: `ListenAndServe` returns with an I/O error, and the GCS has no way to re-establish communication with the new host.

What this does
Wraps the bridge serve call in a reconnect loop in `cmd/gcs/main.go`. When the vsock connection drops, the GCS re-dials the host and calls `ListenAndServe` again on the same Bridge. `ListenAndServe` already creates fresh channels (`responseChan`, `quitChan`) on each call, so the Bridge can be reused across reconnections without resetting any state.

The `Host` (containers, processes, cgroups) persists across reconnections since it lives outside the Bridge.

A `Publisher` is added so that container wait goroutines — spawned during `CreateContainer` and blocked on `c.Wait()` — can route exit notifications through whichever bridge is currently active. During the reconnect gap the notification is dropped, which is safe because the host-side shim re-queries container state after reconnecting.

Design
No mutating RPCs (`CreateContainer`, `ExecProcess`, etc.) are in-flight when migration starts — the LM orchestrator ensures all container setup is complete before initiating migration. The only long-lived handler goroutine during migration is `waitOnProcessV2`, which is blocked on `select { case exitCode := <-exitCodeChan }` and doesn't touch `responseChan` until the process exits (through the Publisher). This means the Bridge can be safely reused across `ListenAndServe` calls without risk of handler goroutines racing on channel state.

During live migration the VM is frozen and only wakes up when the destination host shim is ready, so the vsock port should be immediately available. The reconnect loop uses a tight 100ms retry interval rather than exponential backoff.
The defer ordering in `ListenAndServe` is fixed so `quitChan` closes before `responseChan` becomes invalid, and `responseChan` is buffered to prevent `PublishNotification` from blocking on a dead bridge.

Changes
- `cmd/gcs/main.go`
- `internal/guest/bridge/bridge.go`: `Publisher` field, `ShutdownRequested()`, fixed defer ordering, buffered `responseChan`, priority select guard in `PublishNotification`
- `internal/guest/bridge/bridge_v2.go`: `Publisher.Publish()`
- `internal/guest/bridge/publisher.go`
- `internal/guest/bridge/publisher_test.go`
Tested on a two-node Hyper-V live migration setup using the `TwoNodeInfra` test module:

- `Invoke-FullLmTestCycle -Verbose` — deploys LM agents, creates a UVM with an LCOW container on Node_1, migrates to Node_2, verifies 100% completion on both nodes. Container `lcow-test` migrated with pod sandbox intact.
- `crictl exec` — created an LCOW pod with our custom GCS (deployed via `rootfs.vhd`), started a container, exec'd `cat /tmp/test.txt` to verify bridge communication works after reconnect.
- `go build`, `go vet`, `gofmt` clean.