
feat: node recovery agent #28

Open
hlts2 wants to merge 40 commits into main from feat/node-recovery

Conversation

hlts2 (Member) commented Apr 13, 2026

Summary

  • Add recovery state machine (Healthy → Unhealthy → WaitingReboot) with PhaseUnknown as zero value
  • Add pkg/health: HealthChecker interface with per-checker thresholds and reason reporting
    • NodeReadyChecker (5min), GPUChecker (10min), DiskPressureChecker (30min), CiliumChecker (10min)
    • GPU checker reads expected count from nvidia.com/gpu.count label vs allocatable (no static env var)
    • Cilium checker monitors NetworkUnavailable condition with CiliumIsUp reason (auto-skips non-Cilium CNI)
    • Check() returns (bool, string) — reason is used as metrics label for observability
  • Add pkg/operation: Executor interface with CivoExecutor (Functional Option Pattern) and NopExecutor default
  • Add pkg/metrics: Prometheus metrics (health checks with reason, recovery actions, phase, unhealthy duration)
  • Add pkg/watcher/state: NodePhase enum, NodeState (private fields + getters), StateStore with transition methods
  • Refactor watcher: replace API polling with Node Informer, state machine reconcile loop
  • Differentiate reboot wait: standard nodes 10min, GPU nodes 40min
  • Support multiple node pool IDs (comma-separated) or empty (all nodes)
  • Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking, CIVO_NODE_DESIRED_GPU_COUNT

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `CIVO_NODE_AGENT_MONITOR_ONLY` | `true` | Monitor-only mode (log but don't reboot) |
| `CIVO_NODE_AGENT_METRICS_PORT` | `9625` | Prometheus metrics HTTP port (1024-65535) |
| `CIVO_NODE_POOL_ID` | empty | Comma-separated node pool IDs (empty = all nodes) |
| `CIVO_NODE_REBOOT_WAIT_MINUTES` | `10` | Reboot wait time for standard nodes |
| `CIVO_GPU_NODE_REBOOT_WAIT_MINUTES` | `40` | Reboot wait time for GPU nodes |

Command Line Flags

| Flag | Default | Description |
| --- | --- | --- |
| `--kubeconfig` | `/etc/rancher/k3s/k3s.yaml` | Path to kubeconfig file (empty for in-cluster config) |
| `--version` | | Print the agent version |

Design Decisions

  • Monitor-only mode (default true): logs recovery actions without executing them
  • Per-checker thresholds: each checker defines its own unhealthy threshold; reboot triggers when any failed checker exceeds its threshold
  • GPU detection via label: nvidia.com/gpu.count (static, set by GFD) vs allocatable (dynamic); health.HasGPU also uses label to correctly identify GPU nodes even when all GPUs are unhealthy
  • NopExecutor as default: prevents nil pointer dereference when no executor is configured
  • All package-boundary types accessed via interfaces: structs are private, constructors return interfaces
  • NodeState fields are private: mutations only through StateStore transition methods
  • Test-only options are unexported: withNowFunc, withNodeLister
  • Helm chart changes deferred to a separate PR

Not in This PR

  • Standard node Drain → Replace flow (TODO in code)
  • HeartbeatChecker
  • Checker enable/disable via env var
  • Per-checker RepairAction (None/Reboot/Replace)

Test Plan

  • go test ./... passes (all 4 packages)
  • go vet ./... clean
  • State machine transitions covered: healthy, unhealthy detection, reboot trigger (active/monitor), recovery, reboot retry, GPU mismatch, error handling, stale cleanup
  • Coverage: health 98%, operation 81%, watcher 80%, total 76%
  • Verify curl localhost:9625/metrics returns Prometheus metrics in a test cluster
  • Verify CIVO_NODE_AGENT_MONITOR_ONLY=true logs but does not reboot
  • Verify CIVO_NODE_AGENT_MONITOR_ONLY=false triggers actual HardRebootInstance API call

hlts2 changed the title from "feat: node recovery state machine MVP" to "feat: node recovery state machine" on Apr 13, 2026
hlts2 and others added 14 commits April 14, 2026 03:52
… and metrics

Introduce a proper recovery state machine (Healthy → Unhealthy → WaitingReboot)
with pluggable health checkers, Prometheus metrics, and monitor-only mode support.

- Add pkg/health: HealthChecker interface with NodeReady and GPU checkers
- Add pkg/operation: Executor interface with CivoExecutor (FOP-based)
- Add pkg/metrics: Prometheus metrics (health checks, recovery actions, phase, duration)
- Add pkg/watcher/state: NodePhase enum, NodeState (private fields), StateStore
- Refactor watcher: replace polling with Node Informer, state machine reconcile loop
- Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking
- Update main.go: new env vars (CIVO_NODE_MONITOR_ONLY, CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES, CIVO_NODE_METRICS_PORT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- Rename reboot.go to civo.go to better reflect the Civo executor scope
- Extract FOP (Option, WithAPIConfig, WithClient) into options.go
- Add validation for clusterID, apiKey, and apiURL before Civo client creation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add Threshold() to HealthChecker interface for per-checker thresholds
- NodeReady: 10min, GPU: 10min, DiskPressure: 30min
- Remove single unhealthyThreshold from watcher; use min threshold of failed checkers
- Remove WithUnhealthyThresholdMinutes option and CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES env var
- Refactor main.go: parseUintOrZero, defaultMetricsPort as int

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssages in NewWatcher

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…config flag

- Remove nodeDesiredGPUCount field, WithDesiredGPUCount option (GPU count is owned by checker)
- Remove NewGPUChecker public constructor (only used internally by NewDefaultCheckers)
- Add --kubeconfig flag with default /etc/rancher/k3s/k3s.yaml for CP VM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Change NewWatcher signature: nodePoolID string → nodePoolIDs []string
- Add buildNodeSelector: empty=all nodes, single=MatchLabels, multiple=In operator
- Add parseNodePoolIDs in main.go for comma-separated CIVO_NODE_POOL_ID

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… var

- GPU checker now compares nvidia.com/gpu.count label (expected) vs allocatable (actual)
- Auto-skips non-GPU nodes when label is absent
- Remove CIVO_NODE_DESIRED_GPU_COUNT env var and NewDefaultCheckers parameter
- NewDefaultCheckers() always includes GPU checker (self-determines GPU node)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- NewWatcher signature: (ctx, clusterID, nodePoolIDs, opts...) → (ctx, clusterID, opts...)
- Add WithNodePoolIDs option with append semantics (empty is no-op)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
clusterID was only used by the old rebootNode() which moved to the executor.
NewWatcher signature simplified: (ctx, clusterID, opts...) → (ctx, opts...)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The label is static (set by GFD) and correctly identifies GPU nodes
even when all GPUs are unhealthy and allocatable drops to 0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ttings

- CIVO_NODE_MONITOR_ONLY → CIVO_NODE_AGENT_MONITOR_ONLY
- CIVO_NODE_METRICS_PORT → CIVO_NODE_AGENT_METRICS_PORT

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 force-pushed the feat/node-recovery branch from d8bc10b to 3204673 on April 13, 2026 18:55
hlts2 and others added 5 commits April 14, 2026 04:20
- Standard nodes: reboot wait 10min (CIVO_NODE_REBOOT_WAIT_MINUTES)
- GPU nodes: reboot wait 40min (CIVO_GPU_NODE_REBOOT_WAIT_MINUTES)
- NodeReady threshold changed from 10min to 5min per TDD
- Replace single rebootTimeWindowMinutes with rebootWaitMinutes + gpuRebootWaitMinutes
- WaitingReboot phase checks state.IsGPUNode() to select appropriate wait time

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithNodePoolIDs now accepts a comma-separated string and parses internally.
Remove parseNodePoolIDs from main.go.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Standard nodes should transition to PhaseDrain → PhaseReplace
instead of retrying reboot indefinitely. GPU nodes keep reboot-only.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
WithMonitorOnly now accepts a string and parses internally via strconv.ParseBool.
Empty or unparsable values preserve the default (true).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 changed the title from "feat: node recovery state machine" to "feat: node recovery state machine MVP" on Apr 13, 2026
hlts2 and others added 6 commits April 14, 2026 10:30
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Monitors CiliumAgentIsReady node condition. Auto-skips nodes where
the condition is absent (Cilium not installed). Threshold: 10min.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…um checker

Cilium sets NetworkUnavailable=False with reason CiliumIsUp, not a custom
CiliumAgentIsReady condition. Skip if reason is not CiliumIsUp (other CNI).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Check now returns (bool, string) where the string is the reason.
Condition-based checkers pass through cond.Reason directly.
GPU checker returns a descriptive reason (e.g. "expected 8 but got 7").
The reason is used as the result label in HealthCheckTotal metric.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU node detection belongs in the health package where gpuCountLabel
is defined. Removes strconv and corev1 imports from watcher.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 changed the title from "feat: node recovery state machine MVP" to "feat: node recovery agent" on Apr 14, 2026
…odeSelector

- Health: Threshold tests for all checkers, HasGPU tests
- Operation: NewCivoExecutor validation tests (empty clusterID, apiKey, apiURL)
- Watcher: buildNodeSelector tests (nil, single, multiple)
- Coverage: health 84→98%, operation 50→81%, total 71→76%

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hlts2 and others added 8 commits April 14, 2026 15:11
Each checker now defines its threshold as a named constant
(nodeReadyThreshold, gpuThreshold, diskPressureThreshold, ciliumThreshold).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add NewNopExecutor() that performs no operations
- Set as default in defaultOptions
- Add nil check in WithExecutor to preserve default when nil is passed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…abelSelector

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These options are only used within the watcher package tests,
so they don't need to be exported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Rename 'name' to 'description' in test struct fields
- Use verb-driven descriptions ("returns", "detects", "skips", etc.)
- Use 'test' instead of 'tt' in range loops
- Initialize mocks inside t.Run for isolation (operation tests)
- Use test.description in t.Run calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use if-init statement to call Phase() once.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hlts2 hlts2 marked this pull request as ready for review April 14, 2026 18:59
@hlts2 hlts2 self-assigned this Apr 14, 2026
hlts2 and others added 6 commits April 15, 2026 04:16
- Metrics server now uses http.Server with Shutdown() on context cancellation
- Ignores http.ErrServerClosed on normal shutdown
- Fix WithKubernetesClientConfigPath godoc (was copy of WithKubernetesClient)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cfg may be nil when BuildConfigFromFlags fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…y metrics

Replace dynamic formatted reasons (e.g. "Expected 8 but got 7") with
enumerable constants: GPUCountMatch, GPUCountMismatch, NoAllocatableGPU, NonGPUNode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a node is removed from the cluster, Cleanup only deleted the
state entry but left gauge metrics (recovery_phase, unhealthy_duration)
with stale values. Now watcher.run() deletes all metric labels for
removed nodes before calling Cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
FormatLabelSelector(nil) returns "<none>" which is an invalid selector.
Only apply the label selector option when nodeLabelSelector is non-nil.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>