… and metrics

Introduce a proper recovery state machine (Healthy → Unhealthy → WaitingReboot) with pluggable health checkers, Prometheus metrics, and monitor-only mode support.

- Add pkg/health: HealthChecker interface with NodeReady and GPU checkers
- Add pkg/operation: Executor interface with CivoExecutor (FOP-based)
- Add pkg/metrics: Prometheus metrics (health checks, recovery actions, phase, duration)
- Add pkg/watcher/state: NodePhase enum, NodeState (private fields), StateStore
- Refactor watcher: replace polling with Node Informer, state machine reconcile loop
- Remove legacy code: fake.go, inline reboot/check functions, sync.Map tracking
- Update main.go: new env vars (CIVO_NODE_MONITOR_ONLY, CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES, CIVO_NODE_METRICS_PORT)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
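The phases named above can be sketched as a small Go enum with one reconcile step. This is a minimal stand-in, not the actual pkg/watcher/state code: the `nextPhase` transition logic is an assumption about how a Healthy → Unhealthy → WaitingReboot machine would advance, and `PhaseUnknown` as the zero value follows the PR summary.

```go
package main

import "fmt"

// NodePhase mirrors the recovery phases named in the commit message
// (sketch only; the real enum lives in pkg/watcher/state).
type NodePhase int

const (
	PhaseUnknown NodePhase = iota // zero value
	PhaseHealthy
	PhaseUnhealthy
	PhaseWaitingReboot
)

func (p NodePhase) String() string {
	switch p {
	case PhaseHealthy:
		return "Healthy"
	case PhaseUnhealthy:
		return "Unhealthy"
	case PhaseWaitingReboot:
		return "WaitingReboot"
	default:
		return "Unknown"
	}
}

// nextPhase sketches one reconcile step (assumed logic, not the PR's code):
// a healthy report returns the node to Healthy; an unhealthy report moves it
// to Unhealthy, then to WaitingReboot once the threshold is exceeded.
func nextPhase(cur NodePhase, healthy, thresholdExceeded bool) NodePhase {
	switch {
	case healthy:
		return PhaseHealthy
	case cur == PhaseHealthy || cur == PhaseUnknown:
		return PhaseUnhealthy
	case cur == PhaseUnhealthy && thresholdExceeded:
		return PhaseWaitingReboot
	default:
		return cur
	}
}

func main() {
	p := nextPhase(PhaseHealthy, false, false)
	fmt.Println(p) // Unhealthy
	fmt.Println(nextPhase(p, false, true)) // WaitingReboot
}
```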
…ation

- Rename reboot.go to civo.go to better reflect the Civo executor scope
- Extract FOP (Option, WithAPIConfig, WithClient) into options.go
- Add validation for clusterID, apiKey, and apiURL before Civo client creation
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>
- Add Threshold() to HealthChecker interface for per-checker thresholds
- NodeReady: 10min, GPU: 10min, DiskPressure: 30min
- Remove single unhealthyThreshold from watcher; use min threshold of failed checkers
- Remove WithUnhealthyThresholdMinutes option and CIVO_NODE_UNHEALTHY_THRESHOLD_MINUTES env var
- Refactor main.go: parseUintOrZero, defaultMetricsPort as int
…ssages in NewWatcher
…config flag

- Remove nodeDesiredGPUCount field and WithDesiredGPUCount option (GPU count is owned by the checker)
- Remove NewGPUChecker public constructor (only used internally by NewDefaultCheckers)
- Add --kubeconfig flag with default /etc/rancher/k3s/k3s.yaml for the CP VM
- Change NewWatcher signature: nodePoolID string → nodePoolIDs []string
- Add buildNodeSelector: empty=all nodes, single=MatchLabels, multiple=In operator
- Add parseNodePoolIDs in main.go for comma-separated CIVO_NODE_POOL_ID
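The selector rule (empty = all nodes, single = equality match, multiple = set-based In) can be sketched in string-selector form. This is a simplified stand-in: the real code builds a metav1.LabelSelector, and the `poolLabel` key here is a placeholder, not the actual label used by the agent.

```go
package main

import (
	"fmt"
	"strings"
)

// poolLabel is a placeholder key; the commit message does not show the
// actual node-pool label used by the watcher.
const poolLabel = "node-pool-id"

// buildNodeSelector sketches the rule described above: no pool IDs matches
// all nodes, one ID becomes an equality match, several become an "in" set.
func buildNodeSelector(poolIDs []string) string {
	switch len(poolIDs) {
	case 0:
		return "" // empty selector: all nodes
	case 1:
		return poolLabel + "=" + poolIDs[0]
	default:
		return fmt.Sprintf("%s in (%s)", poolLabel, strings.Join(poolIDs, ","))
	}
}

func main() {
	fmt.Println(buildNodeSelector(nil))                // ""
	fmt.Println(buildNodeSelector([]string{"a"}))      // node-pool-id=a
	fmt.Println(buildNodeSelector([]string{"a", "b"})) // node-pool-id in (a,b)
}
```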
… var

- GPU checker now compares the nvidia.com/gpu.count label (expected) vs allocatable (actual)
- Auto-skips non-GPU nodes when the label is absent
- Remove CIVO_NODE_DESIRED_GPU_COUNT env var and NewDefaultCheckers parameter
- NewDefaultCheckers() always includes the GPU checker (self-determines GPU node)
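A sketch of the label-vs-allocatable comparison, folding in the reason strings this PR later settles on (GPUCountMatch, NonGPUNode, and the descriptive mismatch message). The invalid-label branch is an assumption; the commit does not say how a malformed label is handled.

```go
package main

import (
	"fmt"
	"strconv"
)

// checkGPU compares the nvidia.com/gpu.count label (expected) against the
// node's allocatable GPU count (actual). Nodes without the label are
// treated as non-GPU nodes and skipped.
func checkGPU(labels map[string]string, allocatable int64) (bool, string) {
	v, present := labels["nvidia.com/gpu.count"]
	if !present {
		return true, "NonGPUNode" // auto-skip: not a GPU node
	}
	expected, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		// Assumed behavior: a malformed label counts as a mismatch.
		return false, "GPUCountMismatch"
	}
	if allocatable != expected {
		return false, fmt.Sprintf("expected %d but got %d", expected, allocatable)
	}
	return true, "GPUCountMatch"
}

func main() {
	ok, reason := checkGPU(map[string]string{"nvidia.com/gpu.count": "8"}, 7)
	fmt.Println(ok, reason) // false expected 8 but got 7
}
```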
…tion

- NewWatcher signature: (ctx, clusterID, nodePoolIDs, opts...) → (ctx, clusterID, opts...)
- Add WithNodePoolIDs option with append semantics (empty is a no-op)
clusterID was only used by the old rebootNode(), which moved to the executor. NewWatcher signature simplified: (ctx, clusterID, opts...) → (ctx, opts...)
The label is static (set by GFD) and correctly identifies GPU nodes even when all GPUs are unhealthy and allocatable drops to 0.
…ttings

- CIVO_NODE_MONITOR_ONLY → CIVO_NODE_AGENT_MONITOR_ONLY
- CIVO_NODE_METRICS_PORT → CIVO_NODE_AGENT_METRICS_PORT
Force-pushed from d8bc10b to 3204673
- Standard nodes: reboot wait 10min (CIVO_NODE_REBOOT_WAIT_MINUTES)
- GPU nodes: reboot wait 40min (CIVO_GPU_NODE_REBOOT_WAIT_MINUTES)
- NodeReady threshold changed from 10min to 5min per TDD
- Replace single rebootTimeWindowMinutes with rebootWaitMinutes + gpuRebootWaitMinutes
- WaitingReboot phase checks state.IsGPUNode() to select the appropriate wait time
WithNodePoolIDs now accepts a comma-separated string and parses internally. Remove parseNodePoolIDs from main.go.
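The comma-separated parsing can be sketched as a functional option. The `watcher` struct and trimming of whitespace/empty segments are assumptions; only the comma-separated input and append semantics come from the commits.

```go
package main

import (
	"fmt"
	"strings"
)

// watcher is a minimal stand-in for the real watcher type.
type watcher struct{ nodePoolIDs []string }

// Option is the functional-option type (name matches the FOP convention
// mentioned in the commits).
type Option func(*watcher)

// WithNodePoolIDs accepts a comma-separated string and parses it
// internally, appending non-empty IDs; an empty string is a no-op.
func WithNodePoolIDs(ids string) Option {
	return func(w *watcher) {
		for _, id := range strings.Split(ids, ",") {
			if id = strings.TrimSpace(id); id != "" {
				w.nodePoolIDs = append(w.nodePoolIDs, id)
			}
		}
	}
}

func main() {
	w := &watcher{}
	WithNodePoolIDs("pool-a, pool-b,,")(w)
	fmt.Println(w.nodePoolIDs) // [pool-a pool-b]
}
```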
Standard nodes should transition to PhaseDrain → PhaseReplace instead of retrying reboot indefinitely. GPU nodes keep reboot-only.
WithMonitorOnly now accepts a string and parses internally via strconv.ParseBool. Empty or unparsable values preserve the default (true).
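The default-preserving parse described above can be sketched as follows (the `watcher` struct is a stand-in; the ParseBool behavior matches the commit).

```go
package main

import (
	"fmt"
	"strconv"
)

type watcher struct{ monitorOnly bool }

type Option func(*watcher)

// WithMonitorOnly parses the string with strconv.ParseBool; empty or
// unparsable values leave the default (true) untouched.
func WithMonitorOnly(s string) Option {
	return func(w *watcher) {
		if v, err := strconv.ParseBool(s); err == nil {
			w.monitorOnly = v
		}
	}
}

func main() {
	w := &watcher{monitorOnly: true} // default
	WithMonitorOnly("not-a-bool")(w)
	fmt.Println(w.monitorOnly) // true: default preserved
	WithMonitorOnly("false")(w)
	fmt.Println(w.monitorOnly) // false
}
```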
Monitors CiliumAgentIsReady node condition. Auto-skips nodes where the condition is absent (Cilium not installed). Threshold: 10min.
…um checker

Cilium sets NetworkUnavailable=False with reason CiliumIsUp, not a custom CiliumAgentIsReady condition. Skip if the reason is not CiliumIsUp (other CNI).
Check now returns (bool, string) where the string is the reason. Condition-based checkers pass through cond.Reason directly. GPU checker returns a descriptive reason (e.g. "expected 8 but got 7"). The reason is used as the result label in HealthCheckTotal metric.
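A condition-based checker under the new signature can be sketched like this. The local `nodeCondition` type replaces corev1.NodeCondition to stay self-contained, and the fallback reason for a missing condition is an assumption, not from the PR.

```go
package main

import "fmt"

// nodeCondition is a minimal stand-in for corev1.NodeCondition.
type nodeCondition struct {
	Type, Status, Reason string
}

// checkNodeReady demonstrates the (bool, string) signature: the reason of
// the matching condition is passed through for use as a metrics label.
func checkNodeReady(conds []nodeCondition) (bool, string) {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True", c.Reason
		}
	}
	return false, "ReadyConditionMissing" // assumed fallback reason
}

func main() {
	ok, reason := checkNodeReady([]nodeCondition{{"Ready", "False", "KubeletNotReady"}})
	fmt.Println(ok, reason) // false KubeletNotReady
}
```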
GPU node detection belongs in the health package where gpuCountLabel is defined. Removes strconv and corev1 imports from watcher.
…odeSelector

- Health: Threshold tests for all checkers, HasGPU tests
- Operation: NewCivoExecutor validation tests (empty clusterID, apiKey, apiURL)
- Watcher: buildNodeSelector tests (nil, single, multiple)
- Coverage: health 84→98%, operation 50→81%, total 71→76%
Each checker now defines its threshold as a named constant (nodeReadyThreshold, gpuThreshold, diskPressureThreshold, ciliumThreshold).
- Add NewNopExecutor() that performs no operations
- Set as the default in defaultOptions
- Add nil check in WithExecutor to preserve the default when nil is passed
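The nop default plus nil guard can be sketched as below; the `Executor` method set shown (a single `Reboot`) is illustrative, not the actual interface.

```go
package main

import "fmt"

// Executor is a minimal stand-in for the operation executor interface.
type Executor interface{ Reboot(instanceID string) error }

type nopExecutor struct{}

// Reboot on the nop executor does nothing, making it a safe default.
func (nopExecutor) Reboot(string) error { return nil }

func NewNopExecutor() Executor { return nopExecutor{} }

type watcher struct{ executor Executor }

type Option func(*watcher)

// WithExecutor preserves the nop default when nil is passed.
func WithExecutor(e Executor) Option {
	return func(w *watcher) {
		if e != nil {
			w.executor = e
		}
	}
}

func main() {
	w := &watcher{executor: NewNopExecutor()} // default from defaultOptions
	WithExecutor(nil)(w)
	fmt.Println(w.executor != nil) // true: default preserved
}
```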
…abelSelector
These options are only used within the watcher package tests, so they don't need to be exported.
- Rename 'name' to 'description' in test struct fields
- Use verb-driven descriptions ("returns", "detects", "skips", etc.)
- Use 'test' instead of 'tt' in range loops
- Initialize mocks inside t.Run for isolation (operation tests)
- Use test.description in t.Run calls
Use if-init statement to call Phase() once.
- Metrics server now uses http.Server with Shutdown() on context cancellation
- Ignores http.ErrServerClosed on normal shutdown
- Fix WithKubernetesClientConfigPath godoc (was a copy of WithKubernetesClient)
cfg may be nil when BuildConfigFromFlags fails.
…y metrics

Replace dynamic formatted reasons (e.g. "Expected 8 but got 7") with enumerable constants: GPUCountMatch, GPUCountMismatch, NoAllocatableGPU, NonGPUNode.
When a node is removed from the cluster, Cleanup only deleted the state entry but left gauge metrics (recovery_phase, unhealthy_duration) with stale values. Now watcher.run() deletes all metric labels for removed nodes before calling Cleanup.
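The ordering of the fix (delete metric labels, then clean up state) can be sketched with small stand-ins; `gaugeVec` here mimics a Prometheus GaugeVec's DeleteLabelValues without depending on the client library.

```go
package main

import "fmt"

// gaugeVec is a map-backed stand-in for a labelled Prometheus gauge.
type gaugeVec map[string]float64

func (g gaugeVec) Set(node string, v float64)    { g[node] = v }
func (g gaugeVec) DeleteLabelValues(node string) { delete(g, node) }

// stateStore is a stand-in for the watcher's per-node state store.
type stateStore map[string]struct{}

func (s stateStore) Cleanup(node string) { delete(s, node) }

func main() {
	phase := gaugeVec{}
	states := stateStore{"node-1": {}}
	phase.Set("node-1", 2)

	// node-1 was removed from the cluster: delete its metric labels first,
	// then drop the state entry, so no stale gauge value survives.
	phase.DeleteLabelValues("node-1")
	states.Cleanup("node-1")
	fmt.Println(len(phase), len(states)) // 0 0
}
```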
FormatLabelSelector(nil) returns "<none>" which is an invalid selector. Only apply the label selector option when nodeLabelSelector is non-nil.
Summary
- Recovery state machine (Healthy → Unhealthy → WaitingReboot) with `PhaseUnknown` as the zero value
- `pkg/health`: `HealthChecker` interface with per-checker thresholds and reason reporting
  - `NodeReadyChecker` (5min), `GPUChecker` (10min), `DiskPressureChecker` (30min), `CiliumChecker` (10min)
  - GPU check compares the `nvidia.com/gpu.count` label vs `allocatable` (no static env var)
  - Cilium check monitors the `NetworkUnavailable` condition with the `CiliumIsUp` reason (auto-skips non-Cilium CNI)
  - `Check()` returns `(bool, string)`; the reason is used as a metrics label for observability
- `pkg/operation`: `Executor` interface with `CivoExecutor` (Functional Option Pattern) and `NopExecutor` default
- `pkg/metrics`: Prometheus metrics (health checks with reason, recovery actions, phase, unhealthy duration)
- `pkg/watcher/state`: `NodePhase` enum, `NodeState` (private fields + getters), `StateStore` with transition methods
- Removed: `fake.go`, inline reboot/check functions, `sync.Map` tracking, `CIVO_NODE_DESIRED_GPU_COUNT`

Environment Variables
- `CIVO_NODE_AGENT_MONITOR_ONLY` (default: `true`)
- `CIVO_NODE_AGENT_METRICS_PORT` (default: `9625`)
- `CIVO_NODE_POOL_ID` (comma-separated; empty selects all nodes)
- `CIVO_NODE_REBOOT_WAIT_MINUTES` (default: `10`)
- `CIVO_GPU_NODE_REBOOT_WAIT_MINUTES` (default: `40`)

Command Line Flags
- `--kubeconfig` (default: `/etc/rancher/k3s/k3s.yaml`)
- `--version`

Design Decisions
- Monitor-only mode (default `true`): logs recovery actions without executing them
- GPU detection compares `nvidia.com/gpu.count` (static, set by GFD) vs `allocatable` (dynamic); `health.HasGPU` also uses the label to correctly identify GPU nodes even when all GPUs are unhealthy
- State transitions go through `StateStore` transition methods
- Test-only hooks: `withNowFunc`, `withNodeLister`

Not in This PR
Test Plan
- `go test ./...` passes (all 4 packages)
- `go vet ./...` clean
- `curl localhost:9625/metrics` returns Prometheus metrics in a test cluster
- `CIVO_NODE_AGENT_MONITOR_ONLY=true` logs but does not reboot
- `CIVO_NODE_AGENT_MONITOR_ONLY=false` triggers an actual `HardRebootInstance` API call