fix(bootstrap): surface Helm install failure on namespace timeout (#211) #486
Manoj-engineer wants to merge 1 commit into NVIDIA:main
Conversation
let (job_output, job_exit) = exec_capture_with_exit(
    docker,
    container_name,
    vec![
        "sh".to_string(),
        "-c".to_string(),
        format!(
            "KUBECONFIG={kubeconfig} kubectl get jobs -n kube-system \
            --no-headers -o custom-columns=NAME:.metadata.name,FAILED:.status.failed \
            2>/dev/null | awk '{{if ($2 != \"0\") print $1}}'"
        ),
    ],
)
.await
.ok()?;
Please make sure this won't return false positives. If the namespace timeout happens for any other reason, this predicate will still select completed k3s Helm jobs because .status.failed is usually absent for successful Jobs.
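To make the concern concrete, here is a minimal Rust sketch that roughly mirrors the `$2 != "0"` predicate on sample `kubectl` output. The function name and the job names in the usage note are hypothetical, and it assumes kubectl prints `<none>` in a custom column when the field is absent.

```rust
/// Roughly mirrors the awk filter `$2 != "0"`: a job is selected whenever
/// the FAILED column is anything other than the literal string "0".
/// (Hypothetical helper for illustration; not part of the PR.)
fn selected_by_original_filter(kubectl_output: &str) -> Vec<&str> {
    kubectl_output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            let name = cols.next()?; // NAME column
            let failed = cols.next()?; // FAILED column, "<none>" when absent
            // String comparison only: "<none>" is not "0", so it is selected.
            (failed != "0").then_some(name)
        })
        .collect()
}
```

Given the two rows `helm-install-traefik <none>` and `helm-install-example 1`, this returns both names, so a Job with no `status.failed` is reported alongside the one that actually failed.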
-    GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination, get_gateway_metadata,
-    list_gateways, load_active_gateway, load_gateway_metadata, load_last_sandbox,
-    remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway, save_last_sandbox,
-    store_gateway_metadata,
+    GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination,
+    get_gateway_metadata, list_gateways, load_active_gateway, load_gateway_metadata,
+    load_last_sandbox, remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway,
+    save_last_sandbox, store_gateway_metadata,
 };
Review failing CI checks
[rust:format:check] $ cargo fmt --all -- --check
info: syncing channel updates for stable-x86_64-unknown-linux-gnu
info: latest update on 2026-03-05 for version 1.94.0 (4a4ef493e 2026-03-02)
info: downloading 6 components
Diff in /__w/OpenShell/OpenShell/crates/openshell-bootstrap/src/lib.rs:48:
DockerPreflight, ExistingGatewayInfo, check_docker_available, create_ssh_docker_client,
};
pub use crate::metadata::{
- GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination,
- get_gateway_metadata, list_gateways, load_active_gateway, load_gateway_metadata,
- load_last_sandbox, remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway,
- save_last_sandbox, store_gateway_metadata,
+ GatewayMetadata, clear_active_gateway, extract_host_from_ssh_destination, get_gateway_metadata,
+ list_gateways, load_active_gateway, load_gateway_metadata, load_last_sandbox,
+ remove_gateway_metadata, resolve_ssh_hostname, save_active_gateway, save_last_sandbox,
+ store_gateway_metadata,
};
/// Options for remote SSH deployment.
[rust:format:check] ERROR task failed
Good catch on both. Fixed:
False positives: changed the awk filter from $2 != "0" to $2 ~ /^[1-9]/ so it only matches numeric non-zero values. <none> (absent status.failed on successful jobs) no longer matches.
Formatting: the rebase picked up upstream's import line-wrapping change, which conflicted with our format; ran cargo fmt --all to resolve it.
90/90 tests still pass.
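For comparison, the same check could be done on the Rust side after capturing the raw `kubectl` output. This is only a sketch under assumptions: `failed_job_names` is a hypothetical helper, and it is slightly stricter than `$2 ~ /^[1-9]/` because it requires the whole FAILED column to parse as an integer.

```rust
/// Returns the names of jobs whose FAILED column is a non-zero integer.
/// Rows with "<none>", "0", or any non-numeric value are ignored.
/// (Hypothetical helper for illustration; not part of the PR.)
fn failed_job_names(kubectl_output: &str) -> Vec<String> {
    kubectl_output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            let name = cols.next()?; // NAME column
            // Non-numeric values such as "<none>" fail to parse and are skipped.
            let failed: u32 = cols.next()?.parse().ok()?;
            (failed > 0).then(|| name.to_string())
        })
        .collect()
}
```

Parsing the count avoids depending on how awk compares strings and numbers, at the cost of moving the filtering out of the shell pipeline.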
…IDIA#211) Signed-off-by: Manoj-engineer <194872717+Manoj-engineer@users.noreply.github.com>
718d44e to 7541eb5
I am closing this as we'll be moving away from K3s for single-tenant deployment. Please see the roadmap for future supported deployment modes: https://github.com/orgs/NVIDIA/projects/233/views/1
Vouched at: #420
Summary
When gateway start times out waiting for the openshell namespace, the error message now checks for failed helm-install-* jobs in kube-system and surfaces the actual Helm error and last 30 log lines instead of the generic "namespace not ready" message.
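As a rough illustration of the behaviour described above, the sketch below trims a log to its last lines and folds an optional Helm hint into the timeout message. The function names and the exact message wording are assumptions for illustration and are not taken from the PR's code.

```rust
/// Returns at most the last `n` lines of `log`, in their original order.
fn tail_lines(log: &str, n: usize) -> String {
    let lines: Vec<&str> = log.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].join("\n")
}

/// Builds the timeout message. With a Helm hint, the hint is appended;
/// without one, the generic "namespace not ready" text is used.
/// (Hypothetical wording; the real message may differ.)
fn namespace_timeout_message(helm_hint: Option<&str>) -> String {
    let base = "timed out waiting for namespace 'openshell'";
    match helm_hint {
        Some(hint) => format!("{base}\n\nHelm install failure detected:\n{hint}"),
        None => format!("{base}: namespace not ready"),
    }
}
```

A caller would pass something like `Some(&tail_lines(&job_logs, 30))` when a failed job was found and `None` otherwise, so the generic message remains the fallback.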
Related Issue
Fixes #211
Changes
- diagnose_helm_failure() in openshell-bootstrap/src/lib.rs that queries helm-install-* jobs in kube-system for failed pods and returns job conditions
- wait_for_namespace() final timeout branch
- status.failed stays <none> during backoff retry window; filter on != "0" instead of != "<none>" && != "0" to catch actively-failing jobs
- helm_failure_hint_is_included_in_namespace_timeout_message
Testing
- mise run pre-commit passes
- … serviceaccount.yaml, confirmed the Helm error appears in the terminal output on timeout
Checklist