Fix ordering in k8s-cluster.yml to install Helm properly and run all commands from kube-master[0], fixing CentOS install #1128
Merged
ajdecon merged 15 commits into NVIDIA:master on Mar 24, 2022
Conversation
supertetelman
commented
Mar 23, 2022
- include: ../bootstrap/bootstrap-openshift.yml

# GPU operator
- hosts: kube-master[0]
Contributor
Author
The expectation is that Helm commands are run from the provisioning node. No need to install Helm and run it on the management systems.
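As a rough sketch of the pattern described here (the play and task contents are illustrative, not taken from this PR), running Helm from the provisioning node rather than installing it on the management systems looks like:

```yaml
# Illustrative sketch only: run Helm from the provisioning node
# (localhost) so the binary never needs to exist on management hosts.
- hosts: localhost
  connection: local
  tasks:
    - name: Install a chart from the provisioning node
      command: helm upgrade --install my-release my-repo/my-chart --wait
```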
ajdecon
suggested changes
Mar 23, 2022
Two issues to address:
- ansible-lint failed with a minor spacing issue:
Linting ./nfs-client-provisioner
WARNING Listing 1 violation(s) that are fatal
tasks/main.yml:4: [var-spacing] [LOW] Variables should have spaces before and after: "{{k8s_nfs_client_repo_name}}"
Warning: var-spacing Variables should have spaces before and after: "{{k8s_nfs_client_repo_name}}"
You can skip specific rules or tags by adding them to your configuration file:
# .ansible-lint
warn_list: # or 'skip_list' to silence them completely
- var-spacing # Variables should have spaces before and after: {{ var_name }}
Finished with 1 failure(s), 0 warning(s) on 2 files.
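The var-spacing failure above is just missing whitespace inside the Jinja2 braces. The fix (shown against a hypothetical key name, since the full tasks/main.yml isn't quoted in this thread) is:

```yaml
# tasks/main.yml, line 4 — before (fails ansible-lint var-spacing):
#   name: "{{k8s_nfs_client_repo_name}}"
# after — spaces inside the braces satisfy the rule:
name: "{{ k8s_nfs_client_repo_name }}"
```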
- The Jenkins end-to-end test failed. This might be a transient failure, so it's worth re-running, then debugging if it repeats.
TASK [install nfs-client-provisioner] ******************************************
fatal: [localhost]: FAILED! => changed=false
cmd:
- /usr/local/bin/helm
- upgrade
- --install
- nfs-subdir-external-provisioner
- nfs-subdir-external-provisioner/nfs-subdir-external-provisioner
- --create-namespace
- --namespace
- deepops-nfs-client-provisioner
- --version
- 4.0.13
- --set
- nfs.server=127.0.0.1
- --set
- nfs.path=/export/deepops_nfs
- --set
- storageClass.defaultClass=true
- --wait
delta: '0:00:00.060010'
end: '2022-03-23 03:29:00.175032'
msg: non-zero return code
rc: 1
start: '2022-03-23 03:29:00.115022'
stderr: |-
Error: Kubernetes cluster unreachable: <html><head><meta http-equiv='refresh' content='1;url=/login?from=%2Fversion%3Ftimeout%3D32s'/><script>window.location.replace('/login?from=%2Fversion%3Ftimeout%3D32s');</script></head><body style='background-color:white; color:white;'>
Authentication required
<!--
You are authenticated as: anonymous
Groups that you are in:
Permission you need to have (but didn't): hudson.model.Hudson.Read
... which is implied by: hudson.security.Permission.GenericRead
... which is implied by: hudson.model.Hudson.Administer
-->
</body></html>
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
ajdecon
approved these changes
Mar 24, 2022
Our Helm installs were doing a mix of running from localhost and/or kube-master[0]. This was causing issues in the nfs-client-provisioner because the CentOS Kubespray installer was not properly installing kubectl on the kube-master nodes.
For now I am aligning everything with what we did for the GPU Operator. In the future, it would make sense to use the now-functional helm Ansible module and run everything from localhost (the provisioning node) instead of kube-master[0]. That would let us install fewer binaries on the management nodes, but beyond that it is not a necessary change.
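If the project did move to the helm Ansible module mentioned here, the failing nfs-client-provisioner install could be expressed roughly as below. Chart, release, namespace, version, and values are taken from the log in this thread; the use of `kubernetes.core.helm` is the suggested future direction, not what this PR implements.

```yaml
# Sketch using the kubernetes.core.helm module instead of shelling out
# to the helm binary. Assumes the chart repo was already added.
- name: install nfs-client-provisioner
  kubernetes.core.helm:
    name: nfs-subdir-external-provisioner
    chart_ref: nfs-subdir-external-provisioner/nfs-subdir-external-provisioner
    chart_version: "4.0.13"
    release_namespace: deepops-nfs-client-provisioner
    create_namespace: true
    wait: true
    values:
      nfs:
        server: 127.0.0.1
        path: /export/deepops_nfs
      storageClass:
        defaultClass: true
```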
I also added the standard proxy settings to a few Helm installs where they were missing.
Additionally, I moved the block that runs helm/kubectl commands so it comes after the block that actually installs the kubectl/helm binaries. The old ordering caused edge-case failures on CentOS because the software is installed differently on Ubuntu and CentOS.
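The reordering described here amounts to something like the following in k8s-cluster.yml (simplified; the task bodies are placeholders, not the real plays):

```yaml
# Simplified sketch of the corrected play ordering: install the
# kubectl/helm binaries first, then run the plays that invoke them.
- hosts: kube-master[0]
  tasks:
    - name: install kubectl and helm binaries first
      debug:
        msg: "binary install tasks run here"

- hosts: kube-master[0]
  tasks:
    - name: run helm/kubectl commands afterwards
      debug:
        msg: "chart installs and kubectl commands run here"
```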
The automated testing already covers all the paths this change touches.