GPU Operator automation with NVIDIA AI Enterprise #1059
ajdecon merged 4 commits into NVIDIA:master on Dec 3, 2021
ajdecon
approved these changes
Nov 18, 2021
@iamadrigal : LGTM!
Tested on a three-node VM cluster with two worker nodes hosting A100 GPUs (thanks for providing the test platform!). Performed the following tests in order to validate the successful deployment:
# On the k8s control plane node:
nvidia@deepops-admin:~$ sudo kubectl get pods -n gpu-operator-resources
NAME                                                              READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-4rbnl                                       1/1     Running     0          19m
gpu-feature-discovery-gwg2v                                       1/1     Running     0          19m
gpu-operator-7bcc547564-vf5nq                                     1/1     Running     0          19m
nvidia-container-toolkit-daemonset-9rxrm                          1/1     Running     0          19m
nvidia-container-toolkit-daemonset-tm24d                          1/1     Running     0          19m
nvidia-cuda-validator-kgcxl                                       0/1     Completed   0          17m
nvidia-cuda-validator-mdldv                                       0/1     Completed   0          17m
nvidia-dcgm-exporter-n48tb                                        1/1     Running     0          19m
nvidia-dcgm-exporter-r5q5c                                        1/1     Running     0          19m
nvidia-device-plugin-daemonset-7p6zc                              1/1     Running     0          19m
nvidia-device-plugin-daemonset-w47mb                              1/1     Running     0          19m
nvidia-device-plugin-validator-df8hb                              0/1     Completed   0          16m
nvidia-device-plugin-validator-ftvhl                              0/1     Completed   0          17m
nvidia-driver-daemonset-2www4                                     1/1     Running     0          19m
nvidia-driver-daemonset-vs9f5                                     1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-master-74db7c56lxmd7   1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-77nt8           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-bld82           1/1     Running     0          19m
nvidia-gpu-operator-node-feature-discovery-worker-tv4vl           1/1     Running     0          19m
nvidia-mig-manager-bhqkl                                          1/1     Running     0          17m
nvidia-mig-manager-ff52n                                          1/1     Running     0          16m
nvidia-operator-validator-7nhgn                                   1/1     Running     0          19m
nvidia-operator-validator-flpvc                                   1/1     Running     0          19m
nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-cuda-validator-kgcxl
cuda workload validation is successful
nvidia@deepops-admin:~$ sudo kubectl logs -n gpu-operator-resources nvidia-device-plugin-validator-df8hb
device-plugin workload validation is successful
nvidia@deepops-admin:~$ sudo kubectl exec -n gpu-operator-resources --stdin --tty nvidia-device-plugin-daemonset-7p6zc -- /bin/bash
[root@nvidia-device-plugin-daemonset-7p6zc /]# nvidia-smi -L
GPU 0: GRID A100-2-10C (UUID: GPU-1dfdcca5-28f9-11b2-904f-4c80d4e1ed0c)
MIG 2g.10gb Device 0: (UUID: MIG-f1029a2c-305f-5223-af78-3136dd9fde27)
Note: in order to successfully deploy, I had to make the following changes to a default DeepOps configuration:
# Required configuration for NVAIE
deepops_gpu_operator_enabled: true
gpu_operator_nvaie_enable: true
gpu_operator_chart_version: "1.8.1"
gpu_operator_driver_registry: "nvcr.io/nvaie"
gpu_operator_driver_version: "470.63.01"
gpu_operator_registry_email: "<my-email-address>"
gpu_operator_registry_password: "<my-ngc-key>"
gpu_operator_nvaie_nls_token: "<my-nls-token>"
For documentation purposes, I'd suggest adding the lines above to the config.example/group_vars/k8s-cluster.yml file, commented out to provide an example of how to configure NVAIE.
I don't think that's a blocker to approving the PR, but if you want to add those lines to the example config then I will re-approve the PR when done!
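As a rough sketch, the commented-out example in config.example/group_vars/k8s-cluster.yml could look something like the block below. The variable names and values are the ones listed above; the header comment wording and the position within the file are assumptions, and the angle-bracket placeholders stand in for real credentials.

```yaml
# Example: GPU Operator with NVIDIA AI Enterprise (NVAIE).
# Uncomment and fill in the placeholders to enable.
# (Placement in this file and comment wording are illustrative assumptions.)
#deepops_gpu_operator_enabled: true
#gpu_operator_nvaie_enable: true
#gpu_operator_chart_version: "1.8.1"
#gpu_operator_driver_registry: "nvcr.io/nvaie"
#gpu_operator_driver_version: "470.63.01"
#gpu_operator_registry_email: "<my-email-address>"
#gpu_operator_registry_password: "<my-ngc-key>"
#gpu_operator_nvaie_nls_token: "<my-nls-token>"
```

Keeping the lines commented out leaves the default DeepOps behavior unchanged while still documenting which settings an NVAIE deployment needs.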
This PR contains modifications to enable GPU Operator configuration when deploying DeepOps on vGPU clusters with NVIDIA AI Enterprise.