Improve documentation & playbook for DGX firmware upgrade, include paramiko in setup.sh #1058
ajdecon
left a comment
Walked through this process and tested on a DGX A100 system.
The actual Ansible execution went fine, and I was able to successfully execute both diagnostic and update runs.
Most of the inline comments are suggested additions or clarifications to the docs.
The only blocking request is in setup.sh, where paramiko is misspelled. 😉
```sh
    netaddr \
    ruamel.yaml \
    PyMySQL \
    parimiko \
```
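For reference, a sketch of the corrected dependency list (the surrounding `pip3 install` invocation is assumed from context; the only substantive fix is `parimiko` → `paramiko`):

```shell
# Sketch of the fixed package list from setup.sh. The pip3 install wrapper
# is assumed from context; "parimiko" is corrected to "paramiko".
PIP_PACKAGES="netaddr ruamel.yaml PyMySQL paramiko"
echo "$PIP_PACKAGES"
# A real setup.sh would then run: pip3 install $PIP_PACKAGES
```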
```diff
 ## Collect Diagnostics

-The `nvidia-dgx-diag.yml` playbook leverages the `nvidia-dgx-firmware` role to run a diagnostic. This will collect health and configuration information for all nodes across a cluster. After being executed all logs will be copied locally to the provisioning system at `config/logs`. Logs are stored by hostname with timestamps. To change where logs are stored change the `local_log_directory` variable.
+The [nvidia-dgx-diag.yml](../../playbooks/nvidia-dgx/nvidia-dgx-diag.yml) playbook leverages the [nvidia-dgx-firmware](../../roles/nvidia-dgx-firmware) role to run a diagnostic. This will collect health and configuration information for all nodes across a cluster. After being executed all logs will be copied locally to the provisioning system at `config/logs`. Logs are stored by hostname with timestamps. To change where logs are stored change the `local_log_directory` variable.
```
Nit: You probably want to add a note that the diag playbook still requires the firmware update container, e.g. to check the firmware versions. Or potentially add a flag in run-diagnostics.yml to make running check-firmware.yml optional?
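One possible shape for such an optional flag (the variable name `run_firmware_check` and the task layout are illustrative, not from the repo):

```yml
# Hypothetical sketch: gate the firmware-version check behind a
# default-true variable so diag runs can skip the firmware container.
- name: Check installed firmware versions
  include_tasks: check-firmware.yml
  when: run_firmware_check | default(true) | bool
```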
```diff
 @@ -9,20 +9,21 @@ While documentation exists to [run system health checks](https://docs.nvidia.com

 ## Setup
```
The FWUC release notes may specify a minimum DGX OS release (e.g., 21.11.x requires 5.0.1 or greater). We should add a note to this doc to make sure to check the FWUC release notes.
```diff
 If running on a DGX cluster, it is necessary to provide the DGX firmware container in order to gather installed firmware information or perform firmware updates. If running on a non-DGX cluster skip this first step and set `load_firmware` and `update_firmware` to `false`.

-1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `src/containers/dgx-firmware`, keeping the original file name. Update the role variables to reflect the version being used.
+1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.
```
Specify to download in tar.gz format.
````diff
 1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.

 ```yml
 # The Docker repo name
````
In comment, note that this will change depending on the DGX hardware type.
````diff
 firmware_update_container: "nvfw-dgxa100_21.11.2_211102.tar.gz"
 ```
````
```diff
 2. Change the `nv_mgmt_interface` variable to reflect the systems being collected from.
```
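An illustrative value for that variable (the interface name is an example only; confirm the actual management interface with `ip addr` on a node, since ConnectX-6-style names will not apply if the RJ-45 port is cabled for management):

```yml
# Example only: verify the real interface name with `ip addr` on a node.
# A ConnectX-6 port might appear as enp225s0f0; the RJ-45 management
# port will have a different name.
nv_mgmt_interface: enp225s0f0
```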
Make a note to check the actual interface in use. The examples will only apply if the ConnectX-6 is being used for management, not if using the RJ-45.
```diff
-After verifying a single node can be successfully upgraded, upgrade the rest of the cluster with the Ansible automation. For larger clusters it is recommended to do this in batches. Perform an initial test of the provisioning node by updating a single node with Ansible and then deploy in batches of ~40 nodes. It is not necessary to do this, but in the case of an error or outage in the provisioning node this will reduce risk of firmware upgrade failure.
+After verifying that a single node can be successfully upgraded, upgrade the rest of the cluster with the Ansible automation. For larger clusters it is recommended to do this in batches. Perform an initial test of the provisioning node by updating a single node with Ansible and then deploy in batches of ~40 nodes. It is not necessary to do this, but in the case of an error or outage in the provisioning node this will reduce risk of firmware upgrade failure.

 Depending on the number of nodes and the required firmware updates, this process can take 1-4 hours to complete. For very large clusters it is not uncommon for some nodes to fail to update. This can occur for several reasons such as timeouts or networking issues. When this occurs manually inspect the logs, run an `nvsm show health` and if healthy attempt to re-run the playbook on those nodes. If failures persist contact NVIDIA support and attempt the upgrade manually.
```
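The retry step described above might look like the following (host names and the playbook path are assumptions for illustration, not taken from the repo; `-l`/`--limit` is standard `ansible-playbook` usage):

```shell
# Illustrative only: after `nvsm show health` reports healthy on each
# failed node, re-run the playbook limited to just those hosts.
FAILED_NODES="node041,node042"
CMD="ansible-playbook -l $FAILED_NODES playbooks/nvidia-dgx/nvidia-dgx-firmware-update.yml"
echo "$CMD"
```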
Also refer the customer to the FWUC release notes for a table of expected time to complete the update.
```diff
-# reset the BMC on all nodes after a BMC firmware update
+# Reset the BMC on all nodes after a BMC firmware update
 ansible slurm-node -ba "ipmitool mc reset cold"
```
This should be done automatically by the firmware update, should not be needed on DGX A100? But we should verify for DGX-1 and DGX-2.
ajdecon
left a comment
Approving to merge and unblock dependencies, will add some extra changes in a later PR
This is an overall improvement to the documentation flow for updating DGX firmware. In updating these docs I went ahead and made some of the playbooks more robust and improved their usability.