Improve documentation & playbook for DGX firmware upgrade, include paramiko in setup.sh #1058
ajdecon
left a comment
Walked through this process and tested on a DGX A100 system.
The actual Ansible execution went fine, and I was able to successfully execute both diagnostic and update runs.
Most of the inline comments are suggested additions or clarifications to the docs.
The only blocking request is in setup.sh, where paramiko is misspelled. 😉
```sh
    netaddr \
    ruamel.yaml \
    PyMySQL \
    parimiko \
```
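For reference, a sketch of the corrected dependency list (the surrounding `pip3 install` invocation is assumed from context; the only substantive fix is `parimiko` → `paramiko`):

```shell
# Sketch of the fixed package list from setup.sh. The pip3 install wrapper
# is assumed from context; "parimiko" is corrected to "paramiko".
PIP_PACKAGES="netaddr ruamel.yaml PyMySQL paramiko"
echo "$PIP_PACKAGES"
# A real setup.sh would then run: pip3 install $PIP_PACKAGES
```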
```diff
 ## Collect Diagnostics

-The `nvidia-dgx-diag.yml` playbook leverages the `nvidia-dgx-firmware` role to run a diagnostic. This will collect health and configuration information for all nodes across a cluster. After being executed all logs will be copied locally to the provisioning system at `config/logs`. Logs are stored by hostname with timestamps. To change where logs are stored change the `local_log_directory` variable.
+The [nvidia-dgx-diag.yml](../../playbooks/nvidia-dgx/nvidia-dgx-diag.yml) playbook leverages the [nvidia-dgx-firmware](../../roles/nvidia-dgx-firmware) role to run a diagnostic. This will collect health and configuration information for all nodes across a cluster. After being executed all logs will be copied locally to the provisioning system at `config/logs`. Logs are stored by hostname with timestamps. To change where logs are stored change the `local_log_directory` variable.
```
Nit: You probably want to add a note that the diag playbook still requires the firmware update container, e.g. to check the firmware versions. Or potentially add a flag in run-diagnostics.yml to make running check-firmware.yml optional?
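One possible shape for such an optional flag (the variable name `run_firmware_check` and the task layout are illustrative, not from the repo):

```yml
# Hypothetical sketch: gate the firmware-version check behind a
# default-true variable so diag runs can skip the firmware container.
- name: Check installed firmware versions
  include_tasks: check-firmware.yml
  when: run_firmware_check | default(true) | bool
```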
```diff
 @@ -9,20 +9,21 @@ While documentation exists to [run system health checks](https://docs.nvidia.com

 ## Setup
```
The FWUC release notes may specify a minimum DGX OS release (e.g., 21.11.x requires 5.0.1 or greater). We should add a note to this doc to make sure to check the FWUC release notes.
```diff
 If running on a DGX cluster, it is necessary to provide the DGX firmware container in order to gather installed firmware information or perform firmware updates. If running on a non-DGX cluster skip this first step and set `load_firmware` and `update_firmware` to `false`.

-1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `src/containers/dgx-firmware`, keeping the original file name. Update the role variables to reflect the version being used.
+1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.
```
Specify to download in tar.gz format.
````diff
 1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.

 ```yml
 # The Docker repo name
````
In comment, note that this will change depending on the DGX hardware type.
````diff
 firmware_update_container: "nvfw-dgxa100_21.11.2_211102.tar.gz"
 ```
````
```diff
 2. Change the `nv_mgmt_interface` variable to reflect the systems being collected from.
```
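An illustrative value for that variable (the interface name is an example only; confirm the actual management interface with `ip addr` on a node, since ConnectX-6-style names will not apply if the RJ-45 port is cabled for management):

```yml
# Example only: verify the real interface name with `ip addr` on a node.
# A ConnectX-6 port might appear as enp225s0f0; the RJ-45 management
# port will have a different name.
nv_mgmt_interface: enp225s0f0
```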
Make a note to check the actual interface in use. The examples will only apply if the ConnectX-6 is being used for management, not if using the RJ-45.
```diff
-After verifying a single node can be successfully upgraded, upgrade the rest of the cluster with the Ansible automation. For larger clusters it is recommended to do this in batches. Perform an initial test of the provisioning node by updating a single node with Ansible and then deploy in batches of ~40 nodes. It is not necessary to do this, but in the case of an error or outage in the provisioning node this will reduce risk of firmware upgrade failure.
+After verifying that a single node can be successfully upgraded, upgrade the rest of the cluster with the Ansible automation. For larger clusters it is recommended to do this in batches. Perform an initial test of the provisioning node by updating a single node with Ansible and then deploy in batches of ~40 nodes. It is not necessary to do this, but in the case of an error or outage in the provisioning node this will reduce risk of firmware upgrade failure.

 Depending on the number of nodes and the required firmware updates, this process can take 1-4 hours to complete. For very large clusters it is not uncommon for some nodes to fail to update. This can occur for several reasons such as timeouts or networking issues. When this occurs manually inspect the logs, run an `nvsm show health` and if healthy attempt to re-run the playbook on those nodes. If failures persist contact NVIDIA support and attempt the upgrade manually.
```
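The retry step described above might look like the following (host names and the playbook path are assumptions for illustration, not taken from the repo; `-l`/`--limit` is standard `ansible-playbook` usage):

```shell
# Illustrative only: after `nvsm show health` reports healthy on each
# failed node, re-run the playbook limited to just those hosts.
FAILED_NODES="node041,node042"
CMD="ansible-playbook -l $FAILED_NODES playbooks/nvidia-dgx/nvidia-dgx-firmware-update.yml"
echo "$CMD"
```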
Also refer the customer to the FWUC release notes for a table of expected time to complete the update.
```diff
-# reset the BMC on all nodes after a BMC firmware update
+# Reset the BMC on all nodes after a BMC firmware update
 ansible slurm-node -ba "ipmitool mc reset cold"
```
This should be done automatically by the firmware update, should not be needed on DGX A100? But we should verify for DGX-1 and DGX-2.
ajdecon
left a comment
Approving to merge and unblock dependencies, will add some extra changes in a later PR
This is an overall improvement to the documentation flow for updating DGX firmware. In updating these docs I went ahead and made some of the playbooks more robust and improved their usability.