
Improve documentation & playbook for DGX firmware upgrade, include paramiko in setup.sh#1058

Merged
ajdecon merged 2 commits into NVIDIA:master from supertetelman:update-fw on Nov 30, 2021

Conversation

@supertetelman
Contributor

This is an overall improvement to the documentation flow for updating DGX firmware. While updating these docs I also made some of the playbooks more robust and improved their usability.

  • Install paramiko via pip in the setup script
  • Improve the overall documentation flow with better examples and more detail for diagnostics/firmware updates
  • Modify the playbook so it does not throw errors when log files/diagnostics were not generated
  • Introduce a flag for backing up or skipping inactive firmware components

Contributor

@ajdecon ajdecon left a comment


Walked through this process and tested on a DGX A100 system.

The actual Ansible execution went fine, and I was able to successfully execute both diagnostic and update runs.

Most of the inline comments are suggested additions or clarifications to the docs.

The only blocking request is in setup.sh, where paramiko is misspelled. 😉

```sh
netaddr \
ruamel.yaml \
PyMySQL \
parimiko \
```

This should be paramiko
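For reference, the corrected fragment of the dependency list in setup.sh would read as follows (the trailing backslash suggests the list continues in the original file; only the spelling changes):

```sh
netaddr \
ruamel.yaml \
PyMySQL \
paramiko \
```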

## Collect Diagnostics

The [nvidia-dgx-diag.yml](../../playbooks/nvidia-dgx/nvidia-dgx-diag.yml) playbook leverages the [nvidia-dgx-firmware](../../roles/nvidia-dgx-firmware) role to run diagnostics, collecting health and configuration information for every node in the cluster. After the playbook runs, all logs are copied to the provisioning system at `config/logs`, organized by hostname and timestamp. To change where logs are stored, set the `local_log_directory` variable.
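A sketch of invoking the playbook (the `slurm-node` host group matches the ad-hoc `ansible` command used later in this doc; adjust to your inventory):

```sh
# Collect diagnostics from all DGX nodes; logs land in config/logs by default
ansible-playbook -l slurm-node playbooks/nvidia-dgx/nvidia-dgx-diag.yml

# Override the local log destination
ansible-playbook -l slurm-node playbooks/nvidia-dgx/nvidia-dgx-diag.yml \
    -e local_log_directory=/tmp/dgx-logs
```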

Nit: You probably want to add a note that the diag playbook still requires the firmware update container, e.g. to check the firmware versions. Or potentially add a flag in run-diagnostics.yml to make running check-firmware.yml optional?

@@ -9,20 +9,21 @@ While documentation exists to [run system health checks](https://docs.nvidia.com


## Setup

The FWUC release notes may specify a minimum DGX OS release (e.g., 21.11.x requires 5.0.1 or greater). We should add a note to this doc reminding readers to check the FWUC release notes.

If running on a DGX cluster, it is necessary to provide the DGX firmware container in order to gather installed firmware information or perform firmware updates. If running on a non-DGX cluster skip this first step and set `load_firmware` and `update_firmware` to `false`.
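On a non-DGX cluster, that would look something like the following in the group variables (a sketch; the file location follows the `config/group_vars/all.yml` convention used below):

```yml
# Skip gathering firmware info and performing firmware updates on non-DGX systems
load_firmware: false
update_firmware: false
```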

1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.

Specify that the container should be downloaded in tar.gz format.

1. Download the latest [DGX firmware container](https://docs.nvidia.com/dgx/dgxa100-fw-container-release-notes/index.html) and put it in `config/containers/dgx-firmware`, keeping the original file name. Add the following variables to the `config/group_vars/all.yml` file, reflecting the version being used.

```yml
# The Docker repo name
firmware_update_container: "nvfw-dgxa100_21.11.2_211102.tar.gz"
```

Contributor

In the comment, note that this will change depending on the DGX hardware type.

2. Change the `nv_mgmt_interface` variable to reflect the systems being collected from.
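For example (the interface name here is purely illustrative; check the actual management interface on your systems, e.g. with `ip addr`):

```yml
# Interface used to reach the node management network; example value only
nv_mgmt_interface: enp225s0f0
```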

Make a note to check the actual interface in use. The examples will only apply if the ConnectX-6 is being used for management, not if using the RJ-45.

After verifying that a single node can be successfully upgraded, upgrade the rest of the cluster with the Ansible automation. For larger clusters it is recommended to do this in batches: perform an initial test of the provisioning node by updating a single node with Ansible, then deploy in batches of roughly 40 nodes. This is not strictly necessary, but in the case of an error or outage on the provisioning node it reduces the risk of a firmware upgrade failure.
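A batched rollout might look like the following; the playbook name and host patterns here are illustrative, not taken from this repo:

```sh
# Test on a single node first
ansible-playbook -l dgx-node-01 playbooks/nvidia-dgx/nvidia-dgx-fw-update.yml

# Then roll out in batches of ~40 nodes
ansible-playbook -l dgx-node-[02:41] playbooks/nvidia-dgx/nvidia-dgx-fw-update.yml
```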

Depending on the number of nodes and the required firmware updates, this process can take 1-4 hours to complete. For very large clusters it is not uncommon for some nodes to fail to update, for reasons such as timeouts or networking issues. When this occurs, manually inspect the logs, run `nvsm show health`, and if the node is healthy re-run the playbook on it. If failures persist, contact NVIDIA support and attempt the upgrade manually.
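A sketch of the recovery flow for a node that failed to update (the hostname and update playbook name are hypothetical):

```sh
# Check node health before retrying; -b runs the command with become/root
ansible dgx-node-07 -ba "nvsm show health"

# If healthy, re-run the update playbook limited to the failed node
ansible-playbook -l dgx-node-07 playbooks/nvidia-dgx/nvidia-dgx-fw-update.yml
```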

Also refer the customer to the FWUC release notes for a table of expected time to complete the update.


```sh
# Reset the BMC on all nodes after a BMC firmware update
ansible slurm-node -ba "ipmitool mc reset cold"
```

This should be done automatically by the firmware update, should not be needed on DGX A100? But we should verify for DGX-1 and DGX-2.

Contributor

@ajdecon ajdecon left a comment


Approving to merge and unblock dependencies, will add some extra changes in a later PR

@ajdecon ajdecon merged commit e5aab2b into NVIDIA:master Nov 30, 2021
