124 changes: 119 additions & 5 deletions docs/slurm-cluster/README.md
@@ -58,12 +58,126 @@ Instructions for deploying a GPU cluster with Slurm
ansible-playbook -l slurm-cluster playbooks/slurm-cluster.yml
```

6. Verify Pyxis and Enroot can run GPU jobs across all nodes.
## Slurm Validation
A Slurm validation playbook is provided. Please refer to
"[slurm-validation.yml](../../playbooks/slurm-cluster/slurm-validation.yml)".

The validation playbook will verify that Pyxis and Enroot can run GPU jobs
across all nodes by running NCCL tests. The playbook has the following
default parameters, which can be overridden:
```
# String; Container for nccl performance/validation tests. Either a Docker
# tag or a path to a sqsh file.
base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"

# String; Container to be created, or an existing one with the NCCL tests.
# If `compile_nccl_tests` is True, it must be a sqsh file.
# If `compile_nccl_tests` is False, it can be a Docker tag or a sqsh file.
nccl_tests_container: "${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"

# Bool; Compile and add NCCL tests to the base_container, outputting to
# nccl_tests_container (deletes/overwrites one that already exists). If
# false, assumes nccl_tests_container already has the NCCL tests and uses
# it as-is.
compile_nccl_tests: True

# String; NCCL allreduce test command.
allreduce_command: "all_reduce_perf -b 1M -e 4G -f 2 -g 1"

# Int; Number of GPUs per node. DGX-1 and DGX A100 Server have 8 GPUs.
# DGX-2 has 16 GPUs.
num_gpus: 8

# String; Slurm partition to use
partition: batch

# Time string; Time limit for the Slurm job.
timelimit: "10:00"

# String; Exports for srun command.
srun_exports: NCCL_DEBUG=INFO

# String; Custom srun options.
srun_options:

# Int or empty; Number of nodes. If empty uses all idle nodes on the partition.
num_nodes:

# Bool; Delete the `nccl_tests_container` after running the playbook, only
# if `compile_nccl_tests` is true as well.
cleanup: False
```

The playbook vars control options for compiling the NCCL tests. If
`compile_nccl_tests` is True (the default), a new enroot container is built
with the NCCL tests. The `base_container` must already have the NCCL library
and MPI installed. The enroot container is saved to the path set by the
`nccl_tests_container` var (which must be a path to a sqsh file).

If you have already compiled the NCCL tests into a container, set
`compile_nccl_tests` to false and point `nccl_tests_container` at that
container (either a remote Docker tag or a local sqsh file).

By default the playbook runs a multinode NCCL allreduce test on all idle
nodes in the batch partition. It is possible to override `num_nodes` and run
on fewer or more nodes (non-idle nodes can be included, but the srun command
will wait in the queue until those nodes become available). The variables are
used to formulate the NCCL srun command:
```sh
srun --export={{ srun_exports }} \
-p {{ partition }} \
--time {{ timelimit }} \
-N {{ num_nodes }} \
--ntasks-per-node={{ num_gpus }} \
--gpus-per-task=1 \
--exclusive \
--mpi=pmi2 \
--no-container-remap-root \
--container-image="{{ nccl_tests_container }}" \
{{ srun_options }} \
{{ allreduce_command }}
```
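As an illustration, the following sketch renders that template with the default values listed above. The `num_nodes=2` value is an assumption standing in for the detected idle-node count; the echoed string is illustrative, not output from the playbook:

```shell
# Render the srun command from the default variable values (a sketch;
# num_nodes=2 is an assumed value, not one detected from a real cluster).
srun_exports="NCCL_DEBUG=INFO"
partition="batch"
timelimit="10:00"
num_nodes=2
num_gpus=8
nccl_tests_container="${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"
allreduce_command="all_reduce_perf -b 1M -e 4G -f 2 -g 1"

cmd="srun --export=${srun_exports} -p ${partition} --time ${timelimit} \
-N ${num_nodes} --ntasks-per-node=${num_gpus} --gpus-per-task=1 \
--exclusive --mpi=pmi2 --no-container-remap-root \
--container-image=${nccl_tests_container} ${allreduce_command}"
echo "$cmd"
```

The backslash-newline continuations inside the double quotes collapse the command onto one line, matching what the playbook passes to the shell.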

Please refer to the following examples and adapt them to your environment.

NOTE: This will use Pyxis to download a container.

1. Example to run on all idle nodes with default behavior.
```sh
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml
```
This will create a container "`${HOME}/enroot_images/nccl_tests_slurm_val.sqsh`",
which has to be deleted manually later if desired.
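A sketch of that manual cleanup, using the same default path:

```shell
# Remove the compiled validation container created by the default run.
rm -f "${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"
```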

2. Example to run on 2 nodes with a PyTorch base container, use a custom
location for the compiled NCCL container, disable UCX and HCOLL, then clean up.
```sh
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
-e '{base_container: nvcr.io/nvidia/pytorch:21.09-py3}' \
-e '{nccl_tests_container: "${HOME}/enroot_images/nccl_tests_torch_val.sqsh"}' \
-e '{num_nodes: 2}' \
-e '{srun_exports: "NCCL_DEBUG=INFO,OMPI_MCA_pml=^ucx,OMPI_MCA_coll=^hcoll"}' \
-e '{cleanup: True}'
```
> **Reviewer note (Contributor):** the container format should be compatible with Enroot usage and anonymous access, e.g. `-e '{base_container: "nvcr.io#nvidia/pytorch:21.09-py3"}'`.

3. Example to run on 1 node using existing NCCL container from a docker repo.
```sh
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
-e '{nccl_tests_container: deepops/nccl-tests-tf20.06-ubuntu18.04:latest}' \
-e '{compile_nccl_tests: False}' \
-e '{num_nodes: 1}'
```

Pay attention to the playbook output in the terminal. The NCCL compilation and
srun commands will be printed. Pyxis and PMI are used with srun to orchestrate
containers and multinode MPI. The "Out of bounds values" and "Avg bus
bandwidth" results are printed. The "Out of bounds values" should be 0;
otherwise the test is considered a failure. The bandwidth will vary depending
on the network. The NCCL allreduce test results are written to
"`/tmp/nccl_tests.out`" after a successful playbook run. If running the NCCL
tests fails, the error output is saved to "`/tmp/nccl_tests.err`". Refer to
these files for detailed analysis.
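As a sketch, the two summary lines can be pulled out of the saved output with a simple grep. The sample excerpt below is illustrative (the values are made up, and it is written to a scratch file, not the real results file):

```shell
# Extract the pass/fail summary lines from saved NCCL test output.
nccl_summary() {
  grep -E "Out of bounds values|Avg bus bandwidth" "$1"
}

# Illustrative sample (made-up values) written to a scratch file:
sample=/tmp/nccl_sample.out
cat > "$sample" <<'EOF'
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 85.31
EOF
nccl_summary "$sample"
```

On a real cluster you would point `nccl_summary` at `/tmp/nccl_tests.out` after a validation run.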

To quickly validate basic GPU functionality with a single GPU per node:
```sh
# NOTE: This will use Pyxis to download a container and verify GPU functionality across all compute nodes
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml -e '{num_gpus: 1}'
```
## Using Slurm

Now that Slurm is installed, try a ["Hello World" example using MPI](../../workloads/examples/slurm/mpi-hello/README.md).
9 changes: 8 additions & 1 deletion docs/slurm-cluster/slurm-perf-cluster.md
@@ -149,7 +149,14 @@ High-Performance Multi-Node Cluster Deployment Guide

## Performance Validation

The `slurm-validation.yml` playbook connects to the login node and executes
the NCCL tests against all nodes and GPUs. Refer to
["Slurm Validation"](./README.md#slurm-validation) for details on running
the playbook. This checks both the correctness and the performance of the
cluster. For a full explanation of what these tests do and what the
[results mean](https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md)
see the official
[NCCL Tests documentation](https://github.com/NVIDIA/nccl-tests).

```sh
# Verify Slurm connectivity across all nodes
218 changes: 199 additions & 19 deletions playbooks/slurm-cluster/slurm-validation.yml
@@ -1,33 +1,213 @@
---
# Playbook designed to run a NCCL test across nodes in a cluster of DGXs
# Example to run on two nodes with no UCX, no HCOLL, and enp set for
# out-of-band NCCL init:
# ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
# -e '{num_nodes: 2}' \
# -e '{srun_exports: "NCCL_DEBUG=INFO,OMPI_MCA_pml=^ucx,OMPI_MCA_coll=^hcoll,NCCL_SOCKET_IFNAME=enp"}' \
# -e '{cleanup: True}'
- hosts: slurm-master[0]

vars:
# String; Container for nccl performance/validation tests. Either a Docker
# tag or a path to a sqsh file.
base_container: "nvcr.io/nvidia/tensorflow:21.09-tf2-py3"
# String; Container to be created or one that might exist with nccl tests.
# If `compile_nccl_tests` is True, it must be a sqsh file.
nccl_tests_container: "${HOME}/enroot_images/nccl_tests_slurm_val.sqsh"
# Bool; Compile and add NCCL tests to the base_container, outputting to
# nccl_tests_container (deletes/overwrites one that already exists). If
# false, assumes nccl_tests_container already has the NCCL tests and uses
# it as-is.
compile_nccl_tests: True
# String; NCCL allreduce test command.
allreduce_command: "all_reduce_perf -b 1M -e 4G -f 2 -g 1"
# Int; Number of GPUs per node. DGX-1 and DGX A100 Server have 8 GPUs.
# DGX-2 has 16 GPUs.
num_gpus: 8
# String; Slurm partition to use
partition: batch
# Time string; Time limit for the Slurm job.
timelimit: "10:00"
# String; Exports for srun command.
srun_exports: NCCL_DEBUG=INFO
# String; Custom srun options.
srun_options:
# Int or empty; Number of nodes. If empty uses all idle nodes on the partition.
num_nodes:
# Bool; Delete the `nccl_tests_container` after running the playbook, only
# if `compile_nccl_tests` is true as well.
cleanup: False

tasks:
- name: Check that NCCL sqsh base container exists when sqsh is specified.
file:
path: "{{ base_container }}"
state: file
when:
- compile_nccl_tests|bool == True
- base_container | splitext | last == ".sqsh"

- name: Check nccl_tests_container is sqsh path when compiling.
fail:
msg: >
When compiling the `nccl_tests_container` must be a path to enroot
".sqsh" file. Currently it is set to:
"{{ nccl_tests_container }}"
when:
- compile_nccl_tests|bool == True
- nccl_tests_container | splitext | last != ".sqsh"

- name: Set nccl tests compilation command
set_fact:
ncclmake: |
mkdir -p /opt/nccl_tests
cd /opt/nccl_tests
NCCL_TESTS_COMMITISH=f773748b46
wget -q -O - https://github.com/NVIDIA/nccl-tests/archive/${NCCL_TESTS_COMMITISH}.tar.gz | \
tar --strip-components=1 -xzf - \
&& CC=mpicc CXX=mpicxx make MPI=1
cp -R /opt/nccl_tests/build/* /usr/local/bin/
when: compile_nccl_tests|bool == True

- name: Print nccl make command
debug:
msg:
- |
Will compile container:
{{ nccl_tests_container }}
{% if not cleanup|bool %}Delete container manually afterwards.{% endif %}
- "NCCL compilation command:\n{{ ncclmake }}"
when: compile_nccl_tests|bool == True

- name: Check or create enroot images directory for nccl_tests_container.
file:
path: "{{ nccl_tests_container | dirname }}"
state: directory
when: compile_nccl_tests|bool == True

- name: Remove NCCL container if re-compiling.
file:
path: "{{ nccl_tests_container }}"
state: absent
when: compile_nccl_tests|bool == True

- name: Compiling NCCL tests
shell:
cmd: |
srun -p {{ partition }} -N 1 \
--ntasks-per-node=1 \
--cpus-per-task=10 \
--container-image="{{ base_container }}" \
--container-save="{{ nccl_tests_container }}" \
--container-remap-root \
{{ srun_options }} \
bash -c '{{ ncclmake }}'
creates: "{{ nccl_tests_container }}"
when: compile_nccl_tests|bool == True

- name: Check that nccl_tests_container exists when sqsh is specified.
file:
path: "{{ nccl_tests_container }}"
state: file
when:
- compile_nccl_tests|bool == False
- nccl_tests_container | splitext | last == ".sqsh"

- name: Print nccl_tests_container setting if not sqsh file.
debug:
msg: >
The nccl_tests_container is not set to sqsh file. Assuming it is a
valid docker tag. Currently it is set to:
"{{ nccl_tests_container }}"
WARNING: No validation is performed.
when:
- compile_nccl_tests|bool == False
- nccl_tests_container | splitext | last != ".sqsh"

- name: Get node count from sinfo
shell: >
sinfo -p {{ partition }} -t idle |
tail -n +2 |
awk '{sum += $4} END {print sum}'
register: node_out
when: (not num_nodes) or (num_nodes|int <= 0)

- name: Set num_nodes variable
set_fact:
num_nodes_: "{{ node_out.stdout | default(num_nodes) }}"

- name: Set nccl run command
set_fact:
ncclcmd: |
srun --export={{ srun_exports }} \
-p {{ partition }} \
--time {{ timelimit }} \
-N {{ num_nodes_ }} \
--ntasks-per-node={{ num_gpus }} \
--gpus-per-task=1 \
--exclusive \
--mpi=pmi2 \
--no-container-remap-root \
--container-image="{{ nccl_tests_container }}" \
{{ srun_options }} \
{{ allreduce_command }}

- name: Print node/gpu counts and nccl tests run command
debug:
msg:
- "Detected {{ num_nodes_ }} nodes with {{ num_gpus }} gpus each."
- "Proceeding to run validation test, this may take several
minutes:\n{{ ncclcmd }}"

- block:
- name: Execute NCCL allreduce test
shell:
cmd: "{{ ncclcmd }}"
register: result
no_log: true

rescue:
- name: NCCL allreduce running error
debug:
msg: "{{ result }}"

- name: Save error results to /tmp/nccl_tests.err
local_action:
module: copy
content: "STDOUT:\n{{ result.stdout }}\n\nSTDERR:\n{{ result.stderr }}"
dest: /tmp/nccl_tests.err

- name: Fail Slurm validation playbook
fail:
msg: See debug of stderr above. Also refer to "/tmp/nccl_tests.err"

- name: Save results to /tmp/nccl_tests.out
local_action:
module: copy
content: "STDERR:\n{{ result.stderr }}\n\nSTDOUT:\n{{ result.stdout }}"
dest: /tmp/nccl_tests.out

- name: Extract "Out of bounds values" and "Avg bus bandwidth".
shell:
cmd: |
echo "{{ result.stdout }}" | \
grep "Out of bounds values\|Avg bus bandwidth"
register: result_vals
when: result.stdout != ""

- name: Print and analyze results
debug:
msg:
- "{{ result_vals.stdout }}"
- Out of bounds values should be 0 otherwise FAIL.
- Bandwidth will vary based on GPU topology and network interconnect types.
- Refer to /tmp/nccl_tests.out for detailed NCCL allreduce output.

- name: Cleanup NCCL container.
file:
path: "{{ nccl_tests_container }}"
state: absent
when:
- cleanup|bool == True
- compile_nccl_tests|bool == True