Enhanced NCCL tests Slurm validation playbook (#1042)
Conversation
Force-pushed c7e4ce8 to 86a31fd (Compare)
* Parameterize NCCL tests Slurm validation playbook.
* Add filesystem checks for sqsh Enroot container.
Force-pushed 0602763 to 3b12dff (Compare)
```diff
-# Playbook designed to run a NCCL test across all nodes in a cluster of DGX-1s
+# Playbook designed to run a NCCL test across nodes in a cluster of DGXs
 # Example to run on NSL-B with two nodes:
```
We shouldn't have references to internal clusters if it can be avoided.
I can reword it to:
# Example to run on two nodes with enp set for out-of-band NCCL init, no UCX and no HCOLL:
Or please suggest an alternative wording. This example works in clusters without RDMA (no RoCE or InfiniBand), and I want to document in the comments how to run on such clusters. It is sometimes useful to test such cases without the RDMA setup.
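A hedged sketch of what the non-RDMA setup could look like in practice (partition name and launch line are illustrative examples, not taken from the playbook):

```shell
# Illustrative sketch for a TCP-only cluster (no RoCE/InfiniBand).
export NCCL_SOCKET_IFNAME=enp   # pin NCCL bootstrap/data sockets to the enp* NICs
export NCCL_IB_DISABLE=1        # skip the InfiniBand/RoCE transport entirely
# On a real cluster the test could then be launched as, for example:
#   srun -p batch -N 2 --ntasks-per-node=1 ./all_reduce_perf -b 8 -e 128M -f 2 -g 8
```

With these variables set, NCCL falls back to plain TCP sockets, which is exactly the case worth exercising on clusters that lack the RDMA fabric.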
ajdecon left a comment
LGTM. Two minor nitpicks mentioned inline, but I don't think they're worth blocking this.
```yaml
cmd: |
  srun -p {{ partition }} -N 1 \
    --ntasks-per-node=1 \
    --cpus-per-task=10 \
```
`--cpus-per-task` should use a variable, so the playbook can run on hosts with fewer than 10 cores. 😄
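One hedged way to implement that suggestion, using Jinja2's `default` filter (the `cpus_per_task` variable name is illustrative, not from the PR):

```yaml
# Sketch: make the CPU count a playbook variable; the default keeps current behavior.
cmd: |
  srun -p {{ partition }} -N 1 \
    --ntasks-per-node=1 \
    --cpus-per-task={{ cpus_per_task | default(10) }} \
```

A host with fewer cores could then override it at run time, e.g. `-e cpus_per_task=4`.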
for compiled nccl container, disable UCX and HCOLL, then cleanup.
```sh
ansible-playbook -l slurm-cluster playbooks/slurm-cluster/slurm-validation.yml \
  -e '{base_container: nvcr.io/nvidia/pytorch:21.09-py3}' \
```
Container format should be compatible with Enroot usage and anonymous access, e.g.
```sh
-e '{base_container: "nvcr.io#nvidia/pytorch:21.09-py3"}'
```
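To illustrate why the `#` form matters, here is a hedged sketch assuming the playbook passes `base_container` straight into an `enroot import` URI (the variable name is from the PR; the surrounding script is illustrative):

```shell
# Enroot import URIs have the form docker://[USER@][REGISTRY#]IMAGE[:TAG].
# The '#' separates the registry host from the image path, since '/' also
# appears inside image names like nvidia/pytorch.
base_container="nvcr.io#nvidia/pytorch:21.09-py3"
uri="docker://${base_container}"
echo "$uri"
# On a node with Enroot installed, the anonymous import would then be:
#   enroot import "$uri"
```

A plain `/`-separated reference would leave Enroot guessing where the registry ends and the image name begins.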
Adds the ability to specify any base container and dynamically compile a new Enroot container with the NCCL tests. The playbook then runs the NCCL tests using the compiled Enroot container.
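The flow described above could be sketched as the following command sequence. This is an assumption-laden illustration, not the playbook's actual tasks: it assumes the pyxis Slurm container plugin (`--container-image` / `--container-save`), and the image names, paths, and node count are made up. The commands are only printed here, since a real run needs a Slurm cluster.

```shell
# 1. Import the base container into a squashfs image (illustrative names).
import_cmd='enroot import -o base.sqsh docker://nvcr.io#nvidia/pytorch:21.09-py3'
# 2. Build nccl-tests inside the container on one node, saving the result.
build_cmd='srun -N 1 --container-image=./base.sqsh --container-save=./nccl.sqsh bash -c "git clone https://github.com/NVIDIA/nccl-tests && make -C nccl-tests"'
# 3. Run the NCCL test across the nodes with the compiled container.
run_cmd='srun -N 2 --ntasks-per-node=1 --container-image=./nccl.sqsh ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2'
printf '%s\n' "$import_cmd" "$build_cmd" "$run_cmd"
```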