README.md

# DeepOps

Infrastructure automation tools for Kubernetes and Slurm clusters with NVIDIA GPUs.

## Table of Contents

- [DeepOps](#deepops)
  - [Table of Contents](#table-of-contents)
  - [Overview](#overview)
  - [Release Notes](#release-notes)
  - [Deployment Requirements](#deployment-requirements)
    - [Provisioning System](#provisioning-system)
    - [Cluster System](#cluster-system)
    - [Kubernetes](#kubernetes)
    - [Slurm](#slurm)
    - [Hybrid clusters](#hybrid-clusters)
    - [Virtual](#virtual)
  - [Updating DeepOps](#updating-deepops)
  - [Copyright and License](#copyright-and-license)
  - [Issues](#issues)
  - [Contributing](#contributing)

## Overview

The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as [NVIDIA DGX Systems](https://www.nvidia.com/en-us/data-center/dgx-systems/)). DeepOps may also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

- An on-prem data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
- An existing cluster running Kubernetes where DeepOps scripts are used to deploy Kubeflow and connect NFS storage
- An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm or Kubernetes
- A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Check out the [video tutorial](https://drive.google.com/file/d/1RNLQYlgJqE8JMv0np8SdEDqeCN2piavF/view) for how to use DeepOps to deploy Kubernetes and Kubeflow on a single DGX Station. This provides a good base test ground for larger deployments.

## Release Notes

Latest release: [DeepOps 22.04 Release](https://github.com/NVIDIA/deepops/releases/tag/22.04)

- Kubernetes Default Components:
  - [kubernetes](https://github.com/kubernetes/kubernetes) v1.22.8
  - [etcd](https://github.com/coreos/etcd) v3.5.0
  - [docker](https://www.docker.com/) v20.10
  - [containerd](https://containerd.io/) v1.5.8
  - [cri-o](http://cri-o.io/) v1.22
  - [calico](https://github.com/projectcalico/calico) v3.20.3
  - [dashboard](https://github.com/kubernetes/dashboard/tree/master) v2.0.3
  - [dashboard metrics scraper](https://github.com/kubernetes-sigs/dashboard-metrics-scraper/tree/master) v1.0.4
  - [nvidia gpu operator](https://github.com/NVIDIA/gpu-operator/tree/master) 1.10.0

- Slurm Default Components:
  - [slurm](https://github.com/SchedMD/slurm/tree/master) 21.08.8-2
  - [Singularity](https://github.com/apptainer/singularity/tree/master) 3.7.3
  - [docker](https://www.docker.com/) v20.10

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.

## Deployment Requirements

### Provisioning System

The provisioning system is used to orchestrate the running of all playbooks, and one is needed when instantiating Kubernetes or Slurm clusters. Tested and supported operating systems include:

- NVIDIA DGX OS 4, 5
- Ubuntu 18.04 LTS, 20.04 LTS
- CentOS 7, 8

### Cluster System

The cluster nodes will follow the requirements described by Slurm or Kubernetes. You may also use a cluster node as the provisioning system, but this is not required. Tested and supported operating systems include:

- NVIDIA DGX OS 4, 5
- Ubuntu 18.04 LTS, 20.04 LTS
- CentOS 7, 8

You may also install a supported operating system on all servers via a third-party solution (e.g., [MAAS](https://maas.io/), [Foreman](https://www.theforeman.org/)) or use the provided [OS install container](docs/pxe/minimal-pxe-container.md).
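
For example, DeepOps describes the cluster to Ansible with an inventory that groups nodes by role. The sketch below is illustrative only; the hostnames are placeholders, and the authoritative group names come from the `config.example/inventory` template in the repository:

```ini
; Illustrative inventory sketch -- hostnames are placeholders,
; and real group names come from config.example/inventory.
[all]
mgmt01  ansible_host=10.0.0.1
gpu01   ansible_host=10.0.0.11
gpu02   ansible_host=10.0.0.12

[slurm-master]
mgmt01

[slurm-node]
gpu01
gpu02
```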

### Kubernetes

Kubernetes (K8s) is an open-source system for automating the deployment, scaling, and management of containerized applications. DeepOps instantiates Kubernetes clusters using [Kubespray](submodules/kubespray), which runs on bare metal and most clouds, using Ansible for provisioning and orchestration. Kubespray is a good choice if you are familiar with Ansible, have an existing Ansible deployment, or want to run a Kubernetes cluster across multiple platforms. It performs the generic configuration management tasks of the "OS operators" Ansible world, plus the initial K8s clustering (with networking plugins included) and control-plane bootstrapping. DeepOps provides additional playbooks for orchestration and optimization of GPU environments.

Consult the [DeepOps Kubernetes Deployment Guide](docs/k8s-cluster/) for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.

For more information on Kubernetes in general, refer to the [official Kubernetes docs](https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/).
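
In outline, the deployment reduces to a single playbook run from the provisioning node. This sketch assumes the layout of current DeepOps releases (`scripts/setup.sh`, `config/inventory`, `playbooks/k8s-cluster.yml`); consult the deployment guide for the authoritative steps:

```shell
# Run from the root of the deepops repository on the provisioning node.

# Install Ansible and other dependencies, and create config/ from config.example
./scripts/setup.sh

# After editing config/inventory with your node addresses,
# verify that Ansible can reach every node
ansible all -m ping

# Deploy the GPU-enabled Kubernetes cluster
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml

# Confirm that the nodes have joined the cluster
kubectl get nodes
```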

### Slurm

Slurm is an open-source cluster resource management and job scheduling system that strives to be simple, scalable, portable, fault-tolerant, and interconnect agnostic. Slurm has currently been tested only under Linux.

As a cluster resource manager, Slurm provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work. Slurm itself is distributed by [SchedMD](https://slurm.schedmd.com/download.html).

Consult the [DeepOps Slurm Deployment Guide](docs/slurm-cluster/) for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the [official Slurm docs](https://slurm.schedmd.com/overview.html).
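
These three functions correspond to the everyday Slurm commands `srun`, `sbatch`, and `squeue`. A minimal illustration on a GPU cluster (assuming GPU generic resources, GRES, are configured, as a DeepOps Slurm deployment does) might be:

```shell
# Allocate one GPU on one node and run a command on it interactively
srun --nodes=1 --gres=gpu:1 nvidia-smi

# Submit the same work as a batch job, to be started when resources free up
sbatch --nodes=1 --gres=gpu:1 --wrap="nvidia-smi"

# Inspect the queue of pending and running work
squeue
```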

### Hybrid clusters

**DeepOps does not test or support a configuration where both Kubernetes and Slurm are deployed on the same physical cluster.**

[NVIDIA Bright Cluster Manager](https://www.brightcomputing.com/brightclustermanager) is recommended as an enterprise solution for managing multiple workload managers within a single cluster, including Kubernetes, Slurm, Univa Grid Engine, and PBS Pro.

**DeepOps does not test or support configurations where cluster nodes run heterogeneous operating systems.**
Additional modifications are needed if you plan to use unsupported operating systems such as RHEL.

### Virtual

docs/README.md

# Getting Started

- [Getting Started](#getting-started)
- [Requirements](#requirements)
- [Steps](#steps)
- [Configuration](#configuration)
- [Modularity](#modularity)
- [Scripts](#scripts)
- [Examples](#examples)
- [Docs](#docs)

## Requirements

- A pre-existing "provisioning" node which can be used to run Ansible and the install scripts
- A cluster to deploy to (one or more physical servers, or a [virtual cluster](/virtual/README.md))

## Steps

1. Pick a provisioning node to deploy from. This is where the Ansible scripts should be run from and is often a development laptop that has a connection to the target cluster. On this provisioning node, clone the DeepOps repository:

```bash
git clone https://github.com/NVIDIA/deepops.git
```

2. Checkout a recent release tag. This is an optional step, but if not done, the latest development code will be used, not an official release.

```bash
cd deepops
git checkout tags/22.07
```

3. Pick one of the deployment options described in the main [README](/README.md), following the installation instructions given there.
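
Whichever option you pick, the deployment guides generally start from the same provisioning-node setup. A rough sketch, with script and file names per the DeepOps repository layout (which may differ between releases):

```bash
# From the root of the cloned deepops repository:
# install Ansible and dependencies, and generate the config/ directory
./scripts/setup.sh

# Describe your cluster in config/inventory, then check connectivity
ansible all -m ping
```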
docs/airgap/README.md

# Air-Gap Support

Documentation for setting up clusters in air-gapped environments.

- [Air-Gap Support](#air-gap-support)
  - [Introduction](#introduction)
  - [Setting up mirrors](#setting-up-mirrors)
  - [Using mirrors to deploy offline](#using-mirrors-to-deploy-offline)
  - [Dependency documentation](#dependency-documentation)

## Introduction

DeepOps supports a number of configuration values for specifying alternate sources and URLs for downloading software. These configuration values can be used to run DeepOps playbooks in environments without an Internet connection, assuming that the environment has an alternative mirror available to supply this software. We currently don't supply our own automation to set up offline mirrors, but we do provide some basic documentation to illustrate how to set these mirrors up and use them.
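
For example, for the Kubernetes components deployed through Kubespray, offline sources are selected with Ansible variables like those below. The variable names follow Kubespray's offline-environment support (see "Dependency documentation" in this page), and the hosts are placeholders for your local mirrors:

```yaml
# Placeholder hosts -- substitute your local mirror endpoints.
registry_host: "myregistry.example.com:5000"
files_repo: "https://myserver.example.com/repository/files"

# Pull container images from the local registry rather than the Internet
kube_image_repo: "{{ registry_host }}"
gcr_image_repo: "{{ registry_host }}"
docker_image_repo: "{{ registry_host }}"
quay_image_repo: "{{ registry_host }}"
```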

## Setting up mirrors

- [Setting up offline mirrors for APT repositories](mirror-apt-repos.md)
- [Setting up offline mirrors for RPM repositories](mirror-rpm-repos.md)
- [Setting up an offline mirror for Docker container images](mirror-docker-images.md)
- [Setting up an offline mirror for HTTP downloads](mirror-http-files.md)

## Using mirrors to deploy offline

- [Deploying the NGC-Ready playbook offline](ngc-ready.md)
- Deploying a Kubernetes cluster offline (TODO)
- Deploying a Slurm cluster offline (TODO)

## Dependency documentation

- [Deploying Kubespray in an offline environment](https://github.com/kubernetes-sigs/kubespray/blob/master/docs/offline-environment.md)