This document covers the steps required to run a non-sliced nVidia GPU on a kubernetes cluster built with kubeadm and containerd on a RHEL or RHEL-clone system. Most of the steps here are only the differences from my previous Kubernetes on Linux with Kubeadm documentation, so please read through that first.
This is not a guide on how to create a production-ready or hardened environment.
This guide targets kubernetes v1.20 - v1.29.
If you are attempting to use this guide with another kubernetes version, be aware that kubernetes changes quickly and parts of this guide may be out-of-date or incorrect. You have been warned.
This guide assumes that you already have a kubernetes node configured and running with the nVidia drivers installed. Please take a look at the instructions at RPM Fusion for more detail on how to configure the system for the nVidia kernel module.
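Before touching kubernetes at all, it is worth confirming the driver stack works on its own. A quick sanity check (nvidia-smi ships with the driver; exact module names can vary slightly by driver packaging):

```shell
# Confirm the nVidia kernel module is loaded and the driver can reach the GPU.
lsmod | grep '^nvidia'   # kernel module(s) loaded?
nvidia-smi               # should print the driver/GPU status table
```

If nvidia-smi errors out here, fix the driver install before continuing; nothing below will work without it.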
Configuring Our Host
nVidia container toolkit
We will need to add the nVidia container toolkit. The easiest way to do this is to add the nVidia container toolkit rpm repository to our system, which makes future updates and patching easier. nVidia supplies instructions for this here, but to preserve them, I'm also including them here:
cat <<EOF | sudo tee /etc/yum.repos.d/nvidia.repo
[nvidia-container-toolkit]
name=nvidia-container-toolkit
baseurl=https://nvidia.github.io/libnvidia-container/stable/rpm/\$basearch
repo_gpgcheck=1
gpgcheck=0
enabled=1
gpgkey=https://nvidia.github.io/libnvidia-container/gpgkey
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt
EOF
Then we need to install the toolkit:
sudo dnf install -y nvidia-container-toolkit
In order to get containerd to work with the nVidia runtime, we need to tell containerd to use the nVidia container toolkit we just installed. Official instructions are provided here. These instructions have you run a command that modifies containerd's configuration, then restart the containerd service:
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd.service
The command makes the following modifications to /etc/containerd/config.toml (trimmed here to just the relevant settings):
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    privileged_without_host_devices = false
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
      SystemdCgroup = true
This basically sets containerd's default runtime to the nvidia-container-runtime and enables the systemd cgroup driver for it. Both of these settings are critical, so make sure you don't miss either one.
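To verify the change took effect, you can look for the new runtime block in containerd's config (assuming the default path of /etc/containerd/config.toml):

```shell
# Confirm the nvidia runtime block was written to containerd's configuration.
sudo grep -A 3 'runtimes.nvidia' /etc/containerd/config.toml
```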
The last step is to install the nVidia device plugin daemonset. This daemonset runs a container on each node and automatically updates the node's reported capacity to include gpu resources (nvidia.com/gpu) if a gpu is detected. This is documented here.
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
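Once the daemonset pods are running, the gpu resource should show up on your gpu nodes. One simple way to check (this just greps the node descriptions; the resource should appear under both Capacity and Allocatable):

```shell
# The nvidia.com/gpu resource should be listed for every node with a detected GPU.
kubectl describe nodes | grep -i 'nvidia.com/gpu'
```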
GPU Consumer Deployment
Deployment is really just a matter of telling kubernetes that the deployment will consume a nvidia.com/gpu resource; the kubernetes scheduler will then assign the pod to a node where an unused gpu is available. In the container spec, that looks like:
- name: my-gpu-consumer
  resources:
    limits:
      nvidia.com/gpu: 1
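Putting it all together, here is a minimal sketch of a complete test pod that requests one gpu. The pod name, file name, and image tag are examples of my own choosing, not requirements; any CUDA-capable image will do:

```shell
# Write out a minimal test pod manifest that requests a single GPU.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-consumer
spec:
  restartPolicy: Never
  containers:
    - name: my-gpu-consumer
      image: nvidia/cuda:12.3.1-base-ubi8   # example image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # ask the scheduler for one whole, non-sliced GPU
EOF
# Apply it and check the output once the pod has run:
#   kubectl apply -f gpu-test-pod.yaml
#   kubectl logs my-gpu-consumer
```

If everything above is wired up correctly, the pod logs should contain the same nvidia-smi table you saw on the host.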