Accelerating AI workloads with GPUs in Kubernetes
A talk by Kevin and Sanjay from NVIDIA. Watch the talk on YouTube.
Enabling GPUs in Kubernetes today
- Host-level components: the GPU drivers and the NVIDIA Container Toolkit
- Kubernetes components: the device plugin, GPU feature discovery, and node selectors (a minimal pod request is sketched after this list)
- NVIDIA humbly brings you the GPU Operator, which installs and manages all of these components for you
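A minimal sketch of what the device-plugin path looks like from the user side, using the official Kubernetes Python client: the pod asks for the extended resource `nvidia.com/gpu` and optionally pins itself to nodes labeled by GPU feature discovery. Pod name and image tag are illustrative, not from the talk.

```python
# Request one GPU via the device plugin's extended resource.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="cuda-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                image="nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",  # illustrative
                command=["nvidia-smi"],
                # The device plugin advertises GPUs as "nvidia.com/gpu";
                # requesting it in limits is what steers the scheduler
                # toward a GPU node.
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"}
                ),
            )
        ],
        # Optional: only consider nodes labeled by GPU feature discovery.
        node_selector={"nvidia.com/gpu.present": "true"},
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```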
GPU sharing
- Time-slicing: processes take turns on the GPU in time slices
- Multi-Process Service (MPS): processes always run on the GPU and share it spatially
- Multi-Instance GPU (MIG): space-separated sharing partitioned in hardware
- Virtual GPU (vGPU): virtualizes time-slicing or MIG, e.g. for virtual machines
- CUDA streams: run multiple kernels concurrently within a single application (sketched below)
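To make the last option concrete, here is a small sketch of CUDA streams: two kernels launched from one process that can overlap on the same GPU. CuPy is used purely for illustration (it is not mentioned in the talk) and a CUDA-capable GPU is assumed.

```python
import cupy as cp

a = cp.random.rand(4096, 4096, dtype=cp.float32)
b = cp.random.rand(4096, 4096, dtype=cp.float32)

s1 = cp.cuda.Stream(non_blocking=True)
s2 = cp.cuda.Stream(non_blocking=True)

with s1:
    c = a @ a  # matmul kernel enqueued on stream 1
with s2:
    d = b @ b  # enqueued on stream 2, may overlap with stream 1

s1.synchronize()
s2.synchronize()
```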
Dynamic resource allocation
- An alpha feature since Kubernetes 1.26 for requesting resources dynamically
- You just request a resource via the API (see the sketch below) and use it
- How the sharing actually happens is an implementation detail of the resource driver
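A hedged sketch of what the DRA request path looks like. The `resource.k8s.io` API group is alpha and its version and field names have changed between releases, and the class name `gpu.example.com` is hypothetical, so treat the exact shapes as illustrative; plain dicts are used so no generated model classes are needed.

```python
from kubernetes import client, config

config.load_kube_config()

# 1. A ResourceClaim referencing a class served by a GPU resource driver.
claim = {
    "apiVersion": "resource.k8s.io/v1alpha2",
    "kind": "ResourceClaim",
    "metadata": {"name": "single-gpu"},
    "spec": {"resourceClassName": "gpu.example.com"},  # hypothetical class
}
client.CustomObjectsApi().create_namespaced_custom_object(
    group="resource.k8s.io", version="v1alpha2",
    namespace="default", plural="resourceclaims", body=claim,
)

# 2. A pod that consumes the claim; the driver decides how the GPU is
#    actually provided (whole device, MIG slice, time slice, ...).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "dra-consumer"},
    "spec": {
        "restartPolicy": "Never",
        "resourceClaims": [
            {"name": "gpu", "source": {"resourceClaimName": "single-gpu"}}
        ],
        "containers": [{
            "name": "app",
            "image": "nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04",
            "command": ["nvidia-smi"],
            "resources": {"claims": [{"name": "gpu"}]},
        }],
    },
}
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```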
GPU scale-out challenges
- NVIDIA Picasso is a foundry for model creation powered by Kubernetes
- The workload is training, split into batches
- Challenge: schedule multiple prioritized training jobs from different users (one standard building block is sketched below)
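One standard Kubernetes building block for this is a PriorityClass attached to each training Job, sketched below. The class name, job name, image, and priority value are made up; the talk's actual scheduling stack is not shown here.

```python
from kubernetes import client, config

config.load_kube_config()

# A priority class for high-priority training jobs.
client.SchedulingV1Api().create_priority_class(
    body=client.V1PriorityClass(
        metadata=client.V1ObjectMeta(name="training-high"),
        value=100000,  # higher values win when jobs compete for GPUs
        preemption_policy="PreemptLowerPriority",
        description="High-priority training jobs",
    )
)

# A training Job from one user that carries that priority.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="train-user-a"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                priority_class_name="training-high",
                containers=[client.V1Container(
                    name="train",
                    image="nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative
                    command=["python", "train.py"],
                    resources=client.V1ResourceRequirements(
                        limits={"nvidia.com/gpu": "8"}  # one full node
                    ),
                )],
            )
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```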
Topology aware placements
- You need thousands of GPUs; a typical node has 8 GPUs with fast NVLink communication, and anything beyond a node goes through network switching
- Goal: place related jobs based on GPU node distance and NUMA placement (see the affinity sketch below)
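A hedged sketch of one way to express topology awareness with stock Kubernetes: pod affinity on a topology label so a job's workers land in the same switch domain. The label key `example.com/nvswitch-domain` is hypothetical; a real cluster would use whatever topology labels its provisioning stack applies.

```python
from kubernetes import client

def colocated_pod_spec(job_name: str) -> client.V1PodSpec:
    """Pod spec for one worker of `job_name`, co-located with its peers."""
    return client.V1PodSpec(
        containers=[client.V1Container(
            name="worker",
            image="nvcr.io/nvidia/pytorch:24.01-py3",  # illustrative
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "8"}
            ),
        )],
        affinity=client.V1Affinity(
            pod_affinity=client.V1PodAffinity(
                # Require all workers of this job to share one topology
                # domain so cross-node traffic stays on fast links.
                required_during_scheduling_ignored_during_execution=[
                    client.V1PodAffinityTerm(
                        topology_key="example.com/nvswitch-domain",
                        label_selector=client.V1LabelSelector(
                            match_labels={"job-name": job_name}
                        ),
                    )
                ]
            )
        ),
    )
```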
Fault tolerance and resiliency
- Stuff can break, resulting in slowdowns or errors
- Challenge: Detect faults and handle them
- In-band and out-of-band observability that exposes faults as node conditions in Kubernetes (a consumer of such conditions is sketched after this list)
- Needed: Automated fault-tolerant scheduling
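A sketch of consuming such node conditions from the Kubernetes side: read the conditions on GPU nodes and cordon any node reporting a GPU problem. The condition type `GpuUnhealthy` is hypothetical; in practice a health-check agent (e.g. something like Node Problem Detector fed by GPU health checks) would set it.

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node(label_selector="nvidia.com/gpu.present=true").items:
    unhealthy = any(
        c.type == "GpuUnhealthy" and c.status == "True"
        for c in (node.status.conditions or [])
    )
    if unhealthy and not node.spec.unschedulable:
        # Cordon the node so no new training pods land on it; draining and
        # rescheduling of running work would be handled separately.
        v1.patch_node(node.metadata.name, {"spec": {"unschedulable": True}})
        print(f"cordoned {node.metadata.name}")
```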
Multidimensional optimization
- There are different KPIs: starvation, priority, occupancy, fairness
- Challenge: deciding how to trade these off (a multidimensional decision problem)
- Needed: a scheduler that can balance all of these dimensions (a toy scoring sketch follows)
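To make the trade-off concrete, here is a toy scoring sketch that mixes priority, waiting time (anti-starvation), expected GPU occupancy, and per-user fairness into one number. The weights and formulas are invented for illustration; this is not the scheduler discussed in the talk.

```python
from dataclasses import dataclass

@dataclass
class PendingJob:
    user: str
    priority: float       # normalized 0..1
    wait_minutes: float
    gpu_occupancy: float  # expected utilization of the requested GPUs, 0..1

def score(job: PendingJob, user_running_share: dict[str, float],
          weights=(0.4, 0.3, 0.2, 0.1)) -> float:
    """Weighted mix of priority, starvation, occupancy, and fairness."""
    w_prio, w_wait, w_occ, w_fair = weights
    starvation = min(job.wait_minutes / 60.0, 1.0)        # saturates after 1h
    fairness = 1.0 - user_running_share.get(job.user, 0.0)  # favor light users
    return (w_prio * job.priority + w_wait * starvation
            + w_occ * job.gpu_occupancy + w_fair * fairness)

# Admit the highest-scoring pending job next.
pending = [PendingJob("alice", 0.9, 10, 0.8), PendingJob("bob", 0.5, 90, 0.95)]
running_share = {"alice": 0.7, "bob": 0.1}
next_job = max(pending, key=lambda j: score(j, running_share))
print(next_job.user)
```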