Accelerating AI Workloads with GPUs in Kubernetes

Watch the talk on YouTube

Kevin and Sanjay from NVIDIA

Enabling GPUs in Kubernetes today

  • Host-level components: NVIDIA Container Toolkit and the GPU drivers
  • Kubernetes components: device plugin, GPU feature discovery, node selectors (sketch after this list)
  • NVIDIA humbly brings you the GPU Operator, which installs and manages all of the above
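
A minimal sketch of what this looks like from the workload side, assuming the device plugin and GPU feature discovery are already installed (GPU feature discovery sets the nvidia.com/gpu.present label), using the official Kubernetes Python client; the pod name and image tag are illustrative:

    from kubernetes import client, config

    config.load_kube_config()

    # Pod requesting one GPU: the device plugin advertises the
    # nvidia.com/gpu resource, and the node selector targets nodes
    # labeled by GPU feature discovery.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "cuda-smoke-test"},
        "spec": {
            "restartPolicy": "Never",
            "nodeSelector": {"nvidia.com/gpu.present": "true"},
            "containers": [{
                "name": "cuda",
                "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
                "command": ["nvidia-smi"],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }],
        },
    }

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)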

GPU sharing

  • Time-slicing: workloads take turns on the GPU by switching over time (config sketch after this list)
  • Multi-Process Service (MPS): processes run on the GPU concurrently, sharing it spatially
  • Multi-Instance GPU (MIG): space-separated sharing partitioned in hardware
  • Virtual GPU (vGPU): virtualizes time-slicing or MIG, mainly for VMs
  • CUDA streams: run multiple kernels concurrently within a single application
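
As a concrete example, time-slicing is enabled through the device plugin's sharing configuration. A hedged sketch: the timeSlicing payload below mirrors the documented device-plugin config format, while the ConfigMap name, data key, and namespace are placeholders that depend on how your GPU Operator deployment is wired up:

    import yaml
    from kubernetes import client, config

    config.load_kube_config()

    # Device-plugin sharing config: advertise each physical GPU as
    # four time-sliced nvidia.com/gpu replicas.
    time_slicing_config = yaml.dump({
        "version": "v1",
        "sharing": {
            "timeSlicing": {
                "resources": [{"name": "nvidia.com/gpu", "replicas": 4}],
            },
        },
    })

    configmap = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "time-slicing-config"},
        # "any" is the config name the device plugin is pointed at;
        # a placeholder, adjust to your deployment.
        "data": {"any": time_slicing_config},
    }

    client.CoreV1Api().create_namespaced_config_map(
        namespace="gpu-operator", body=configmap)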

Dynamic resource allocation

  • An alpha feature since Kubernetes 1.26 for requesting resources dynamically
  • You simply request a resource via the API and have fun (sketch after this list)
  • The sharing itself is an implementation detail
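
A sketch of the DRA flow, hedged heavily: the API group was resource.k8s.io/v1alpha1 in Kubernetes 1.26 and has changed in later releases, and "gpu.nvidia.com" stands in for whatever ResourceClass the vendor's DRA driver actually provides:

    from kubernetes import client, config

    config.load_kube_config()

    # A ResourceClaim asking the vendor's ResourceClass for a GPU.
    claim = {
        "apiVersion": "resource.k8s.io/v1alpha1",
        "kind": "ResourceClaim",
        "metadata": {"name": "single-gpu"},
        "spec": {"resourceClassName": "gpu.nvidia.com"},  # assumed class name
    }
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="resource.k8s.io", version="v1alpha1",
        namespace="default", plural="resourceclaims", body=claim)

    # The pod references the claim instead of a counted device resource;
    # how the GPU is shared or partitioned is left to the DRA driver.
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "dra-example"},
        "spec": {
            "restartPolicy": "Never",
            "resourceClaims": [
                {"name": "gpu", "source": {"resourceClaimName": "single-gpu"}},
            ],
            "containers": [{
                "name": "cuda",
                "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
                "command": ["nvidia-smi"],
                "resources": {"claims": [{"name": "gpu"}]},
            }],
        },
    }
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)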

GPU scale-out challenges

  • NVIDIA Picasso is a foundry for generative-AI model creation, powered by Kubernetes
  • The core workload is model training, split into batches
  • Challenge: schedule multiple training jobs from different users according to their priorities (sketch after this list)
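
One way to express per-user priorities with stock Kubernetes primitives is a PriorityClass per tier; a minimal sketch, where the names, the priority value, and the training image/entrypoint are all placeholders:

    from kubernetes import client, config

    config.load_kube_config()

    # One PriorityClass per tier; higher value wins and can preempt.
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "training-high"},
        "value": 100000,
        "globalDefault": False,
        "description": "High-priority training jobs",
    }
    client.SchedulingV1Api().create_priority_class(body=priority_class)

    # A training job from one user, scheduled under that priority.
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "train-user-a"},
        "spec": {
            "template": {
                "spec": {
                    "priorityClassName": "training-high",
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
                        "command": ["python", "train.py"],  # placeholder
                        "resources": {"limits": {"nvidia.com/gpu": 8}},
                    }],
                },
            },
        },
    }
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

In practice, training fleets also layer gang scheduling (e.g. Volcano or Kueue) on top, since all pods of one distributed job have to start together.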

Topology aware placements

  • You need thousands of GPUs; a typical node has 8 GPUs with fast NVLink communication, and anything beyond the node goes through network switches
  • Target: place related jobs close together, based on inter-node distance and NUMA placement (sketch after this list)
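
Vanilla Kubernetes has no notion of NVLink domains or switch hops, but the direction can be approximated with pod affinity over a topology label. A sketch; topology.example.com/rack is a hypothetical label that some topology-aware component would have to stamp onto nodes:

    # Pod-spec fragment: require all workers of one training job to land
    # in the same rack/switch domain. The label topology.example.com/rack
    # is hypothetical; "job-name" is the label batch/v1 Jobs put on pods.
    affinity_fragment = {
        "affinity": {
            "podAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {
                        "matchLabels": {"job-name": "train-user-a"},
                    },
                    "topologyKey": "topology.example.com/rack",
                }],
            },
        },
    }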

Fault tolerance and resiliency

  • Stuff can break, resulting in slowdowns or errors
  • Challenge: Detect faults and handle them
  • In-band and out-of-band observability is needed to expose faults as node conditions in Kubernetes (sketch after this list)
  • Needed: Automated fault-tolerant scheduling
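
A hedged sketch of the in-cluster half: once some health exporter surfaces GPU faults as a node condition (the condition type GpuUnhealthy below is hypothetical; DCGM- or node-problem-detector-based tooling would have to set it), a small controller can cordon affected nodes:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Cordon every node reporting the (hypothetical) GPU-health condition
    # so no new work is scheduled there while it is remediated.
    for node in v1.list_node().items:
        for cond in node.status.conditions or []:
            if cond.type == "GpuUnhealthy" and cond.status == "True":
                v1.patch_node(node.metadata.name,
                              {"spec": {"unschedulable": True}})
                print(f"cordoned {node.metadata.name}: {cond.message}")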

Multidimensional optimization

  • There are different KPIs: starvation, priority, occupancy, fairness
  • Challenge: how to weigh these against each other (a multidimensional decision problem)
  • Needed: a scheduler that can balance all of these dimensions (scoring sketch below)
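
A purely illustrative sketch of such balancing: collapse the KPIs into one weighted score per queued job and pick the best. All names, weights, and normalizations here are assumptions, not the speakers' method; real schedulers implement this as pluggable scoring logic:

    from dataclasses import dataclass

    @dataclass
    class JobState:
        priority: float      # user/tier priority, normalized to [0, 1]
        wait_seconds: float  # time spent queued (starvation proxy)
        gpus_requested: int  # occupancy contribution if scheduled
        user_share: float    # fraction of the cluster the user already holds

    def score(job, w_prio=0.4, w_starve=0.3, w_occ=0.2, w_fair=0.1):
        starvation = min(job.wait_seconds / 3600.0, 1.0)   # saturates at 1h
        occupancy = min(job.gpus_requested / 1024.0, 1.0)  # 1024-GPU ceiling
        fairness = 1.0 - job.user_share                    # favor small users
        return (w_prio * job.priority + w_starve * starvation
                + w_occ * occupancy + w_fair * fairness)

    queue = [JobState(0.9, 120, 256, 0.5), JobState(0.4, 5400, 64, 0.1)]
    next_job = max(queue, key=score)  # schedule the highest-scoring job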