Optimizing performance and sustainability for AI
Watch talk on YouTube

A panel discussion moderated by Google, with participants from Google, Alluxio, Ampere and CERN. It was pretty scripted, with prepared (sponsor-specific) slides for each question.
Takeaways
- Deploying an ML model should become as routine as deploying a web app
- Hardware should be fully utilized -> needs better resource sharing and scheduling
- Running smaller LLMs on CPUs only is pretty cost-efficient (see the first sketch after this list)
- Better scheduling by splitting workloads across storage + CPU (prepare) and GPU (run) nodes creates a just-in-time flow (see the second sketch after this list)
- Software acceleration is cool, but we should use more specialized hardware and models optimized to run on CPUs
- We should be flexible regarding hardware, multi-cluster workloads and hybrid (on-prem, burst to cloud) workloads
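
As a rough sketch of the CPU-only point: a small quantized model can be served without requesting any GPU, so the pod can land on cheap general-purpose nodes. This wasn't shown in the talk; the image tag, model path, and resource numbers below are assumptions for illustration.

```yaml
# Hypothetical CPU-only inference Deployment: a small quantized LLM
# served via llama.cpp. No GPU is requested, so the scheduler can
# place this on any general-purpose (e.g. Ampere) node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-llm-cpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: small-llm-cpu
  template:
    metadata:
      labels:
        app: small-llm-cpu
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggerganov/llama.cpp:server  # assumed image tag
          args: ["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080", "-t", "8"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "8"        # thread count above roughly matched to CPU request
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-weights  # hypothetical PVC holding the weights
```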
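
And a minimal sketch of the prepare/run split: a CPU Job stages data onto shared storage, and a GPU Job consumes it afterwards, so expensive GPU nodes aren't idling through data prep. The node labels, images, and PVC name are hypothetical; note that plain Jobs don't enforce ordering by themselves, so in practice a workflow engine (or an initContainer that waits for the prepared data) would gate the GPU stage.

```yaml
# Hypothetical two-stage just-in-time flow: "prepare" runs on CPU/storage
# nodes and writes to a shared PVC; "run" is scheduled onto GPU nodes
# and reads the prepared data from the same PVC.
apiVersion: batch/v1
kind: Job
metadata:
  name: prepare-dataset
spec:
  template:
    spec:
      nodeSelector:
        workload-class: cpu              # hypothetical node label
      restartPolicy: Never
      containers:
        - name: prepare
          image: example.com/prepare:latest   # hypothetical image
          command: ["python", "prepare.py", "--out", "/data"]
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: training-data     # hypothetical shared PVC
---
apiVersion: batch/v1
kind: Job
metadata:
  name: run-training
spec:
  template:
    spec:
      nodeSelector:
        workload-class: gpu              # hypothetical node label
      restartPolicy: Never
      containers:
        - name: train
          image: example.com/train:latest     # hypothetical image
          command: ["python", "train.py", "--data", "/data"]
          resources:
            limits:
              nvidia.com/gpu: 1          # GPU requested only for this stage
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: training-data
```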