Optimizing performance and sustainability for AI
Watch talk on YouTube

A panel discussion moderated by Google, with participants from Google, Alluxio, Ampere and CERN. It was pretty scripted, with prepared (sponsor-specific) slides for each question.
Takeaways
- Deploying an ML model should become as routine as deploying a web app
- Hardware should be fully utilized -> needs better resource sharing and scheduling
- Running smaller LLMs on CPUs only is pretty cost-efficient (see the first sketch after this list)
- Better scheduling by splitting workloads across storage + CPU (prepare) and GPU (run) nodes creates a just-in-time flow (see the second sketch after this list)
- Software acceleration is cool, but we should use more specialized hardware and models optimized to run on CPUs
- We should be flexible regarding hardware, multi-cluster workloads and hybrid (on-prem, burst to cloud) workloads
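
As a rough sketch of the CPU-only point: a small quantized model can be served without requesting any GPU, so the pod can land on cheap general-purpose nodes. This wasn't shown in the talk; the image tag, model path, and resource numbers below are assumptions for illustration.

```yaml
# Hypothetical CPU-only inference Deployment: a small quantized LLM
# served via llama.cpp. No GPU is requested, so the scheduler can
# place this on any general-purpose (e.g. Ampere) node.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: small-llm-cpu
spec:
  replicas: 2
  selector:
    matchLabels:
      app: small-llm-cpu
  template:
    metadata:
      labels:
        app: small-llm-cpu
    spec:
      containers:
        - name: llama-server
          image: ghcr.io/ggerganov/llama.cpp:server  # assumed image tag
          args: ["-m", "/models/model.gguf", "--host", "0.0.0.0", "--port", "8080", "-t", "8"]
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "8"        # thread count above roughly matched to CPU request
              memory: 16Gi
            limits:
              cpu: "8"
              memory: 16Gi
          volumeMounts:
            - name: models
              mountPath: /models
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: model-weights  # hypothetical PVC holding the weights
```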
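
And a minimal sketch of the prepare/run split: a CPU Job stages data onto shared storage, and a GPU Job consumes it afterwards, so expensive GPU nodes aren't idling through data prep. The node labels, images, and PVC name are hypothetical; note that plain Jobs don't enforce ordering by themselves, so in practice a workflow engine (or an initContainer that waits for the prepared data) would gate the GPU stage.

```yaml
# Hypothetical two-stage just-in-time flow: "prepare" runs on CPU/storage
# nodes and writes to a shared PVC; "run" is scheduled onto GPU nodes
# and reads the prepared data from the same PVC.
apiVersion: batch/v1
kind: Job
metadata:
  name: prepare-dataset
spec:
  template:
    spec:
      nodeSelector:
        workload-class: cpu              # hypothetical node label
      restartPolicy: Never
      containers:
        - name: prepare
          image: example.com/prepare:latest   # hypothetical image
          command: ["python", "prepare.py", "--out", "/data"]
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: training-data     # hypothetical shared PVC
---
apiVersion: batch/v1
kind: Job
metadata:
  name: run-training
spec:
  template:
    spec:
      nodeSelector:
        workload-class: gpu              # hypothetical node label
      restartPolicy: Never
      containers:
        - name: train
          image: example.com/train:latest     # hypothetical image
          command: ["python", "train.py", "--data", "/data"]
          resources:
            limits:
              nvidia.com/gpu: 1          # GPU requested only for this stage
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: training-data
```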