The Hitchhiker's Guide to Kubernetes Platforms: Don’t Panic, Just Launch!
This talk looks at bootstrapping platforms with KServe, in the context of AI workflows.
Scenario
- Deploy AI workloads, sometimes consisting of several parts
- Models get stored in a model registry
Baseline
- Consistent APIs throughout the platform
- Not the kube API directly b/c:
- Data scientists are a bit overwhelmed by the kube API
- Not only Kubernetes (also monitoring tools, feedback tools, etc.)
- Better debugging experience for specific workloads
The debugging API
- Specific API with enhanced statuses and consistent UX across Code and UI
- Example Endpoints: Pods, Deployments, InferenceServices
- Provides a status summary -> Consistent health info across all related resources
- Example: Deployments have progress/availability, Pods have phases, Containers have readiness -> How should each of these be interpreted?
- Evaluated signals: Progressing, available count vs. readiness, ReplicaFailure, Pod phase, Container readiness
- The rules themselves may be fairly complex, but the status stays simple because users don't have to evaluate the rules themselves
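To make the idea of a status summary concrete, here is a minimal sketch of how the signals listed above could be collapsed into a single label. The class names, the labels ("Healthy", "Progressing", "Degraded", "Failed") and the rule order are assumptions for illustration; the talk did not show the actual rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PodStatus:
    phase: str  # Kubernetes pod phase, e.g. "Running", "Pending", "Failed"
    containers_ready: List[bool] = field(default_factory=list)


@dataclass
class DeploymentStatus:
    desired_replicas: int
    available_replicas: int
    progressing: bool      # Deployment condition "Progressing" is True
    replica_failure: bool  # Deployment condition "ReplicaFailure" is True
    pods: List[PodStatus] = field(default_factory=list)


def pod_is_ready(pod: PodStatus) -> bool:
    """A pod counts as ready when it is Running and every container is ready."""
    return (
        pod.phase == "Running"
        and len(pod.containers_ready) > 0
        and all(pod.containers_ready)
    )


def summarize(status: DeploymentStatus) -> str:
    """Collapse several low-level signals into one label (hypothetical rule order)."""
    # Hard failures win over everything else.
    if status.replica_failure or any(p.phase == "Failed" for p in status.pods):
        return "Failed"
    # Healthy only if availability and per-pod readiness agree.
    enough_available = status.available_replicas >= status.desired_replicas
    if enough_available and all(pod_is_ready(p) for p in status.pods):
        return "Healthy"
    # Not healthy yet, but a rollout is still making progress.
    if status.progressing:
        return "Progressing"
    return "Degraded"
```

The user only ever sees the returned label; the rule order stays an implementation detail that can grow without changing the UX.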
Debugging Metrics
- Dashboards (Utilization, throughput, latency)
- Events
- Logs
Deployment API
- Launchpad: Just select your model and version -> The DB (dock) stores all manifests (Spaceship)
- Manifests relate to models from a model registry
- Multi-tenancy is implemented using k8s namespaces
- Kine is used to replace/extend etcd with the relational dock DB -> The namespace<->manifest relation is stored here and RBAC can be applied to it
- Launchpad: Select Namespace and check resource (fuel) availability/utilization
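The namespace<->manifest relation can be pictured with a small relational sketch. This is not Kine's schema and not the speakers' dock DB; the table names, columns and helper functions are hypothetical and only show how a relational store can tie manifests (which reference a model and version) to a tenant namespace.

```python
import sqlite3
from typing import List, Tuple


def open_dock() -> sqlite3.Connection:
    """Create an in-memory stand-in for the dock DB (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce the namespace relation
    conn.executescript(
        """
        CREATE TABLE namespaces (name TEXT PRIMARY KEY);
        CREATE TABLE manifests (
            id        INTEGER PRIMARY KEY,
            namespace TEXT NOT NULL REFERENCES namespaces(name),
            model     TEXT NOT NULL,
            version   TEXT NOT NULL,
            body      TEXT NOT NULL
        );
        """
    )
    return conn


def add_namespace(conn: sqlite3.Connection, name: str) -> None:
    conn.execute("INSERT INTO namespaces (name) VALUES (?)", (name,))
    conn.commit()


def store_manifest(conn: sqlite3.Connection, namespace: str, model: str,
                   version: str, body: str) -> None:
    """Store one manifest; fails if the namespace (tenant) does not exist."""
    conn.execute(
        "INSERT INTO manifests (namespace, model, version, body) VALUES (?, ?, ?, ?)",
        (namespace, model, version, body),
    )
    conn.commit()


def manifests_for(conn: sqlite3.Connection, namespace: str) -> List[Tuple[str, str]]:
    """Return (model, version) pairs visible to one tenant namespace."""
    rows = conn.execute(
        "SELECT model, version FROM manifests WHERE namespace = ? ORDER BY id",
        (namespace,),
    )
    return rows.fetchall()
```

Because every manifest row carries its namespace, a per-tenant query (and an access check in front of it) is a plain filter rather than a scan over opaque keys.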
Cluster maintenance
- Deployments can be launched to multiple clusters (even two clusters at once) -> HA through identical clusters
- The exact same manifests get deployed to two clusters
- Cluster desired state is stored externally to enable effortless upgrades, rescaling, etc.
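A rough sketch of the fan-out described above: the same externally stored manifests are applied to every target cluster. The `launch` function and the `apply` callback are hypothetical; in a real setup `apply` would call each cluster's API, which is left out here.

```python
from typing import Any, Callable, Dict, List

Manifest = Dict[str, Any]
# Hypothetical callback: (cluster name, manifest) -> None
ApplyFn = Callable[[str, Manifest], None]


def launch(desired_state: List[Manifest], clusters: List[str],
           apply: ApplyFn) -> Dict[str, List[str]]:
    """Apply every manifest from the external desired state to every cluster."""
    applied: Dict[str, List[str]] = {cluster: [] for cluster in clusters}
    for cluster in clusters:
        for manifest in desired_state:
            # The manifest is passed through unchanged, so clusters stay identical.
            apply(cluster, manifest)
            applied[cluster].append(manifest["metadata"]["name"])
    return applied
```

Since the desired state lives outside the clusters, replaying the same list against a fresh or upgraded cluster is the same call with a different `clusters` argument.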
Versioning API
- Basically the dock DB
- CRDs are the representations of the inference manifests
- Rollbacks, promotion and history are managed via the CRs
- Why not GitOps: Internal Diffs, deployment overrides, customized features
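One way to picture rollbacks, promotion and history is an append-only revision list per environment. The sketch below is an assumption about how such bookkeeping could work, not the speakers' CRD design; `ManifestHistory` and `promote` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManifestHistory:
    """Append-only list of manifest revisions for one environment (hypothetical)."""
    revisions: List[str] = field(default_factory=list)  # oldest first

    def record(self, manifest: str) -> int:
        """Store a new revision and return its index."""
        self.revisions.append(manifest)
        return len(self.revisions) - 1

    def current(self) -> str:
        """Return the active (most recent) revision."""
        if not self.revisions:
            raise LookupError("no revision recorded yet")
        return self.revisions[-1]

    def rollback(self, revision: int) -> str:
        """Re-activate an older revision by recording it again."""
        if not 0 <= revision < len(self.revisions):
            raise IndexError(f"unknown revision {revision}")
        # Nothing is deleted, so the history also shows that a rollback happened.
        self.record(self.revisions[revision])
        return self.current()


def promote(source: ManifestHistory, target: ManifestHistory) -> int:
    """Copy the active revision of one environment into another."""
    return target.record(source.current())
```

For example, recording `v1` and `v2` in a staging history, rolling back to revision 0 and then promoting leaves production on `v1`, while staging keeps all three entries.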
UX
- User-driven API design
- Customized tools
- Everything gets 1:1 replicated for HA
- Large onboarding guide