Chapter 2

Day 2

Day two is also the official day one of KubeCon (day one was just CloudNativeCon). This is where all the people joined (over 12000 attendees).

The opening keynotes were a mix of talks and panel discussions. The main topic was - who could have guessed - AI and ML.

Subsections of Day 2

Opening Keynote

The opening keynote started - as is the tradition with keynotes - with a “motivational” opening video. The keynote itself was presented by the CEO of the CNCF.

The numbers

  • Over 12000 attendees
  • 10 Years of Kubernetes
  • 60% of large organizations expect rapid cost increases due to AI/ML (FinOps Survey)

The highlights

  • Everyone uses cloud native
  • AI uses Kubernetes b/c the UX is way better than classic tools
    • Especially when transferring from dev to prod
    • We need standardization
  • Open source is cool

Live demo

  • KIND cluster on desktop
  • Prototype Stack (develop on client)
    • Kubernetes with the LLM
    • Host with LLAVA (an image description model), moondream and OLLAMA (the model manager/registry)
  • Prod Stack (All in kube)
    • Kubernetes with LLM, LLAVA, OLLAMA, moondream
  • Available models: LLAVA, Mistral, BakLLaVA (LLAVA × Mistral)
  • Host takes a picture, the AI describes what is pictured (in our case the conference audience) - a rough sketch of this flow follows below
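
Roughly what that flow looks like against a local OLLAMA instance: a small Go sketch (not the presenters' demo code) that sends a picture to the public /api/generate endpoint of a running Ollama server with the llava model pulled and prints the description. The image path and prompt are made up.

    package main

    import (
        "bytes"
        "encoding/base64"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    // Sketch of the demo flow: send a picture to a local Ollama instance
    // running LLaVA and print the generated description.
    func main() {
        img, err := os.ReadFile("audience.jpg") // hypothetical picture taken on the host
        if err != nil {
            panic(err)
        }

        body, _ := json.Marshal(map[string]any{
            "model":  "llava",
            "prompt": "Describe what is pictured.",
            "images": []string{base64.StdEncoding.EncodeToString(img)},
            "stream": false,
        })

        resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Response string `json:"response"`
        }
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        fmt.Println(out.Response)
    }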

AI Keynote discussion

A podium discussion (somewhat scripted) led by Priyanka

Guests

  • Tim from Mistral
  • Paige from Google AI
  • Jeff, founder of OLLAMA

Discussion

  • What was the basis of development for OLLAMA
    • Jeff: The concepts from docker, git, Kubernetes
  • How is the balance between AI engineer and AI ops
    • Jeff: The classic dev vs ops divide - many ML engineers don’t think about ops
    • Paige: Yessir
  • How does infra keep up with the fast research
    • Paige: Well, they don’t - but they do their best and Cloud native is cool
    • Jeff: Well we’re not google, but Kubernetes is the savior
  • What are scaling constraints
    • Jeff: Currently sizing of models is still in its infancy
    • Jeff: There will be more specific hardware and someone will have to support it
    • Paige: Sizing also depends on latency needs (code autocompletion vs performance optimization)
    • Paige: Optimization of smaller models
  • What technologies need to be open source licensed
    • Jeff: The model b/c access and trust
    • Tim: The models and base execution environment -> Vendor agnosticism
    • Paige: Yes and remixes are really important for development
  • Anything else
    • Jeff: How do we bring our awesome tools (monitoring, logging, security) to the new AI world
    • Paige: Currently many people just use paid APIs to abstract the infra, but we need this stuff self-hostable
    • Tim: I don’t want to know about the hardware - the whole infra side should be handled by the cloud native teams so that ML engineers can just be ML engineers

Accelerating AI workloads with GPUs in kubernetes

Watch talk on YouTube

Kevin and Sanjay from NVIDIA

Enabling GPUs in Kubernetes today

  • Host level components: Toolkit, drivers
  • Kubernetes components: Device plugin, feature discovery, node selector (see the sketch after this list)
  • NVIDIA humbly brings you a GPU operator
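
To make that concrete, here is a minimal Go sketch (printed as YAML) of what a workload looks like once the device plugin is in place: the container requests the extended resource nvidia.com/gpu and a node selector keys off a feature-discovery label. The label key and image tag are my own assumptions, not from the talk.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/resource"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/yaml"
    )

    // Sketch: a pod requesting one GPU via the extended resource advertised
    // by the NVIDIA device plugin, scheduled onto GPU nodes via a
    // feature-discovery label (label key and image are illustrative).
    func main() {
        pod := corev1.Pod{
            TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
            ObjectMeta: metav1.ObjectMeta{Name: "cuda-test"},
            Spec: corev1.PodSpec{
                NodeSelector: map[string]string{
                    "nvidia.com/gpu.present": "true", // assumed GPU feature discovery label
                },
                Containers: []corev1.Container{{
                    Name:  "cuda",
                    Image: "nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04", // illustrative tag
                    Resources: corev1.ResourceRequirements{
                        Limits: corev1.ResourceList{
                            "nvidia.com/gpu": resource.MustParse("1"),
                        },
                    },
                }},
                RestartPolicy: corev1.RestartPolicyNever,
            },
        }

        out, _ := yaml.Marshal(pod)
        fmt.Print(string(out))
    }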

GPU sharing

  • Time slicing: Switch around by time
  • Multi-Process Service (MPS): Everything always runs on the GPU, but shares it (space-wise)
  • Multi Instance GPU (MIG): Space-separated sharing on the hardware level
  • Virtual GPU: Virtualizes Time slicing or MIG
  • CUDA Streams: Run multiple kernels in a single app

Dynamic resource allocation

  • A new alpha feature since Kube 1.26 for dynamic resource requesting
  • You just request a resource via the API and have fun (see the sketch below)
  • The sharing itself is an implementation detail
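
For a feel of the “just request a resource” part, a hedged Go sketch of the pod side of DRA as it looked in the early alpha API (roughly Kubernetes 1.26-1.30): the pod points at a pre-created ResourceClaim and a container asks for it in resources.claims. The claim name and image are made up, and the alpha field layout has been changing between releases, so treat the exact fields as assumptions.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/yaml"
    )

    // Hedged sketch of the DRA pod API in its alpha form: the pod references
    // an existing ResourceClaim ("single-gpu", created separately against a
    // vendor ResourceClass) and the container claims it by name. Field names
    // follow the ~1.26-1.30 alpha layout and may differ in newer versions.
    func main() {
        claimName := "single-gpu" // hypothetical, pre-created ResourceClaim

        pod := corev1.Pod{
            TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
            ObjectMeta: metav1.ObjectMeta{Name: "dra-demo"},
            Spec: corev1.PodSpec{
                ResourceClaims: []corev1.PodResourceClaim{{
                    Name:   "gpu",
                    Source: corev1.ClaimSource{ResourceClaimName: &claimName},
                }},
                Containers: []corev1.Container{{
                    Name:  "app",
                    Image: "registry.example.com/inference:latest", // illustrative
                    Resources: corev1.ResourceRequirements{
                        Claims: []corev1.ResourceClaim{{Name: "gpu"}},
                    },
                }},
            },
        }

        out, _ := yaml.Marshal(pod)
        fmt.Print(string(out))
    }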

GPU scale-out challenges

  • NVIDIA Picasso is a foundry for model creation powered by Kubernetes
  • The workload is the training workload split into batches
  • Challenge: Schedule multiple training jobs by different users that are prioritized

Topology aware placements

  • You need thousands of GPUs, but a typical node has 8 GPUs with fast NVLink communication - beyond that you go through network switches
  • Target: optimize related jobs based on GPU node distance and NUMA placement

Fault tolerance and resiliency

  • Stuff can break, resulting in slowdowns or errors
  • Challenge: Detect faults and handle them
  • Observability both in-band and out of band that expose node conditions in Kubernetes
  • Needed: Automated fault-tolerant scheduling

Multidimensional optimization

  • There are different KPIs: starvation, priority, occupancy, fairness
  • Challenge: What to choose (the multidimensional decision problem)
  • Needed: A scheduler that can balance the dimensions

Sponsored: Build an open source platform for ai/ml

Watch talk on YouTube

Jorge Palma from Microsoft with a quick introduction.

Baseline

  • Kubernetes is cool and all
  • Challenges:
    • Containerized models
    • GPUs in the cluster (install, management)

Kubernetes AI Tool chain (KAITO)

  • Kubernetes operator that interacts with
    • Node provisioner
    • Deployment
  • Simple CRD that describes a model, infra and have fun
  • Creates inference endpoint
  • Around 10 models are currently available (Huggingface, LLaMA, etc.)

Optimizing performance and sustainability for ai

Watch talk on YouTube

A panel discussion with moderation by Google and participants from Google, Alluxio, Ampere and CERN. It was pretty scripted with prepared (sponsor specific) slides for each question answered.

Takeaways

  • Deploying an ML model should become as easy as deploying a web app
  • The hardware should be fully utilized -> Better resource sharing and scheduling
  • Smaller LLMs on CPU only is pretty cost-efficient
  • Better scheduling by splitting into storage + CPU (prepare) and GPU (run) nodes to create a just-in-time flow
  • Software acceleration is cool, but we should use more specialized hardware and models to run on CPUs
  • We should be flexible regarding hardware, multi-cluster workloads and hybrid (onprem, burst to cloud) workloads

Cloudnative news show (AI edition)

Watch talk on YouTube

Nikhita presented projects that merge cloud native and AI. Patrick Ohly joined for DRA.

The “news”

  • New work group AI
  • More tools are including AI features
  • New updated CNCF for children feat AI
  • One decade of Kubernetes
  • DRA is in alpha

DRA

  • A new API for resources (node-local and node-attached)
  • Sharing of resources between pods and containers
  • Vendor-specific stuff is abstracted away by a vendor driver/controller
  • The kube scheduler can interact with the vendor parameters for scheduling and autoscaling

Cloud native AI ecosystem

  • Kube is the seed for the AI infra plant
  • Kubeflow users wanted AI registries
  • LLM on the edge
  • OpenTelemetry bring semantics
  • All of these tools form a symbiosis between cloud native and AI
  • These are the topics of the current discussions

The working group AI

  • It was formed in October 2023
  • They worked on the “cloud native and AI” white paper, which was published on March 19, 2024
  • The landscape “cloud native and AI” is WIP and will be merged into the main CNCF landscape
  • The future focus will be on security and cost efficiency (with a hint of sustainability)

LFAI and CNCF

  • The director of the AI foundation talks about AI and cloud native
  • They are looking forward to more collaboration

Is your image really distroless?

Watch talk on YouTube

Laurent Goderre from Docker. The entire talk was very short, but it was a nice demo of init containers

Baseline

  • Security is hard - distroless sounds like a nice helper
  • Basic challenge: the usability-security dilemma -> but more usability doesn’t have to mean less security, it just means more updating work
  • Distro: Kernel + Software Packages + Package manager (optional) -> In Containers just without the kernel
  • Distroless: No package manager, no shell, no web client (curl/wget) - only minimal software bundles

Tools for distroless image creation

  • Multi-Stage Builds: No cleanup needed and better caching
  • Buildkit: More complex, but a pluggable build architecture

The title question

  • Well, many images don’t include a package manager, but they do include a shell and some tools (busybox)
  • Tools are usually only needed at config time (init) -> they just stay around after init - unused
  • Solution: our lord and savior, init containers without any inbound traffic that just do the config stuff

Demo

  • A (rough) distroless Postgres with alpine build step and scratch final step
  • A basic pg:alpine container used for init with a shared data volume
  • The init uses the pg admin user to initialize the pg server (you don’t need the admin credentials after this)

Kube

  • kubectl apply failed b/c there was no internet, but that was fixed by connecting to the Wi-Fi
  • Without the init container the pod just crashes; with the init container the correct config gets created (rough sketch below)
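
A rough reconstruction of that pattern (my sketch, not the presenter’s manifest): a full postgres:alpine init container prepares the data directory on a shared volume, and the distroless main container only runs the server afterwards. Image names and the init command are placeholders.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/yaml"
    )

    // Reconstruction of the demo pattern: the init container (full image,
    // shell included) does the one-time setup on a shared volume, the
    // distroless container only runs postgres. In the real demo the init
    // step used the pg admin user; the command here is a placeholder.
    func main() {
        pod := corev1.Pod{
            TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "Pod"},
            ObjectMeta: metav1.ObjectMeta{Name: "distroless-postgres"},
            Spec: corev1.PodSpec{
                Volumes: []corev1.Volume{{
                    Name:         "pgdata",
                    VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
                }},
                InitContainers: []corev1.Container{{
                    Name:         "init-db",
                    Image:        "postgres:16-alpine", // full image with shell and initdb
                    Command:      []string{"sh", "-c", "initdb -D /var/lib/postgresql/data"},
                    VolumeMounts: []corev1.VolumeMount{{Name: "pgdata", MountPath: "/var/lib/postgresql/data"}},
                }},
                Containers: []corev1.Container{{
                    Name:         "postgres",
                    Image:        "example.com/postgres-distroless:latest", // hypothetical scratch-based image
                    VolumeMounts: []corev1.VolumeMount{{Name: "pgdata", MountPath: "/var/lib/postgresql/data"}},
                }},
            },
        }

        out, _ := yaml.Marshal(pod)
        fmt.Print(string(out))
    }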

Docker compose

  • Just use the service_completed_successfully condition in depends_on

Building a large scale multi-cloud multi-region SaaS platform with kubernetes controllers

Watch talk on YouTube

Interchangeable wording in this talk: controller == operator

A talk by elastic.

About elastic

  • Elastic cloud as a managed service
  • Deployed across AWS/GCP/Azure in over 50 regions
  • 600000+ Containers

Elastic and Kube

  • They offer elastic observability
  • They offer the ECK operator for simplified deployments

The baseline

  • Goal: A large scale (1M+ containers) resilient platform on k8s
  • Architecture
    • Global Control: The control plane (API) for users with controllers
    • Regional Apps: The “shitload” of Kubernetes clusters where the actual customer services live

Scalability

  • Challenge: How large can our cluster be, how many clusters do we need
  • Problem: Only basic guidelines exist for that
  • Decision: Horizontally scale the number of clusters (500-1K nodes each)
  • Decision: Disposable clusters
    • Throw away without data loss
    • Single source of truth is not cluster etcd but external -> No etcd backups needed
    • Everything can be recreated any time

Controllers

Note

I won’t copy the explanations of operators/controllers in these notes

  • Many controllers, including (but not limited to)
    • cluster controller: Register cluster to controller
    • Project controller: Schedule user’s project to cluster
    • Product controllers (Elasticsearch, Kibana, etc.)
    • Ingress/Cert manager
  • Sometimes controllers depend on controllers -> potential complexity
  • Pro:
    • Resilient (Self-healing)
    • Level triggered (desired state vs procedure triggered)
    • Simple reasoning when comparing desired state vs state machine
    • Official controller runtime lib
  • Workqueue: Automatic Dedup, Retry back off and so on

Global Controllers

  • Basic operation
    • Uses project config from Elastic cloud as the desired state
    • The actual state is a k8s resource in another cluster
  • Challenge: Where is the source of truth if the data is not stored in etcd
  • Solution: External data store (Postgres)
  • Challenge: How do we sync the db sources to Kubernetes
  • Potential solutions: Replace etcd with the external db
  • Chosen solution:
    • The controllers don’t use CRDs for storage, but they expose a web-API
    • Reconciliation now interacts with the external db and Go channels (as a queue) instead
    • Then the CRs for the operators get created by the global controller

Large scale

  • Problem: Reconcile gets triggered for all objects on restart -> Make sure nothing gets missed and is used with the latest controller version
  • Idea: Just create more workers for 100K+ Objects
  • Problem: CPU go brrr and db gets overloaded
  • Problem: If you create an item during a restart, it suddenly sits at the end of a 100K+ item work-queue

Reconcile

  • User-driven events are processed asap
  • Reconciliation of everything should still happen, but with low priority, slowly in the background
  • Solution: the status field LastReconciledRevision (a timestamp) gets compared to the revision - if the revision is larger -> user change
  • Prioritization: just a custom event handler with the normal queue and a low-priority one
  • Queue: just a queue that adds items to the normal work-queue with a rate limit (see the Go sketch below)
flowchart LR
    low-->rl(ratelimit)
    rl-->wq(work queue)
    wq-->controller
    high-->wq
  • Argo for CI/CD
  • Crossplane for cluster autoprovision
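
A minimal Go sketch of the queue layout in the flowchart above - my interpretation using client-go’s workqueue and a token-bucket limiter, with names and rates made up: user-driven (“high”) events go straight into the work queue, while the background full reconcile trickles in through a rate limit so it never starves user changes or the database.

    package main

    import (
        "context"
        "fmt"
        "time"

        "golang.org/x/time/rate"
        "k8s.io/client-go/util/workqueue"
    )

    // Sketch of the prioritized queue layout: high-priority events bypass the
    // limiter, low-priority background items are drained into the work queue
    // at a bounded rate. Keys and the rate are illustrative.
    func main() {
        wq := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
        defer wq.ShutDown()

        low := make(chan string, 1000) // fed by the periodic "reconcile everything" pass

        // Drain the low-priority channel into the work queue at a bounded rate.
        limiter := rate.NewLimiter(rate.Limit(10), 1) // ~10 background items/second (assumed)
        go func() {
            for key := range low {
                _ = limiter.Wait(context.Background())
                wq.Add(key)
            }
        }()

        // High-priority (user-driven) events go straight in.
        wq.Add("project/user-change-123")

        // A few background items.
        for i := 0; i < 5; i++ {
            low <- fmt.Sprintf("project/background-%d", i)
        }

        // Minimal worker loop: a real controller would call Reconcile here.
        go func() {
            for {
                item, shutdown := wq.Get()
                if shutdown {
                    return
                }
                fmt.Println("reconciling", item)
                wq.Done(item)
            }
        }()

        time.Sleep(2 * time.Second)
    }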

Safety or usability: Why not both? Towards referential auth in k8s

Watch talk on YouTube

A talk by Google and Microsoft with the premise of better auth in k8s.

Baselines

  • Most access controllers have read access to all secrets -> They are not really designed for keeping these secrets
  • Result: CVEs
  • Example: Just use the nginx ingress, put some Lua code into the config and et voilà: service account token
  • Fix: No more fun

Basic solutions

  • Separate Control (the controller) from data (the ingress)
  • Namespace limited ingress

Current state of cross namespace stuff

  • Why: Reference TLS cert for gateway API in the cert team’s namespace
  • Why: Move all ingress configs to one namespace
  • Classic Solution: Annotations in contour that references a namespace that contains all certs (rewrites secret to certs/secret)
  • Gateway Solution:
    • Gateway TLS secret ref includes a namespace
    • ReferenceGrant pretty much allows referencing from X (Gateway) to Y (Secret)
  • Limits:
    • Has to be implemented via controllers
    • The controllers still have read access to everything - they just check whether they are supposed to do this (see the ReferenceGrant sketch below)
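
For reference, a sketch of what such a ReferenceGrant looks like using the Gateway API Go types (namespace and object names are invented): it lives in the namespace that owns the Secrets and allows Gateways from another namespace to reference them.

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        gatewayv1beta1 "sigs.k8s.io/gateway-api/apis/v1beta1"
        "sigs.k8s.io/yaml"
    )

    // Sketch: a ReferenceGrant in the certificate team's namespace that
    // allows Gateways from the "edge" namespace to reference Secrets here.
    // Namespaces and names are made up for illustration.
    func main() {
        grant := gatewayv1beta1.ReferenceGrant{
            TypeMeta: metav1.TypeMeta{
                APIVersion: "gateway.networking.k8s.io/v1beta1",
                Kind:       "ReferenceGrant",
            },
            ObjectMeta: metav1.ObjectMeta{
                Name:      "allow-gateway-to-certs",
                Namespace: "cert-team", // the namespace that owns the Secrets
            },
            Spec: gatewayv1beta1.ReferenceGrantSpec{
                From: []gatewayv1beta1.ReferenceGrantFrom{{
                    Group:     "gateway.networking.k8s.io",
                    Kind:      "Gateway",
                    Namespace: "edge", // where the Gateways live
                }},
                To: []gatewayv1beta1.ReferenceGrantTo{{
                    Group: "", // core API group
                    Kind:  "Secret",
                }},
            },
        }

        out, _ := yaml.Marshal(grant)
        fmt.Print(string(out))
    }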

Goals

Global

  • Grant access to controller to only resources relevant for them (using references and maybe class segmentation)
  • Allow for safe cross namespace references
  • Make it easy for API devs to adopt it

Personas

  • Alex API author
  • Kai controller author
  • Rohan Resource owner

What our stakeholders want

  • Alex: Define relationships via ReferencePatterns
  • Kai: Specify controller identity (Serviceaccount), define relationship API
  • Rohan: Define cross namespace references (aka resource grants that allow access to their resources)

Result of the paper

Architecture

  • ReferencePattern: Where do I find the references? -> example: GatewayClass in the Gateway API
  • ReferenceConsumer: Who (Identity) has access under which conditions?
  • ReferenceGrant: Allow specific references

POC

  • Minimum access: You only get access if the grant is there AND the reference actually exists
  • Their basic implementation works with the kube API

Open questions

  • Naming
  • Make people adopt this
  • What about namespace-scoped ReferenceConsumer
  • Is there a need of RBAC verb support (not only read access)

Alternative

  • Idea: Just extend RBAC Roles with a selector (match labels, etc.)
  • Problems:
    • Requires changes to Kubernetes core auth
    • Everything but list and watch is a pain
    • How do you handle AND vs OR selection
    • Field selectors: They exist
  • Benefits: Simple controller implementation

Meanwhile

  • Prefer tools that support isolation between controller and data-plane
  • Disable all non-needed features -> Especially scripting

Developers Demand UX for K8s!

Watch talk on YouTube

A talk by UX and software people at Red Hat (Podman team). The talk mainly followed the academic study process (aka this is the survey I did for my bachelor’s/master’s thesis).

Research

  • User research Study including 11 devs and platform engineers over three months
  • Focus was on a new Podman desktop feature
  • Experience ranged from none at all to old-school Kubernetes, with an average of 2-3 years
  • 16 questions regarding environment, workflow, debugging and pain points
  • Analysis: Affinity mapping

Findings

  • Where do I start when things are broken? -> There may be solutions, but devs don’t know about them
  • Network debugging is hard b/c there are many layers, and problems occurring between CNI and infra are really hard to pin down -> network topology issues are rare but hard
  • YAML indentation -> Tool support is needed for visualization
  • YAML validation -> Just use validation in dev and GitOps
  • YAML Cleanup -> Normalize YAML (order, anchors, etc.) for easy diffs (see the sketch after this list)
  • Inadequate security analysis (too verbose, non-issues are warnings) -> Real-time insights (and during dev)
  • Crash Loop -> Identify stuck containers, simple debug containers
  • CLI vs GUI -> Enable experience level oriented GUI, Enhance in-time troubleshooting
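
On the YAML cleanup point, a tiny Go illustration of the normalization idea (not from the talk): a round trip through sigs.k8s.io/yaml resolves anchors, drops comments and emits keys in a stable sorted order, so two manifests can be compared with a plain diff. The file name is made up.

    package main

    import (
        "fmt"
        "os"

        "sigs.k8s.io/yaml"
    )

    // Normalize a manifest by round-tripping it: YAML -> generic map -> YAML.
    // Anchors are expanded, comments are dropped and keys come back sorted.
    func main() {
        in, err := os.ReadFile("deployment.yaml") // hypothetical input manifest
        if err != nil {
            panic(err)
        }

        var obj map[string]interface{}
        if err := yaml.Unmarshal(in, &obj); err != nil {
            panic(err)
        }

        out, err := yaml.Marshal(obj) // stable, sorted key order
        if err != nil {
            panic(err)
        }
        fmt.Print(string(out))
    }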

General issues

  • No direct fs access
  • Multiple kubeconfigs
  • SaaS is sometimes only provided on kube, which sounds like complexity
  • Where do I begin my troubleshooting
  • Interoperability/Fragility with updates

Comparing sidecarless service mesh from cilium and istio

Watch talk on YouTube

Global field CTO at Solo.io with a hint of service mesh background.

History

  • LinkerD 1.X was the first modern service mesh and basically an opt-in service proxy
  • Challenges: JVM (size), latencies, …

Why not node-proxy?

  • Per-node resource consumption is unpredictable
  • Per-node proxy must ensure fairness
  • Blast radius is always the entire node
  • Per-node proxy is a fresh attack vector

Why sidecar?

  • Transparent (ish)
  • Part of app lifecycle (up/down)
  • Single tenant
  • No noisy neighbor

Sidecar drawbacks

  • Race conditions
  • Security of certs/keys
  • Difficult sizing
  • Apps need to be proxy aware
  • Can be circumvented
  • Challenging upgrades (infra and app live side by side)

Our lord and savior

  • Potential solution: eBPF
  • Problem: Not quite the perfect solution
  • Result: We still need a L7 proxy (but some L4 stuff can be implemented in kernel)

Why sidecarless

  • Full transparency
  • Optimized networking
  • Lower resource allocation
  • No race conditions
  • No manual pod injection
  • No credentials in the app

Architecture

  • Control Plane
  • Data Plane
  • mTLS
  • Observability
  • Traffic Control

Cilium

Basics

  • CNI with eBPF on L3/4
  • A lot of nice observability
  • Kubeproxy replacement
  • Ingress (via Gateway API)
  • Mutual Authentication
  • Specialized CiliumNetworkPolicy
  • Configure Envoy through Cilium

Control Plane

  • Cilium-Agent on each node that reacts to scheduled workloads by programming the local data-plane
  • API via Gateway API and CiliumNetworkPolicy
flowchart TD
    subgraph kubeserver
        kubeapi
    end
    subgraph node1
        kubeapi<-->control1
        control1-->data1
    end
    subgraph node2
        kubeapi<-->control2
        control2-->data2
    end
    subgraph node3
        kubeapi<-->control3
        control3-->data3
    end

Data plane

  • Configured by control plane
  • Does all the eBPF things in L4
  • Does all the envoy things in L7
  • In-Kernel WireGuard for optional transparent encryption

mTLS

  • Network policies get applied at the eBPF layer (check whether ID A can talk to ID B)
  • When mTLS is enabled there is an auth check in advance -> if it fails, the agents take over
  • The agents talk to each other for the mTLS auth and save the result to a cache -> now eBPF can say yes
  • Problems: The caches can lead to ID confusion

Istio

Basics

  • L4/7 Service mesh without its own CNI
  • Based on envoy
  • mTLS
  • Classically via sidecar, nowadays also sidecarless (ambient mode)

Ambient mode

  • Separate L4 and L7 -> Can run on cilium
  • mTLS
  • Gateway API

Control plane

flowchart TD
    kubeapi-->xDS

    xDS-->dataplane1
    xDS-->dataplane2

    subgraph node1
        dataplane1
    end

    subgraph node2
        dataplane2
    end
  • Central xDS Control Plane
  • Per-Node Data-plane that reads updates from Control Plane

Data Plane

  • L4 runs via zTunnel Daemonset that handles mTLS
  • The zTunnel traffic gets handed over to the CNI
  • L7 Proxy lives somewhere™ and traffic gets routed through it as an “extra hop” aka waypoint

mTLS

  • The zTunnel creates an HBONE (HTTP-based overlay network) tunnel secured with mTLS

Networking

Who did I talk to today, and are there any follow-ups or learnings?

Operator Framework

  • We talked about the operator lifecycle manager
  • They shared the roadmap and the new release 1.0 will bring support for Operator Bundle loading from any OCI source (no more public-registry enforcement)

Flux

Cloud foundry/Paketo

  • We mostly had some smalltalk
  • There will be a cloud foundry day in Karlsruhe in October, they’d be happy to have us there
  • The whole Korifi (Cloud Foundry on Kubernetes) project is still going strong, but there is no release candidate yet (nor in the near future)

Traefik

Note

They will follow up

  • We mostly talked about traefik hub as an API-portal

Postman

  • I asked them about their new cloud-only stuff: They will keep their direction
  • They are also planning to work on info materials on why postman SaaS is not a big security risk

Mattermost

Note

I should follow up

  • I talked about our problems with the Mattermost operator and was asked to get back to them with the errors
  • They’re currently migrating the Mattermost cloud offering to ARM - therefore ARM support will be coming in the next months
  • The Mattermost guy had exactly the same problems with notifications and read/unread state when using Element

Vercel

  • Nice guys, talked a bit about convincing customers to switch to the edge
  • Also talked about policy validation

Renovate

  • The paid renovate offering now includes build failure estimation
  • I was told not to buy it after telling the technical guy that we just use build pipelines as MR verification

Cert manager

  • The best swag (judged by coolness points)

Upwind

Note

They will follow up with a quick demo

  • A Kubernetes security/runtime security solution with pretty nice looking urgency filters
  • Includes eBPF to see what code actually runs
  • I’ll witness a demo in early/mid April

Isovalent

  • Dinner (very tasty)
  • Cilium still sounds like the way to go in regard to CNIs