Chapter 1

Day 1

Day one is the day for co-located events, aka CloudNativeCon. I spent most of the day attending Platform Engineering Day - as one might have guessed, it's all about platform engineering.

Everything started with badge pickup - a very smooth experience (though that may be related to me showing up an hour or so too early).

Talk recommendations

  • Beyond Platform Thinking at Ritchie Brothers
  • The Hitchhiker's Guide to Kubernetes Platforms
  • To K8S and Beyond – Maturing Your Platform Engineering Initiative

Subsections of Day 1

Opening Keynotes

The first “event” of the day was - as always - the opening keynote. Today presented by Red Hat and Syntasso.

Sometimes lipstick is exactly what a pig needs

Watch talk on YouTube

By VMware (of all people) - kinda funny that they chose this title given the whole Broadcom fun. The main topic of this talk: which interface do we choose for which capability?

Personas

  • Experts: Kubernetes or DB engineers
  • Users: employees that just want to get stuff done
  • Platform engineers: connect users to the services built by experts

Goal

  • Create Interfaces: Connect Users to Services
  • Problem: Many types of Interfaces (SaaS, GUI, CLI) with different capabilities

Dimensions

These are the dimensions of interface design proposed in the talk

  • Autonomy: external dependency (low) <-> self-service (high)
    • low: ticket system -> but sometimes good for getting an expert involved
    • high: portal -> nice, but sometimes we just need a human contact
  • Contextual distance: stay in the same tool (low) <-> switch tools (high)
    • low: IDE plugin -> high potential for friction if stuff goes wrong or gets complex (a context switch is needed)
    • high: wiki or ticketing system
  • Capability skill: anyone can do it (low) <-> made for experts (high)
    • low: transparent sidecar (e.g. a vulnerability scanner)
    • high: CLI
  • Interface skill: anyone can do it (low) <-> needs specialized interface skills (high)
    • low: documentation on the web, wiki-style
    • high: code templates (a sample Helm values.yaml or a raw Terraform provider)

Recap

  • You can use multiple interfaces for one capability
  • APIs (the proverbial pig) are the most important interface b/c they provide the baseline for all other interfaces
  • The beautification (lipstick) of the API through other interfaces makes users happy

Beyond Platform Thinking at Ritchie Brothers - Build Things No One Expects, in a Place No One Expects

Watch talk on YouTube

The story of how Thoughtworks (TW) built a platform at Ritchie Bros (RB), presented by the implementers at TW.

Background

  • RB is an auctioneer in the field of heavy machinery
  • Problem: they are old(ish) and own a bunch of other companies -> duplicate solutions
  • Goals
    • Get rid of duplicates
    • Scale without the need for more personnel

Platform creation principles

  • Platform is a product
  • Building is an exercise in software engineering, not operations
  • Reduce dev friction

Platform overview

  • Platform provides self-services
  • Teams manage everything inside their namespace themselves
  • Multiple global locations that can be opted in and out of

Principles and Solutions

Compliance at source of change

Developers own their pipelines

  • Dev teams are responsible for scanning, etc.
  • Platform verifies that the compliance scans have been done (through admission control)
  • Examples (a hypothetical admission sketch follows this list):
    • OPA + Gatekeeper for admission -> teams use Snyk for scanning and admission checks the scan results
    • Jira as admission hook for approvals -> the PO approves in Jira, and admission only accepts if the webhook reports approval
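
The talk didn't show policy code, so here is a minimal sketch of how "admission verifies the scan happened" could look with OPA Gatekeeper - the annotation key and the rule are assumptions, not RB's actual implementation:

```yaml
# Hypothetical ConstraintTemplate: reject workloads that carry no
# scan-result annotation (the annotation key is invented for illustration).
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requirescanresult
spec:
  crd:
    spec:
      names:
        kind: RequireScanResult
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirescanresult

        violation[{"msg": msg}] {
          not input.review.object.metadata.annotations["scans.example.com/result"]
          msg := "workload is missing a compliance scan result annotation"
        }
```

A corresponding RequireScanResult constraint would then bind this template to the resource kinds that must prove a scan.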

Platform Operators

  • Implemented: S3 Operator, IAM Operator, DynamoDB Operator
  • Reasons:
    • Devs should not need direct access to AWS/GCP
    • Teams have full control while not needing to submit tickets or write Terraform
  • Goals
    • Abstract specific details away
    • Make the results cloud-portable (AWS, GCP, Azure)
    • Still retain developer transparency
  • Example: DynamoDB database (a hypothetical manifest follows this list)
    1. User: creates a DynamoDB CR and a ServiceRole CR
    2. K8s: creates Pods, Secrets, Configs and a ServiceAccount (tied to an IAM Role)
    3. User: creates an S3 Bucket CR and assigns the ServiceRole
    4. K8s: injects secrets and configs where needed
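
Such a team-facing CR was not shown in the talk; the group, kind and fields below are invented purely to illustrate the shape:

```yaml
# Hypothetical DynamoDB custom resource as a team would write it;
# the operator would reconcile this into an actual AWS table.
apiVersion: db.platform.example.com/v1alpha1
kind: DynamoDB
metadata:
  name: orders
  namespace: team-auctions
spec:
  serviceRoleRef:
    name: orders-role          # ServiceRole CR providing the IAM identity
  billingMode: PAY_PER_REQUEST
  hashKey: orderId
```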

Observability

  • Tool: Honeycomb
  • Metrics: OpenTelemetry
    • Operator reconcile steps are exposed as traces

Q&A

  • Your teams are pretty autonomous -> what about more classic teams? Over a multi-year journey, every team settles into the ownership and self-service approach
  • How do teams get access to stages? They just get themselves a stage namespace, attach to the ingress and have fun (admission handles the rest) - roughly as sketched below
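
What "get a stage namespace and attach to ingress" might look like in plain manifests - the names and the host are invented:

```yaml
# A team-owned staging namespace ...
apiVersion: v1
kind: Namespace
metadata:
  name: team-auctions-staging
---
# ... and an Ingress that attaches the team's service to the shared entrypoint.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: team-auctions-staging
spec:
  rules:
    - host: app.staging.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```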

Blueprints of Innovation: Engineering Paved Paths for a User-Friendly Developer Platform

Watch talk on YouTube

This talk was by a New York Times software developer. No real value

Baseline

  • How do we build composable components?
  • Workflow of a new service: Create/Onboard -> Develop -> Build/Test/deploy (CI/CD) -> Run (Runtime/Cloud) -> Route (Ingress)

What do we need

  • User documentation
  • Adoption & Partnership
  • Platform as a Product
  • Customer feedback

Key Takeaways from Scaling Adobe's CI/CD Solution to Support >50K Argo CD Apps

Watch talk on YouTube

Part of the Multi-tenancy Con, presented by Adobe

Challenges

  • Spin up Edge Infra globally fast

Implementation

First try - Single Tenant Cluster

  • Azure as the base - AWS on the edge
  • Single-tenant clusters (simpler governance)
  • Responsibility is shared between app and platform (monitoring, ingress, etc.)
  • Problem: huge manual investment and over-provisioning
  • Result: access control to tenant namespaces and capacity planning -> pretty much a multi-tenant cluster with one tenant per cluster

Second Try - Micro Clusters

  • One Cluster per Service

Third Try - Multi-tenancy

  • Use a bunch of components deployed by the platform team (Ingress, CI/CD, Monitoring, …)
  • Harmonized general runtime (cloud-agnostic), code-named Ethos -> over 300 clusters
  • Both shared clusters (shared by namespace) and dedicated clusters
  • Cluster config is a basic JSON with name, capacity and teams
  • Capacity management is monitored using Prometheus
  • Cluster changes should be nondestructive -> K8s-Shredder
  • Cost efficiency: use good PDBs and liveness/readiness probes alongside resource requests and limits (see the sketch after this list)
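
A minimal sketch of those knobs on one workload - names and numbers are placeholders, not Adobe's values:

```yaml
# Placeholder Deployment with probes plus resource requests/limits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
---
# A PDB so voluntary disruptions (e.g. nondestructive cluster changes)
# never drop below the needed capacity.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```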

Conclusion

  • There is a balance between cost, customization, setup effort and security when choosing between single-tenant and multi-tenant

Lightning talks

The lightning talks are 10-minute talks by different CNCF projects.

Building containers at scale using buildpacks

A Project lightning talk by Heroku and the CNCF buildpacks.

How and why buildpacks?

  • What: A simple way to build reproducible container images
  • Why: Scale, Reuse, Rebase: Buildpacks are structured as layers
    • Dependencies, app builds and the runtime are separated -> easy updates
  • How: use the Pack CLI: `pack build <image>`, then `docker run <image>`

Konveyor

A platform for migrating legacy apps to cloud-native platforms.

  • Parts: Hub, Analysis (with language server), Assessment
  • Roadmap: multi-language support, GenAI, asset generation (e.g. Kube deployments)

Argo’s Community Driven Development

Pretty much a short introduction to the Argo project

  • Project parts: Workflows (CI), Events, CD, Rollouts
  • NPS: Net Promoter Score (how likely are you to recommend this?) -> everyone loves Argo (based on their survey)
  • Rollouts: rollout analysis can be driven by Prometheus metrics (see the sketch after this list)
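
A minimal sketch of what metric-driven analysis looks like in Argo Rollouts - the query and the threshold are placeholders:

```yaml
# AnalysisTemplate: judge a canary by its HTTP success rate scraped
# from Prometheus; if it drops below 95%, the rollout is aborted.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
```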

Flux

  • Components: Helm, Kustomize, Terraform, …
  • Flagger now supports Gateway API, Prometheus, Datadog and more
  • New Releases

A quick look at the TAG App-Delivery

  • Mission: Everything related to cloud-native application delivery
  • Bi-Weekly Meetings
  • Subgroup: Platforms

The Hitchhiker's Guide to Kubernetes Platforms: Don’t Panic, Just Launch!

Watch talk on YouTube

This talk looks at bootstrapping platforms using KServe, with a focus on AI workflows.

Scenario

  • Deploy AI workloads - sometimes consisting of different parts
  • Models get stored in a model registry

Baseline

  • Consistent APIs throughout the platform
  • Not the Kube API directly b/c:
    • Data scientists are a bit overwhelmed by the Kube API
    • It's not only Kubernetes (also monitoring tools, feedback tools, etc.)
    • Better debugging experience for specific workloads

The debugging API

  • Specific API with enhanced statuses and a consistent UX across code and UI
  • Example endpoints: Pods, Deployments, InferenceServices
  • Provides a status summary -> consistent health info across all related resources
    • Example: Deployments have progress/availability, Pods have phases, Containers have readiness -> how should each be interpreted?
    • Evaluation: progressing, available count vs. readiness, ReplicaFailure, pod phase, container readiness
  • The rules themselves may be pretty complex, but - since the user doesn't have to check them themselves - the resulting status is simple (illustrated after this list)
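
The API is internal and wasn't shown, so this response shape is purely illustrative of "one simple verdict derived from many signals":

```yaml
# Hypothetical response of the debugging API for one workload
kind: StatusSummary
resource: deployments/recommender
status: Degraded                     # single, pre-interpreted verdict
reasons:
  - 2/3 pods Ready                   # pod phase + container readiness
  - progress deadline exceeded       # Deployment progressing condition
checked:
  - Deployment: progressing, available count
  - Pods: phase
  - Containers: readiness
```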

Debugging Metrics

  • Dashboards (Utilization, throughput, latency)
  • Events
  • Logs

Deployment API

  • Launchpad: just select your model and version -> the DB (dock) stores all manifests (spaceships)
  • Manifests relate to models from a model registry (presumably KServe resources - see the sketch after this list)
  • Multi-tenancy is implemented using K8s namespaces
  • Kine is used to replace/extend etcd with the relational dock DB -> the namespace<->manifest relation is stored here and RBAC can be used
  • Launchpad: select a namespace and check resource (fuel) availability/utilization
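
Given that InferenceServices appear among the example endpoints, the stored manifests presumably resemble KServe resources; a minimal example, with names and storage URI as placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: iris-classifier
  namespace: team-a                            # tenant namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://model-registry/iris/v3  # points into the model registry
```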

Cluster maintenance

  • Deployments can be launched to multiple clusters (even two clusters at once) -> HA through identical clusters
  • The exact same manifests get deployed to two clusters
  • Cluster desired state is stored externally to enable effortless upgrades, rescaling, etc.

Versioning API

  • Basically the dock DB
  • CRDs are the representations of the inference manifests
  • Rollbacks, promotion and history are managed via the CRs
  • Why not GitOps: Internal Diffs, deployment overrides, customized features

UX

  • User driven API design
  • Customized tools
  • Everything gets 1:1 replicated for HA
  • Large onboarding guide

From Zero to Hero: Scaling Postgres in Kubernetes Using the Power of CloudNativePG

A short talk as part of the Data on Kubernetes Day - presented by the VP of Cloud Native at EDB (one of the biggest PG contributors). Stated target: make the world your single point of failure.

Proposal

  • Get rid of vendor lock-in by using the OSS projects PG, K8s and CloudNativePG (CNPG)
  • PG was the DB of the year 2023 and a bunch of other times in the past
  • CNPG is a level-5 mature operator

4 Pillars

  • Seamless Kube API Integration (Operator Pattern)
  • Advanced observability (Prometheus Exporter, JSON logging)
  • Declarative Config (Deploy, Scale, Maintain)
  • Secure by default (Robust containers, mTLS, and so on)

Clusters

  • Basic resource that defines name, instances, sync and storage (other parameters have sane defaults) - see the minimal example after this list
  • Implementation: the operator creates:
    • The volumes (PGDATA and WAL, the write-ahead log)
    • Primary and read-write Service
    • Replicas
    • Read-only Service (points at the replicas)
  • Failover:
    • Failure detected
    • Stop R/W Service
    • Promote replica
    • Activate R/W Service
    • Kill the old primary and demote it to a replica
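
A minimal CNPG Cluster along those lines - three instances and a storage size, everything else left at defaults:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo
spec:
  instances: 3   # one primary, two replicas
  storage:
    size: 10Gi
```

From this single resource the operator derives the volumes, the read-write/read-only services (pg-demo-rw, pg-demo-ro) and the replicas described above.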

Backup/Recovery

  • Continuous backup: write-ahead log backup to an object store
  • Physical: create from primary or standby to an object store or Kube volumes (an on-demand backup request is sketched after this list)
  • Recovery: copy the full backup and apply WAL until the target (last transaction or a specific timestamp) is reached
  • Replica cluster: basically recreates a new cluster up to full recovery but keeps the cluster in read-only replica mode
  • Planned: backup plugin interface
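
Requesting an on-demand physical backup is itself declarative; this sketch assumes the cluster already has a configured object-store destination:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pg-demo-backup
spec:
  cluster:
    name: pg-demo   # the Cluster resource from the earlier example
```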

Multi-Cluster

  • Just create a replica cluster via WAL files from S3 on another Kube cluster (it lags about 5 minutes behind) - see the sketch after this list
  • You can also activate replication streaming
  • Dev cluster: 1 instance without a PDB but with continuous backup
  • Prod: 3 nodes with automatic failover and continuous backups
  • Symmetric: two clusters
    • Primary: 3-node cluster
    • Secondary: WAL-based 3-node cluster with a designated primary (to take over if the primary cluster fails)
  • Symmetric streaming: same as symmetric, but you manually enable the streaming API for live replication
  • Cascading replication: scale symmetric to more clusters
  • Single availability zone: well, do your best to spread across nodes, and aspire to stretch Kubernetes across more AZs
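
A sketch of such a replica cluster, bootstrapped from the primary's WAL archive - names and the S3 path are assumptions, and credentials are omitted:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo-replica        # runs on the second Kube cluster
spec:
  instances: 3
  storage:
    size: 10Gi
  replica:
    enabled: true              # stay in read-only replica mode
    source: pg-demo
  bootstrap:
    recovery:
      source: pg-demo
  externalClusters:
    - name: pg-demo
      barmanObjectStore:
        destinationPath: s3://backups/pg-demo   # the primary's WAL archive
```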

Roadmap

  • Replica Cluster (Symmetric) Switchover
  • Synchronous Symmetric
  • 3rd Party Plugins
  • Manage DBs via the Operator
  • Storage Autoscaling

Unleashing the Power of Serverless on Kubernetes with Knative, Crossplane, Dapr, KEDA, and Friends

Some quotes from the speaker to set the mood:

  • “When I say serverless I don’t mean Lambda - I mean serverless”
  • “That is thousands of lines of YAML - but I don’t want to depress you”
  • “It will be eventually done”
  • “Imagine this error is not happening”
  • “Just imagine how I did this last night”

Goal

  • Take my source code and run it, scale it - just don’t ask me how

Baseline

  • Use Kubernetes as the platform
  • Use Knative for autoscaling
  • Use Kaniko/Shipwright for building
  • Use Dapr for inter-service communication

OpenFunction

The glue between different tools to achieve serverless

  • A Function CRD that describes (hypothetical sketch after this list):
    • Build this image and push it to the registry
    • Use this builder to build my project
    • This is my repo
    • My app listens on this port
    • Annotations
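
A hypothetical manifest covering those bullets, loosely modeled on OpenFunction's Function resource - treat the field names as illustrative rather than the exact schema:

```yaml
apiVersion: core.openfunction.io/v1beta2
kind: Function
metadata:
  name: hello
spec:
  image: registry.example.com/demo/hello:v1     # build this image, push it here
  build:
    builder: openfunction/builder-go:latest     # use this builder
    srcRepo:
      url: https://github.com/example/hello.git # this is my repo
  serving:
    triggers:
      http:
        port: 8080                              # my app listens on this port
```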

Dependencies

  • Open questions
    • Where are the serverless servers? -> cluster, dependencies, secrets
    • How do I create DBs, etc.?
  • Resulting needs
    • CLUSTERaaS (using Crossplane - in this case on AWS)
    • DBaaS (using Crossplane - again PG on AWS; a claim sketch follows this list)
    • APPaaS
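
With Crossplane, the "DBaaS" need typically surfaces as a claim against a platform-defined composite type; the group, kind and parameters below are whatever the platform team defines, so treat them as invented:

```yaml
apiVersion: platform.example.com/v1alpha1
kind: PostgreSQLInstance        # claim for a platform-defined composite (XRD)
metadata:
  name: orders-db
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws             # route to the AWS-backed composition
```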

Lessons Learned from Building a Database Operator

Another talk as part of the Data On Kubernetes Day.

Baseline

Hosting Models

  • Managed: Atlas
  • Semi-managed: Cloud Manager
  • Self-hosted: Enterprise and Community Operator

MongoDB on K8s

  • Cluster Architecture
    • Control Plane: Operator
    • Data Plane: MongoDB Server + Agent (Sidecar Proxy)
  • Enterprise Operator
    • OpsManager CR: deploys a 3-node operator DB and OpsManager
    • MongoDB CR: the MongoDB clusters (composed of agents) - roughly as sketched after this list
  • Advanced use case: data platform with MongoDB on demand
    • Control plane on one cluster (or on VMs/bare metal), data plane in tenant clusters
    • Result: the MongoDB CR cannot reference the OpsManager CR directly
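
For reference, a MongoDB CR for the Enterprise Operator looks roughly like this - the project and credentials names are placeholders:

```yaml
apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: my-replica-set
spec:
  type: ReplicaSet
  members: 3
  version: "6.0.5"
  opsManager:
    configMapRef:
      name: my-project        # links the cluster to an OpsManager project
  credentials: my-credentials # secret with OpsManager API credentials
```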

Pitfalls

  • Storage: agnostic, topology-aware, configurable and resizable (can't be done with StatefulSets)
  • Networking: cluster-internal (pod-to-pod/Service), external (split horizon over multi-cluster)

To K8S and Beyond – Maturing Your Platform Engineering Initiative

Watch talk on YouTube

CNCF Platform maturity model

  • Was donated to the CNCF by Syntasso
  • Constantly evolving since 1.0 in November 2023

Overview

Entire matrix is available from CNCF

  • Levels (from tactical to strategic)
    • Provisional
    • Operational
    • Scalable
    • Optimizing
  • Dimensions:
    • Investment: How are funds/staff allocated to platform capabilities
    • Adoption: How and why do users discover this platform
    • Interfaces: How do users interact with and consume platform capabilities
    • Operations: How are platforms and capabilities planned, prioritized, developed and maintained
    • Measurement: What is the process for gathering and incorporating feedback/learning?

Goals

  • Understand
    • Outcomes & Practices
    • Where are you at
    • Limits & Opportunities
    • Behaviors and outcomes
  • Balance People and processes

Typical Journeys

Steps of the journey

  1. What are your goals and limitations
  2. What is my current landscape
  3. Plan baby steps & iterate

Scenarios

  • Bad: I want to improve my k8s platform
  • Good: scaling an enterprise COE (Center of Excellence)
    • What: onboard 20 teams within 20 months and enforce 8 security regulations
    • Where: we have a dedicated team of centrally funded people
    • Lay the foundation: more funding for more and larger teams -> switch from a project to a platform mindset
    • Do your technical due diligence in parallel

Key Lessons

  • Know what your ultimate goals and constraints are
  • Know your landscape
  • Plan in baby steps and iterate
    • Lay the foundation for building the right thing and not just anything
    • Don’t forget to do your technical due diligence in parallel

Conclusion

  • Maturity model is a helpful part but not the entire plan

What Is Going on Within My Network? A Subtle Introduction to Cilium Hubble

Held by the Cilium folks, covering eBPF and Hubble

eBPF

Extend the capabilities of the kernel without having to change the kernel source code or load kernel modules

  • Benefits: reduced performance overhead and deep visibility, while being widely available
  • Example Tools: Parca (Profiling), Cilium (Networking), Hubble (Observability), Tetragon (Security)

Cilium

Open-source solution for network connectivity between workloads

Hubble

Observability layer for Cilium

Feature set

  • CLI: tcpdump on steroids + API client
  • UI: graphical dependency and connectivity map
  • Prometheus + Grafana + OpenTelemetry compatible
  • Metrics up to L7 (a Helm values sketch for enabling all this follows this list)
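
Enabling Hubble with the relay, the UI and L7-capable metrics through the Cilium Helm chart might look like the following values fragment - the exact metric list is an assumption:

```yaml
# values.yaml fragment for the Cilium Helm chart
hubble:
  enabled: true
  relay:
    enabled: true      # aggregates flow data from all nodes
  ui:
    enabled: true      # the graphical dependency/connectivity map
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - http           # metrics up to L7
```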

Where can it be used

  • Service dependency with frequency
  • Kinds of HTTP calls
  • Network Problems between L4 and L7 (including DNS)
  • Application Monitoring through status codes and latency
  • Security-Related Network Blocks
  • Services accessed from outside the cluster

Architecture

  • Cilium Agent: runs as the CNI for all pods
  • Server: runs on each node and retrieves the eBPF data from Cilium
  • Relay: provides visibility across all nodes

TL;DR

Hubble looks pretty nice