Chapter 1

Day 1

Day one is the day for co-located events, aka CloudNativeCon. I spent most of the day attending Platform Engineering Day - as one might have guessed, it's all about platform engineering.

Everything started with badge pickup - a very smooth experience (though that may be related to me showing up an hour or so too early).

Talk recommendations

  • Beyond Platform Thinking at Ritchie Brothers
  • The Hitchhiker's Guide to Kubernetes Platforms
  • To K8S and Beyond – Maturing Your Platform Engineering Initiative

Subsections of Day 1

Opening Keynotes

The first “event” of the day was - as always - the opening keynote. Today presented by Red Hat and Syntasso.

Sometimes lipstick is exactly what a pig needs

Watch talk on YouTube

By VMware (of all people) - kinda funny that they chose this title given the whole Broadcom fun. The main topic of this talk: which interface do we choose for which capability?

Personas

  • Experts: Kubernetes or DB engineers
  • Users: employees that just want to get stuff done
  • Platform engineers: connect users to the services built by experts

Goal

  • Create Interfaces: Connect Users to Services
  • Problem: Many types of Interfaces (SaaS, GUI, CLI) with different capabilities

Dimensions

These are the dimensions of interface design proposed in the talk

  • Autonomy: external dependency (low) <-> self-service (high)
    • low: ticket system -> but sometimes good for getting an expert involved
    • high: portal -> nice, but sometimes we just need a human contact
  • Contextual distance: stay in the same tool (low) <-> switch tools (high)
    • low: IDE plugin -> high potential for friction if stuff goes wrong or gets complex (a context switch is needed)
    • high: wiki or ticketing system
  • Capability skill: anyone can do it (low) <-> made for experts (high)
    • low: transparent sidecar (e.g. a vulnerability scanner)
    • high: CLI
  • Interface skill: anyone can do it (low) <-> needs specialized interface skills (high)
    • low: documentation on the web, wiki-style
    • high: code templates (a sample Helm values.yaml or a raw Terraform provider)

Recap

  • You can use multiple interfaces for one capability
  • APIs (the proverbial pig) are the most important interface b/c they provide the baseline for all other interfaces
  • The beautification (lipstick) of the API through other interfaces makes users happy

Beyond Platform Thinking at Ritchie Brothers - Build Things No One Expects, in a Place No One Expects

Watch talk on YouTube

The story of how Thoughtworks (TW) built a platform at Ritchie Bros (RB), presented by the implementers at TW.

Background

  • RB is an auctioneer in the field of heavy machinery
  • Problem: they are old(ish) and own a bunch of other companies -> duplicate solutions
  • Goals
    • Get rid of duplicates
    • Scale without the need for more personnel

Platform creation principles

  • Platform is a product
  • Building is an exercise in software engineering, not operations
  • Reduce dev friction

Platform overview

  • Platform provides self-services
  • Teams manage everything inside their namespace themselves
  • Multiple global locations that can be opted in and out of

Principles and Solutions

Compliance at source of change

Developers own their pipelines

  • Dev teams are responsible for scanning, etc.
  • Platform verifies that the compliance scans have been done (through admission control)
  • Examples (a hypothetical admission sketch follows this list):
    • OPA + Gatekeeper for admission -> teams use Snyk for scanning and admission checks the scan results
    • Jira as admission hook for approvals -> the PO approves in Jira, and admission only accepts if the webhook reports approval
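
The talk didn't show policy code, so here is a minimal sketch of how "admission verifies the scan happened" could look with OPA Gatekeeper - the annotation key and the rule are assumptions, not RB's actual implementation:

```yaml
# Hypothetical ConstraintTemplate: reject workloads that carry no
# scan-result annotation (the annotation key is invented for illustration).
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requirescanresult
spec:
  crd:
    spec:
      names:
        kind: RequireScanResult
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requirescanresult

        violation[{"msg": msg}] {
          not input.review.object.metadata.annotations["scans.example.com/result"]
          msg := "workload is missing a compliance scan result annotation"
        }
```

A corresponding RequireScanResult constraint would then bind this template to the resource kinds that must prove a scan.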

Platform Operators

  • Implemented: S3 Operator, IAM Operator, DynamoDB Operator
  • Reasons:
    • Devs should not need direct access to AWS/GCP
    • Teams have full control while not needing to submit tickets or write Terraform
  • Goals
    • Abstract specific details away
    • Make the results cloud-portable (AWS, GCP, Azure)
    • Still retain developer transparency
  • Example: DynamoDB database (a hypothetical manifest follows this list)
    1. User: creates a DynamoDB CR and a ServiceRole CR
    2. K8s: creates Pods, Secrets, Configs and a ServiceAccount (tied to an IAM Role)
    3. User: creates an S3 Bucket CR and assigns the ServiceRole
    4. K8s: injects secrets and configs where needed
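
Such a team-facing CR was not shown in the talk; the group, kind and fields below are invented purely to illustrate the shape:

```yaml
# Hypothetical DynamoDB custom resource as a team would write it;
# the operator would reconcile this into an actual AWS table.
apiVersion: db.platform.example.com/v1alpha1
kind: DynamoDB
metadata:
  name: orders
  namespace: team-auctions
spec:
  serviceRoleRef:
    name: orders-role          # ServiceRole CR providing the IAM identity
  billingMode: PAY_PER_REQUEST
  hashKey: orderId
```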

Observability

  • Tool: Honeycomb
  • Metrics: OpenTelemetry
    • Operator reconcile steps are exposed as traces

Q&A

  • Your teams are pretty autonomous -> what about more classic teams? Over a multi-year journey, every team settles into the ownership and self-service approach
  • How do teams get access to stages? They just get themselves a stage namespace, attach to the ingress and have fun (admission handles the rest) - roughly as sketched below
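
What "get a stage namespace and attach to ingress" might look like in plain manifests - the names and the host are invented:

```yaml
# A team-owned staging namespace ...
apiVersion: v1
kind: Namespace
metadata:
  name: team-auctions-staging
---
# ... and an Ingress that attaches the team's service to the shared entrypoint.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  namespace: team-auctions-staging
spec:
  rules:
    - host: app.staging.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```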

Blueprints of Innovation: Engineering Paved Paths for a User-Friendly Developer Platform

Watch talk on YouTube

This talk was by a New York Times software developer. No real value

Baseline

  • How do we build composable components?
  • Workflow of a new service: Create/Onboard -> Develop -> Build/Test/deploy (CI/CD) -> Run (Runtime/Cloud) -> Route (Ingress)

What do we need

  • User documentation
  • Adoption & Partnership
  • Platform as a Product
  • Customer feedback

Key Takeaways from Scaling Adobe's CI/CD Solution to Support >50K Argo CD Apps

Watch talk on YouTube

Part of the Multi-tenancy Con, presented by Adobe

Challenges

  • Spin up Edge Infra globally fast

Implementation

First try - Single Tenant Cluster

  • Azure as the base - AWS on the edge
  • Single-tenant clusters (simpler governance)
  • Responsibility is shared between app and platform (monitoring, ingress, etc.)
  • Problem: huge manual investment and over-provisioning
  • Result: access control to tenant namespaces and capacity planning -> pretty much a multi-tenant cluster with one tenant per cluster

Second Try - Micro Clusters

  • One Cluster per Service

Third Try - Multi-tenancy

  • Use a bunch of components deployed by the platform team (Ingress, CI/CD, Monitoring, …)
  • Harmonized general runtime (cloud-agnostic), code-named Ethos -> over 300 clusters
  • Both shared clusters (shared by namespace) and dedicated clusters
  • Cluster config is a basic JSON with name, capacity and teams
  • Capacity management is monitored using Prometheus
  • Cluster changes should be nondestructive -> K8s-Shredder
  • Cost efficiency: use good PDBs and liveness/readiness probes alongside resource requests and limits (see the sketch after this list)
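
A minimal sketch of those knobs on one workload - names and numbers are placeholders, not Adobe's values:

```yaml
# Placeholder Deployment with probes plus resource requests/limits.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.0
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          readinessProbe:
            httpGet: { path: /healthz, port: 8080 }
          livenessProbe:
            httpGet: { path: /healthz, port: 8080 }
---
# A PDB so voluntary disruptions (e.g. nondestructive cluster changes)
# never drop below the needed capacity.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```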

Conclusion

  • There is a balance between cost, customization, setup effort and security when choosing between single-tenant and multi-tenant

Lightning talks

The lightning talks are 10-minute talks by different CNCF projects.

Building containers at scale using buildpacks

A Project lightning talk by Heroku and the CNCF buildpacks.

How and why buildpacks?

  • What: A simple way to build reproducible container images
  • Why: Scale, Reuse, Rebase: Buildpacks are structured as layers
    • Dependencies, app builds and the runtime are separated -> easy updates
  • How: use the Pack CLI: `pack build <image>`, then `docker run <image>`

Konveyor

A platform for migrating legacy apps to cloud-native platforms.

  • Parts: Hub, Analysis (with language server), Assessment
  • Roadmap: multi-language support, GenAI, asset generation (e.g. Kube deployments)

Argo’s Community Driven Development

Pretty much a short introduction to the Argo project

  • Project parts: Workflows (CI), Events, CD, Rollouts
  • NPS: Net Promoter Score (how likely are you to recommend this?) -> everyone loves Argo (based on their survey)
  • Rollouts: rollout analysis can be driven by Prometheus metrics (see the sketch after this list)
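
A minimal sketch of what metric-driven analysis looks like in Argo Rollouts - the query and the threshold are placeholders:

```yaml
# AnalysisTemplate: judge a canary by its HTTP success rate scraped
# from Prometheus; if it drops below 95%, the rollout is aborted.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{status=~"2.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
```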

Flux

  • Components: Helm, Kustomize, Terraform, …
  • Flagger now supports Gateway API, Prometheus, Datadog and more
  • New Releases

A quick look at the TAG App-Delivery

  • Mission: Everything related to cloud-native application delivery
  • Bi-Weekly Meetings
  • Subgroup: Platforms

The Hitchhiker's Guide to Kubernetes Platforms: Don’t Panic, Just Launch!

Watch talk on YouTube

This talk looks at bootstrapping platforms using KServe, with a focus on AI workflows.

Scenario

  • Deploy AI workloads - sometimes consisting of different parts
  • Models get stored in a model registry

Baseline

  • Consistent APIs throughout the platform
  • Not the Kube API directly b/c:
    • Data scientists are a bit overwhelmed by the Kube API
    • It's not only Kubernetes (also monitoring tools, feedback tools, etc.)
    • Better debugging experience for specific workloads

The debugging API

  • Specific API with enhanced statuses and a consistent UX across code and UI
  • Example endpoints: Pods, Deployments, InferenceServices
  • Provides a status summary -> consistent health info across all related resources
    • Example: Deployments have progress/availability, Pods have phases, Containers have readiness -> how should each be interpreted?
    • Evaluation: progressing, available count vs. readiness, ReplicaFailure, pod phase, container readiness
  • The rules themselves may be pretty complex, but - since the user doesn't have to check them themselves - the resulting status is simple (illustrated after this list)
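
The API is internal and wasn't shown, so this response shape is purely illustrative of "one simple verdict derived from many signals":

```yaml
# Hypothetical response of the debugging API for one workload
kind: StatusSummary
resource: deployments/recommender
status: Degraded                     # single, pre-interpreted verdict
reasons:
  - 2/3 pods Ready                   # pod phase + container readiness
  - progress deadline exceeded       # Deployment progressing condition
checked:
  - Deployment: progressing, available count
  - Pods: phase
  - Containers: readiness
```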

Debugging Metrics

  • Dashboards (Utilization, throughput, latency)
  • Events
  • Logs

Deployment API

  • Launchpad: just select your model and version -> the DB (dock) stores all manifests (spaceships)
  • Manifests relate to models from a model registry (presumably KServe resources - see the sketch after this list)
  • Multi-tenancy is implemented using K8s namespaces
  • Kine is used to replace/extend etcd with the relational dock DB -> the namespace<->manifest relation is stored here and RBAC can be used
  • Launchpad: select a namespace and check resource (fuel) availability/utilization
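
Given that InferenceServices appear among the example endpoints, the stored manifests presumably resemble KServe resources; a minimal example, with names and storage URI as placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: iris-classifier
  namespace: team-a                            # tenant namespace
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://model-registry/iris/v3  # points into the model registry
```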

Cluster maintenance

  • Deployments can be launched to multiple clusters (even two clusters at once) -> HA through identical clusters
  • The exact same manifests get deployed to two clusters
  • Cluster desired state is stored externally to enable effortless upgrades, rescaling, etc.

Versioning API

  • Basically the dock DB
  • CRDs are the representations of the inference manifests
  • Rollbacks, promotion and history are managed via the CRs
  • Why not GitOps: Internal Diffs, deployment overrides, customized features

UX

  • User driven API design
  • Customized tools
  • Everything gets 1:1 replicated for HA
  • Large onboarding guide

From Zero to Hero: Scaling Postgres in Kubernetes Using the Power of CloudNativePG

A short talk as part of the Data on Kubernetes Day - presented by the VP of Cloud Native at EDB (one of the biggest PG contributors). Stated target: make the world your single point of failure.

Proposal

  • Get rid of vendor lock-in by using the OSS projects PG, K8s and CloudNativePG (CNPG)
  • PG was the DB of the year 2023 and a bunch of other times in the past
  • CNPG is a level-5 mature operator

4 Pillars

  • Seamless Kube API Integration (Operator Pattern)
  • Advanced observability (Prometheus Exporter, JSON logging)
  • Declarative Config (Deploy, Scale, Maintain)
  • Secure by default (Robust containers, mTLS, and so on)

Clusters

  • Basic resource that defines name, instances, sync and storage (other parameters have sane defaults) - see the minimal example after this list
  • Implementation: the operator creates:
    • The volumes (PGDATA and WAL, the write-ahead log)
    • Primary and read-write Service
    • Replicas
    • Read-only Service (points at the replicas)
  • Failover:
    • Failure detected
    • Stop R/W Service
    • Promote replica
    • Activate R/W Service
    • Kill the old primary and demote it to a replica
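
A minimal CNPG Cluster along those lines - three instances and a storage size, everything else left at defaults:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo
spec:
  instances: 3   # one primary, two replicas
  storage:
    size: 10Gi
```

From this single resource the operator derives the volumes, the read-write/read-only services (pg-demo-rw, pg-demo-ro) and the replicas described above.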

Backup/Recovery

  • Continuous backup: write-ahead log backup to an object store
  • Physical: create from primary or standby to an object store or Kube volumes (an on-demand backup request is sketched after this list)
  • Recovery: copy the full backup and apply WAL until the target (last transaction or a specific timestamp) is reached
  • Replica cluster: basically recreates a new cluster up to full recovery but keeps the cluster in read-only replica mode
  • Planned: backup plugin interface
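
Requesting an on-demand physical backup is itself declarative; this sketch assumes the cluster already has a configured object-store destination:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Backup
metadata:
  name: pg-demo-backup
spec:
  cluster:
    name: pg-demo   # the Cluster resource from the earlier example
```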

Multi-Cluster

  • Just create a replica cluster via WAL files from S3 on another Kube cluster (it lags about 5 minutes behind) - see the sketch after this list
  • You can also activate replication streaming
  • Dev cluster: 1 instance without a PDB but with continuous backup
  • Prod: 3 nodes with automatic failover and continuous backups
  • Symmetric: two clusters
    • Primary: 3-node cluster
    • Secondary: WAL-based 3-node cluster with a designated primary (to take over if the primary cluster fails)
  • Symmetric streaming: same as symmetric, but you manually enable the streaming API for live replication
  • Cascading replication: scale symmetric to more clusters
  • Single availability zone: well, do your best to spread across nodes, and aspire to stretch Kubernetes across more AZs
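
A sketch of such a replica cluster, bootstrapped from the primary's WAL archive - names and the S3 path are assumptions, and credentials are omitted:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo-replica        # runs on the second Kube cluster
spec:
  instances: 3
  storage:
    size: 10Gi
  replica:
    enabled: true              # stay in read-only replica mode
    source: pg-demo
  bootstrap:
    recovery:
      source: pg-demo
  externalClusters:
    - name: pg-demo
      barmanObjectStore:
        destinationPath: s3://backups/pg-demo   # the primary's WAL archive
```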

Roadmap

  • Replica Cluster (Symmetric) Switchover
  • Synchronous Symmetric
  • 3rd Party Plugins
  • Manage DBs via the Operator
  • Storage Autoscaling

Unleashing the Power of Serverless on Kubernetes with Knative, Crossplane, Dapr, KEDA, and Friends

Some quotes from the speaker to set the mood:

  • “When I say serverless I don’t mean Lambda - I mean serverless”
  • “That is thousands of lines of YAML - but I don’t want to depress you”
  • “It will be eventually done”
  • “Imagine this error is not happening”
  • “Just imagine how I did this last night”

Goal

  • Take my source code and run it, scale it - just don’t ask me how

Baseline

  • Use Kubernetes as the platform
  • Use Knative for autoscaling
  • Use Kaniko/Shipwright for building
  • Use Dapr for inter-service communication

OpenFunction

The glue between different tools to achieve serverless

  • A Function CRD that describes (hypothetical sketch after this list):
    • Build this image and push it to the registry
    • Use this builder to build my project
    • This is my repo
    • My app listens on this port
    • Annotations
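
A hypothetical manifest covering those bullets, loosely modeled on OpenFunction's Function resource - treat the field names as illustrative rather than the exact schema:

```yaml
apiVersion: core.openfunction.io/v1beta2
kind: Function
metadata:
  name: hello
spec:
  image: registry.example.com/demo/hello:v1     # build this image, push it here
  build:
    builder: openfunction/builder-go:latest     # use this builder
    srcRepo:
      url: https://github.com/example/hello.git # this is my repo
  serving:
    triggers:
      http:
        port: 8080                              # my app listens on this port
```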

Dependencies

  • Open questions
    • Where are the serverless servers? -> cluster, dependencies, secrets
    • How do I create DBs, etc.?
  • Resulting needs
    • CLUSTERaaS (using Crossplane - in this case on AWS)
    • DBaaS (using Crossplane - again PG on AWS; a claim sketch follows this list)
    • APPaaS
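
With Crossplane, the "DBaaS" need typically surfaces as a claim against a platform-defined composite type; the group, kind and parameters below are whatever the platform team defines, so treat them as invented:

```yaml
apiVersion: platform.example.com/v1alpha1
kind: PostgreSQLInstance        # claim for a platform-defined composite (XRD)
metadata:
  name: orders-db
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws             # route to the AWS-backed composition
```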

Lessons Learned from Building a Database Operator

Another talk as part of the Data On Kubernetes Day.

Baseline

Hosting Models

  • Managed: Atlas
  • Semi-managed: Cloud Manager
  • Self-hosted: Enterprise and Community Operator

MongoDB on K8s

  • Cluster Architecture
    • Control Plane: Operator
    • Data Plane: MongoDB Server + Agent (Sidecar Proxy)
  • Enterprise Operator
    • OpsManager CR: deploys a 3-node operator DB and OpsManager
    • MongoDB CR: the MongoDB clusters (composed of agents) - roughly as sketched after this list
  • Advanced use case: data platform with MongoDB on demand
    • Control plane on one cluster (or on VMs/bare metal), data plane in tenant clusters
    • Result: the MongoDB CR cannot reference the OpsManager CR directly
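
For reference, a MongoDB CR for the Enterprise Operator looks roughly like this - the project and credentials names are placeholders:

```yaml
apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: my-replica-set
spec:
  type: ReplicaSet
  members: 3
  version: "6.0.5"
  opsManager:
    configMapRef:
      name: my-project        # links the cluster to an OpsManager project
  credentials: my-credentials # secret with OpsManager API credentials
```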

Pitfalls

  • Storage: agnostic, topology-aware, configurable and resizable (can't be done with StatefulSets)
  • Networking: cluster-internal (pod-to-pod/Service), external (split horizon over multi-cluster)

To K8S and Beyond – Maturing Your Platform Engineering Initiative

Watch talk on YouTube

CNCF Platform maturity model

  • Was donated to the CNCF by Syntasso
  • Constantly evolving since 1.0 in November 2023

Overview

Entire matrix is available from CNCF

  • Levels (from tactical to strategic)
    • Provisional
    • Operational
    • Scalable
    • Optimizing
  • Dimensions:
    • Investment: How are funds/staff allocated to platform capabilities
    • Adoption: How and why do users discover this platform
    • Interfaces: How do users interact with and consume platform capabilities
    • Operations: How are platforms and capabilities planned, prioritized, developed and maintained
    • Measurement: What is the process for gathering and incorporating feedback/learning?

Goals

  • Understand
    • Outcomes & Practices
    • Where are you at
    • Limits & Opportunities
    • Behaviors and outcomes
  • Balance People and processes

Typical Journeys

Steps of the journey

  1. What are your goals and limitations
  2. What is my current landscape
  3. Plan baby steps & iterate

Scenarios

  • Bad: I want to improve my k8s platform
  • Good: scaling an enterprise COE (Center of Excellence)
    • What: onboard 20 teams within 20 months and enforce 8 security regulations
    • Where: we have a dedicated team of centrally funded people
    • Lay the foundation: more funding for more and larger teams -> switch from a project to a platform mindset
    • Do your technical due diligence in parallel

Key Lessons

  • Know what your ultimate goals and constraints are
  • Know your landscape
  • Plan in baby steps and iterate
    • Lay the foundation for building the right thing and not just anything
    • Don’t forget to do your technical due diligence in parallel

Conclusion

  • Maturity model is a helpful part but not the entire plan

What Is Going on Within My Network? A Subtle Introduction to Cilium Hubble

Held by the Cilium folks, covering eBPF and Hubble

eBPF

Extend the capabilities of the kernel without having to change the kernel source code or load kernel modules

  • Benefits: reduced performance overhead and deep visibility, while being widely available
  • Example Tools: Parca (Profiling), Cilium (Networking), Hubble (Observability), Tetragon (Security)

Cilium

Open-source solution for network connectivity between workloads

Hubble

Observability layer for Cilium

Feature set

  • CLI: tcpdump on steroids + API client
  • UI: graphical dependency and connectivity map
  • Prometheus + Grafana + OpenTelemetry compatible
  • Metrics up to L7 (a Helm values sketch for enabling all this follows this list)
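
Enabling Hubble with the relay, the UI and L7-capable metrics through the Cilium Helm chart might look like the following values fragment - the exact metric list is an assumption:

```yaml
# values.yaml fragment for the Cilium Helm chart
hubble:
  enabled: true
  relay:
    enabled: true      # aggregates flow data from all nodes
  ui:
    enabled: true      # the graphical dependency/connectivity map
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - http           # metrics up to L7
```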

Where can it be used

  • Service dependency with frequency
  • Kinds of HTTP calls
  • Network Problems between L4 and L7 (including DNS)
  • Application Monitoring through status codes and latency
  • Security-Related Network Blocks
  • Services accessed from outside the cluster

Architecture

  • Cilium Agent: runs as the CNI for all pods
  • Server: runs on each node and retrieves the eBPF data from Cilium
  • Relay: provides visibility across all nodes

TL;DR

Hubble looks pretty nice