Chapter 1
Day 1
Day one is the day for the co-located events, aka CloudNativeCon.
I spent most of the day attending Platform Engineering Day - as one might have guessed, it's all about platform engineering.
Everything started with badge pickup - a very smooth experience (but that may be related to me showing up an hour or so too early).
Talk recommendations
- Beyond Platform Thinking…
- Hitchhiker’s Guide to …
- To K8S and beyond…
Subsections of Day 1
Opening Keynotes
The first “event” of the day was - as always - the opening keynote.
Today it was presented by Red Hat and Syntasso.
Sometimes lipstick is exactly what a pig needs
Watch talk on YouTube
By VMware (of all people) - kinda funny that they chose this title with the whole Broadcom fun.
The main topic of this talk: which interface do we choose for which capability?
Personas
- Experts: Kubernetes, DB engineer
- Users: Employees that just want to do stuff
- Platform engineers: Connect Users to Services by Experts
Goal
- Create Interfaces: Connect Users to Services
- Problem: Many types of Interfaces (SaaS, GUI, CLI) with different capabilities
Dimensions
These are the dimensions of interface design proposed in the talk
- Autonomy: external dependency (low) <-> self-service (high)
- low: Ticket system -> But sometimes good for getting an expert
- high: Portal -> Nice, but sometimes we just need a human contact
- Contextual distance: stay in the same tool (low) <-> switch tools (high)
- low: IDE plugin -> High potential friction if stuff goes wrong/complex (context switch needed)
- high: Wiki or ticketing system
- Capability skill: anyone can do it (low) <-> Made for experts (high)
- low: transparent sidecar (e.g. vulnerability scanner)
- high: CLI
- Interface skill: anyone can do it (low) <-> needs specialized interface skills (high)
- low: Documentation on the web, aka wiki-style
- high: Code templates (a sample helm values.yaml or raw terraform provider)
Recap
- You can use multiple interfaces for one capability
- APIs (the proverbial pig) are the most important interface because they provide the baseline for all other interfaces
- The beautification (lipstick) of the API through other interfaces is what makes users happy
Watch talk on YouTube
The story of how Thoughtworks built YY at Ritchie Bros (RB).
Presented by the implementers at Thoughtworks (TW).
Background
- RB is an auctioneer in the field of heavy machinery
- Problem: They are old(ish) and own a bunch of other companies -> Duplicate Solutions
- Goals
- Get rid of duplicates
- Scale without the need for more personnel
- Platform is a product
- Building it is an exercise in software engineering, not operations
- Reduce dev friction
- Platform provides self-services
- Teams manage everything inside their namespace themselves
- Multiple global locations that can be opted-in and -out
Principles and Solutions
Compliance at source of change
Developers own their pipelines
- Dev teams are responsible for scanning, etc
- Platform verifies that the compliance scans have been done (through admission control)
- Examples:
- OPA + Gatekeeper for admission -> Teams use Snyk for scanning and admission checks the scan results (a hedged sketch follows below)
- Jira as admission hook for approval -> PO approves in Jira, admission only accepts if the webhook reports approval
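To make this more concrete: admission could simply require a scan-result marker on incoming workloads. A hedged sketch - the constraint kind RequireSnykScan and the annotation key are hypothetical and assume a matching ConstraintTemplate, which the talk did not show:

```yaml
# Hypothetical sketch - RequireSnykScan and the annotation key assume a matching ConstraintTemplate
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RequireSnykScan
metadata:
  name: deployments-must-be-scanned
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    annotation: security.example.com/snyk-scan-passed   # set by the team's pipeline after a successful scan
```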
- Implemented: S3 Operator, IAM Operator, DynamoDB Operator
- Reasons:
- Devs should not need access to AWS/GCP directly
- Teams have full control while not needing to submit tickets or write terraform
- Goals
- Abstract specific details away
- Make the results cloud-portable (AWS, GCP, Azure)
- Still retain developer transparency
- Example: DynamoDB Database (a hypothetical CR sketch follows after this list)
- User: Creates a DynamoDB CR and a ServiceRole CR
- K8s: Creates Pods, Secrets, Configs and a ServiceAccount (related to an IAM Role)
- User: Creates an S3 Bucket CR and assigns the ServiceRole
- K8s: Injects secrets and configs where needed
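To make the flow more concrete, here is a minimal sketch of what such a tenant-facing CR pair could look like. The group platform.rb.example/v1, the kinds and all field names are my own assumptions - the talk did not show the actual CRD schemas:

```yaml
# Hypothetical sketch - group, kinds and fields are assumptions, not RB's real CRDs
apiVersion: platform.rb.example/v1
kind: ServiceRole
metadata:
  name: checkout-service
  namespace: team-checkout
---
apiVersion: platform.rb.example/v1
kind: DynamoDBTable
metadata:
  name: orders
  namespace: team-checkout
spec:
  partitionKey: orderId          # table definition stays declarative and cloud-portable
  billingMode: PAY_PER_REQUEST
  serviceRoleRef:
    name: checkout-service       # operator wires up IAM and injects credentials as Secrets/Configs
```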
Observability
- Tool: Honeycomb
- Metrics: OpenTelemetry
- Operator reconcile steps are exposed as traces
Q&A
- Your teams are pretty autonomous -> what about more classic teams? Over a multi-year journey, every team settles on the ownership and self-service approach
- How teams get access to stages: They just get themselves a stage namespace, attach to ingress and have fun (admission handles the rest)
Watch talk on YouTube
This talk was by a New York Times software developer.
No real value
Baseline
- How do we build composable components
- Workflow of a new service: Create/Onboard -> Develop -> Build/Test/deploy (CI/CD) -> Run (Runtime/Cloud) -> Route (Ingress)
What do we need
- User documentation
- Adoption & Partnership
- Platform as a Product
- Customer feedback
Key Takeaways from Scaling Adobe's CI/CD Solution to Support >50K Argo CD Apps
Watch talk on YouTube
Part of the Multi-tenancy Con presented by Adobe
Challenges
- Spin up Edge Infra globally fast
Implementation
First try - Single Tenant Cluster
- Azure at the base - AWS on the edge
- Single Tenant Clusters (Simpler Governance)
- Responsibility is Shared between App and Platform (Monitoring, Ingress, etc.)
- Problem: Huge manual investment and over-provisioning
- Result: Access control to tenant namespaces and capacity planning -> pretty much a multi-tenant cluster with one tenant per cluster
Second Try - Micro Clusters
Third Try - Multi-tenancy
- Use a bunch of components deployed by the platform team (Ingress, CI/CD, Monitoring, …)
- Harmonized general Runtime (cloud-agnostic): Code-named Ethos -> Over 300 Clusters
- Both shared clusters (shared by namespace) and dedicated clusters
- Cluster config is a basic JSON with name, capacity, teams
- Capacity Management gets Monitored using Prometheus
- Cluster Changes should be nondestructive -> K8S-Shredder
- Cost efficiency: Use good PDBs and liveness/readiness probes alongside resource requests and limits (see the sketch below)
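To illustrate that last point, a minimal sketch of a PodDisruptionBudget plus probes and resource settings - names and numbers are made up, not Adobe's actual configuration:

```yaml
# Minimal sketch - names and values are illustrative, not Adobe's configuration
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: tenant-api-pdb
spec:
  minAvailable: 2                # keep 2 replicas up during drains and nondestructive cluster changes
  selector:
    matchLabels:
      app: tenant-api
---
apiVersion: v1
kind: Pod
metadata:
  name: tenant-api
  labels:
    app: tenant-api
spec:
  containers:
    - name: api
      image: registry.example.com/tenant-api:1.0.0
      readinessProbe:
        httpGet:
          path: /healthz
          port: 8080
      livenessProbe:
        httpGet:
          path: /livez
          port: 8080
      resources:
        requests:                # requests are the basis for capacity planning
          cpu: 100m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 256Mi
```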
Conclusion
- There is a balance between cost, customization, setup effort and security when choosing between single-tenant and multi-tenant
Lightning talks
The lightning talks are 10-minute talks by different CNCF projects.
Building containers at scale using buildpacks
A Project lightning talk by Heroku and the CNCF buildpacks.
How and why buildpacks?
- What: A simple way to build reproducible container images
- Why: Scale, Reuse, Rebase - buildpacks are structured as layers
- Dependencies, app builds and the runtime are separated -> easy updates
- How: Use the Pack CLI
pack build <image>
docker run <image>
Konveyor
A Platform for migration of legacy apps to cloud native platforms.
- Parts: Hub, Analysis (with language server), assessment
- Roadmap: Multi language support, GenAI, Asset Generation (e.g. Kube Deployments)
Argo
Pretty much a short introduction to the Argo project.
- Project Parts: Workflows (CI), Events, CD, Rollouts
- NPS: Net Promoter Score (How likely are you to recommend this) -> Everyone loves Argo (based on their survey)
- Rollouts: Can be based on Prometheus metrics (see the sketch below)
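As a rough illustration of metric-driven rollouts, a minimal Argo Rollouts AnalysisTemplate - the query, threshold and service name are my own assumptions, not from the talk:

```yaml
# Minimal sketch - query, names and threshold are illustrative assumptions
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95    # abort the rollout if the success rate drops below 95%
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{service="my-app",code!~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="my-app"}[5m]))
```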
Flux
- Components: Helm, Kustomize, Terraform, …
- Flagger now supports the Gateway API, Prometheus, Datadog and more
- New Releases
A quick look at the TAG App-Delivery
- Mission: Everything related to cloud-native application delivery
- Bi-Weekly Meetings
- Subgroup: Platforms
Watch talk on YouTube
This talk looks at bootstrapping platforms using KServe.
They do this with a focus on AI workloads.
Scenario
- Deploy AI workloads - sometimes consisting of different parts
- Models get stored in a model registry
Baseline
- Consistent APIs throughout the platform
- Not the kube API directly because:
- Data scientists are a bit overwhelmed by the kube API
- Not only Kubernetes (also monitoring tools, feedback tools, etc.)
- Better debugging experience for specific workloads
The debugging API
- Specific API with enhanced statuses and consistent UX across Code and UI
- Example Endpoints: Pods, Deployments, InferenceServices
- Provides a status summary -> consistent health info across all related resources
- Example: Deployments have progress/availability, Pods have phases, Containers have readiness -> What do we interpret how?
- Evaluation: Progressing, Available Count vs Readiness, Replicafailure, Pod Phase, Container Readiness
- The rules themselves may be pretty complex, but - since the user doesn’t have to check them themselves - the status is simple
Debugging Metrics
- Dashboards (Utilization, throughput, latency)
- Events
- Logs
Deployment API
- Launchpad: Just select your model and version -> the DB (dock) stores all manifests (spaceship)
- Manifests relate to models from a model registry (a minimal InferenceService sketch follows after this list)
- Multi-tenancy is implemented using k8s namespaces
- Kine is used to replace/extend etcd with the relational dock DB -> the namespace <-> manifests relation is stored here and RBAC can be used
- Launchpad: Select a namespace and check resource (fuel) availability/utilization
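Since the deployed manifests are KServe InferenceServices, here is a minimal sketch of one - model name, format and storage URI are placeholders I chose, not the platform's actual values:

```yaml
# Minimal sketch - name, model format and storageUri are illustrative placeholders
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sentiment-model
  namespace: team-nlp
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://model-registry/sentiment/v3   # points at a version from the model registry
```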
Cluster maintenance
- Deployments can be launched to multiple clusters (even two clusters at once) -> HA through identical clusters
- The exact same manifests get deployed to two clusters
- Cluster desired state is stored externally to enable effortless upgrades, rescaling, etc.
Versioning API
- Basically the dock DB
- CRDs are the representations of the inference manifests
- Rollbacks, promotion and history are managed via the CRs
- Why not GitOps: Internal Diffs, deployment overrides, customized features
UX
- User driven API design
- Customized tools
- Everything gets 1:1 replicated for HA
- Large onboarding guide
From Zero to Hero: Scaling Postgres in Kubernetes Using the Power of CloudNativePG
A short talk as part of the Data on Kubernetes Day - presented by the VP of Cloud Native at EDB (one of the biggest PG contributors)
Stated target: Make the world your single point of failure
Proposal
- Get rid of vendor lock-in using the OSS projects PG, K8s and CnPG
- PG was the DB of the year 2023 and a bunch of other times in the past
- CnPG is a Level 5 mature operator
4 Pillars
- Seamless Kube API Integration (Operator Pattern)
- Advanced observability (Prometheus Exporter, JSON logging)
- Declarative Config (Deploy, Scale, Maintain)
- Secure by default (Robust containers, mTLS, and so on)
Clusters
- Basic resource that defines name, instances, sync settings and storage (other parameters have sane defaults) - see the minimal sketch after this list
- Implementation: Operator creates:
- The volumes (PGDATA, WAL (write-ahead log))
- Primary and Read-Write Service
- Replicas
- Read-Only Service (points at replicas)
- Failover:
- Failure detected
- Stop R/W Service
- Promote Replica
- Activate R/W Service
- Kill old primary and demote to replica
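As referenced above, a minimal Cluster manifest could look roughly like this - name, instance count and storage sizes are placeholders of mine:

```yaml
# Minimal sketch - name, instance count and sizes are illustrative
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo
spec:
  instances: 3          # 1 primary + 2 replicas, automatic failover handled by the operator
  storage:
    size: 20Gi          # PGDATA volume
  walStorage:
    size: 5Gi           # separate WAL volume
```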
Backup/Recovery
- Continuous backup: write-ahead log backup to an object store (a minimal ScheduledBackup sketch follows after this list)
- Physical: Create from primary or standby, to an object store or Kube volumes
- Recovery: Copy the full backup and apply WAL until the target (last transaction or a specific timestamp) is reached
- Replica cluster: Basically a new cluster built via a full recovery, but kept in read-only replica mode
- Planned: Backup Plugin Interface
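For the continuous backup, the Cluster gets a backup section pointing at an object store and backups can then be scheduled declaratively. A rough sketch - bucket, secret names and schedule are my own placeholders:

```yaml
# Rough sketch - assumes pg-demo has spec.backup.barmanObjectStore configured; values are placeholders
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: pg-demo-daily
spec:
  schedule: "0 0 2 * * *"   # CNPG uses a six-field cron expression (seconds first)
  cluster:
    name: pg-demo
```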
Multi-Cluster
- Just create a replica cluster via WAL files from S3 on another Kube cluster (lags about 5 minutes behind) - see the sketch below
- You can also activate replication streaming
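A replica cluster on the second Kubernetes cluster could be declared roughly like this - bucket, secret and names are placeholders, and the exact fields are worth double-checking against the CloudNativePG docs:

```yaml
# Rough sketch - object store path, secrets and names are illustrative placeholders
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: pg-demo-dr
spec:
  instances: 3
  replica:
    enabled: true            # keep this cluster in read-only replica mode
    source: pg-demo
  bootstrap:
    recovery:
      source: pg-demo        # bootstrap by replaying the WAL files from the object store
  externalClusters:
    - name: pg-demo
      barmanObjectStore:
        destinationPath: s3://pg-backups/pg-demo
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: SECRET_ACCESS_KEY
```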
Recommended architecture
- Dev Cluster: 1 Instance without PDB and with Continuous backup
- Prod: 3 Nodes with automatic failover and continuous backups
- Symmetric: Two clusters
- Primary: 3-Node Cluster
- Secondary: WAL based 3-Node Cluster with a designated primary (to take over if primary cluster fails)
- Symmetric Streaming: Same as Secondary, but you manually enable the streaming API for live replication
- Cascading Replication: Scale Symmetric to more clusters
- Single availability zone: Well, do your best to spread across nodes, and aspire to stretching Kubernetes across more AZs
Roadmap
- Replica Cluster (Symmetric) Switchover
- Synchronous Symmetric
- 3rd Party Plugins
- Manage DBs via the Operator
- Storage Autoscaling
Unleashing the Power of Serverless on Kubernetes with Knative, Crossplane, Dapr, KEDA, and Friends
When I say serverless I don’t mean lambda - I mean serverless
That is thousands of lines of YAML - but I don’t want to depress you
It will be eventually done
Imagine this error is not happening
Just imagine how I did this last night
Goal
- Take my source code and run it, scale it - just don’t ask me
Baseline
- Use Kubernetes for platform
- Use Knative for autoscaling
- Use Kaniko/Shipwright for building
- Use Dapr for inter-service communication
OpenFunction
The glue between the different tools to achieve serverless - a rough sketch of the Function CR follows after the list below.
- CRD that describes:
- Build this image and push it to the registry
- Use this builder to build my project
- This is my repo
- My App listens on this port
- Annotations
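A rough sketch of what such a Function resource looks like - I am writing the fields from memory, so the API version and exact field names are assumptions and should be checked against the OpenFunction docs:

```yaml
# Rough sketch from memory - apiVersion and fields may differ from the current OpenFunction API
apiVersion: core.openfunction.io/v1beta1
kind: Function
metadata:
  name: hello-world
spec:
  image: registry.example.com/hello-world:v1   # build this image and push it to the registry
  imageCredentials:
    name: push-secret
  build:
    builder: openfunction/builder-go:latest    # use this builder to build my project
    srcRepo:
      url: https://github.com/example/hello-world.git   # this is my repo
  port: 8080                                   # my app listens on this port
  serving:
    runtime: knative                           # scale (to zero) via Knative
```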
Dependencies
- Open Questions
- Where are the serverless servers -> Cluster, dependencies, secrets
- How do I create DBs, etc.
- Resulting needs
- CLUSTERaaS (using Crossplane - in this case on AWS)
- DBaaS (using Crossplane - again PG on AWS; see the claim sketch below)
- APPaaS
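With Crossplane, the DBaaS part is typically exposed to developers as a claim against a platform-defined composite resource. The group, kind and parameters below follow the classic Crossplane getting-started example and are assumptions, not what the speaker actually showed:

```yaml
# Illustrative claim - group, kind and parameters are assumptions based on the Crossplane docs example
apiVersion: database.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
  writeConnectionSecretToRef:
    name: orders-db-conn       # the app reads DB credentials from this Secret
```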
Lessons Learned from Building a Database Operator
Another talk as part of the Data On Kubernetes Day.
Baseline
Hosting Models
- Managed: Atlas
- Semi-managed: Cloud Manager
- Self-hosted: Enterprise and community operator
MongoDB on K8s
- Cluster Architecture
- Control Plane: Operator
- Data Plane: MongoDB Server + Agent (Sidecar Proxy)
- Enterprise Operator
- OpsManager CR: Deploys 3-node operator DB and OpsManager
- MongoDB CR: The MongoDB clusters (comprised of agents) - see the rough sketch after this list
- Advanced use case: Data Platform with MongoDB on demand
- Control Plane on one cluster (or on VMs/Bare-metal), data plane in tenant clusters
- Result: The MongoDB CR cannot reference the OpsManager CR directly
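For orientation, a replica set managed by the enterprise operator is declared roughly like this - I am recalling the fields from memory, so the project ConfigMap/credentials wiring is an assumption to verify against the MongoDB operator docs:

```yaml
# Rough sketch from memory - verify field names against the MongoDB Enterprise Operator docs
apiVersion: mongodb.com/v1
kind: MongoDB
metadata:
  name: orders-rs
  namespace: team-orders
spec:
  type: ReplicaSet
  members: 3
  version: "6.0.5"
  opsManager:
    configMapRef:
      name: orders-project             # points the agents at the Ops Manager project
  credentials: orders-om-credentials   # Secret with Ops Manager API credentials
```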
Pitfalls
- Storage: Agnostic, topology-aware, configurable and resizable (resizing can't be done with a StatefulSet)
- Networking: Cluster-internal (Pod to Pod/Service), External (Split horizon over multicluster)
Watch talk on YouTube
- Was donated to the CNCF by Syntasso
- Constantly evolving since 1.0 in November 2023
Overview
Entire matrix is available from CNCF
- Levels (from tactical to strategic)
- Provisional
- Operational
- Scalable
- Optimizing
- Dimensions:
- Investment: How are funds/staff allocated to platform capabilities
- Adoption: How and why do users discover this platform
- Interfaces: How do users interact with and consume platform capabilities
- Operations: How are platforms and capabilities planned, prioritized, developed and maintained
- Measurement: What is the process for gathering and incorporating feedback/learning?
Goals
- Understand
- Outcomes & Practices
- Where are you at
- Limits & Opportunities
- Behaviors and outcome
- Balance People and processes
Typical Journeys
Steps of the journey
- What are your goals and limitations
- What is my current landscape
- Plan baby steps & iterate
Scenarios
- Bad: I want to improve my k8s platform
- Good: Scaling an enterprise COE (Center Of Excellence)
- What: Onboard 20 Teams within 20 Months and enforce 8 security regulations
- Where: We have a dedicated team of centrally funded people
- Lay the foundation: More funding for more and larger teams -> switch from a project to a platform mindset
- Do your technical Due diligence in parallel
Key Lessons
- Know what your ultimate goals and constraints are
- Know your landscape
- Plan in baby steps and iterate
- Lay the foundation for building the right thing and not just anything
- Don’t forget to do your technical due diligence in parallel
Conclusion
- Maturity model is a helpful part but not the entire plan
What Is Going on Within My Network? A Subtle Introduction to Cilium Hubble
Held by Cilium regarding eBPF and Hubble
eBPF
Extend the capabilities of the kernel without changing the kernel source code or loading kernel modules
- Benefits: Reduce performance overhead, gain deep visibility while being widely available
- Example Tools: Parca (Profiling), Cilium (Networking), Hubble (Observability), Tetragon (Security)
Cilium
Open source Solution for network connectivity between workloads
Hubble
The observability layer for Cilium
Feature set
- CLI: tcpdump on steroids + API client
- UI: Graphical dependency and connectivity map
- Prometheus + Grafana + OpenTelemetry compatible
- Metrics up to L7
Where can it be used
- Service dependency with frequency
- Kinds of HTTP calls
- Network Problems between L4 and L7 (including DNS)
- Application Monitoring through status codes and latency
- Security-Related Network Blocks
- Services accessed from outside the cluster
Architecture
- Cilium Agent: Runs as the CNI for all Pods
- Server: Runs on each node and retrieves the flow data from Cilium's eBPF layer
- Relay: Provides visibility across all nodes (see the Helm values sketch below)
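To actually get the server, Relay, UI and metrics, Hubble is usually enabled through the Cilium Helm chart. A minimal values sketch - the keys are from memory and worth double-checking against the chart documentation:

```yaml
# Minimal Helm values sketch for the cilium chart - keys recalled from memory
hubble:
  enabled: true
  relay:
    enabled: true        # cluster-wide visibility via the relay
  ui:
    enabled: true        # graphical dependency / connectivity map
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - http             # L7 metrics, scrapeable by Prometheus
```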
TL;DR
Hubble looks pretty nice