TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers
A talk by TikTok/ByteDance (duh) focused on running central controllers in the control plane instead of per-cluster controllers on the edge.
Background
Global here means non-China
- Edge platform team for CDN, livestreaming, uploads, real-time communication, etc.
- Around 250 clusters with 10-600 nodes each - mostly non-cloud, i.e. bare metal
- Architecture: control plane clusters (platform services) and data plane clusters (workloads run by other teams)
- Platform includes logs, metrics, configs, secrets, …
Challenges
Operators
- Operators are essential for platform features
- As the feature requests increase, more operators are needed
- The deployment of operators throughout many clusters is complex (namespace, deployments, policies, …)
Edge
- Limited resources
- Cost implication of platform features
- Real-time processing demands from platform features
- Balancing act between resources used by workloads vs. platform features (20-25%)
The classic flow
- New feature gets requested
- Use kubebuilder with the SDK to create the operator (see the sketch after this list)
- Create namespaces and configs in all clusters
- Deploy operator to all clusters
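For reference, a minimal sketch of what such a per-cluster operator scaffold typically looks like with kubebuilder/controller-runtime; the ConfigSyncReconciler name and the ConfigMap resource are placeholders, not from the talk:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ConfigSyncReconciler is a placeholder operator reconciling ConfigMaps in a single cluster.
type ConfigSyncReconciler struct {
	client.Client
}

func (r *ConfigSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cm corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cm); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	ctrl.LoggerFrom(ctx).Info("reconciling", "configmap", req.NamespacedName)
	// ... apply the platform feature for this object ...
	return ctrl.Result{}, nil
}

func main() {
	// The operator is scoped to the one cluster its kubeconfig points at,
	// so this binary has to be deployed to every edge cluster separately.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ConfigSyncReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```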
Possible Solution
Centralized Control Plane
- Problem: The controller implementation is limited to a cluster boundary
- Idea: Why not create a single operator that can manage multiple edge clusters
- Implementation: Just modify kubebuilder to accept multiple clients (and caches) - see the sketch after this list
- Result: It works -> Simpler deployment and troubleshooting
- Concerns: High code complexity -> long familiarization time
- The balance between a “simple central operator” and operator complexity is hard to strike
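A minimal sketch of that idea, assuming the per-cluster kubeconfigs are available to the central operator (e.g. from Secrets); MultiClusterReconciler, EnsureNamespace and newClientFromKubeconfig are hypothetical names, not the talk's actual code:

```go
package central

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// MultiClusterReconciler is a hypothetical central operator that holds one
// client (and cache) per edge cluster instead of being bound to a single cluster.
type MultiClusterReconciler struct {
	Clients map[string]client.Client // keyed by cluster name
}

// newClientFromKubeconfig builds a controller-runtime client from raw kubeconfig
// bytes, e.g. read from a per-cluster Secret in the control plane cluster.
func newClientFromKubeconfig(kubeconfig []byte) (client.Client, error) {
	cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{})
}

// EnsureNamespace applies the same platform namespace to every registered cluster,
// the kind of fan-out that previously required deploying the operator everywhere.
func (r *MultiClusterReconciler) EnsureNamespace(ctx context.Context, name string) error {
	for cluster, c := range r.Clients {
		ns := &corev1.Namespace{}
		ns.Name = name
		if err := c.Create(ctx, ns); err != nil && !apierrors.IsAlreadyExists(err) {
			return fmt.Errorf("cluster %s: %w", cluster, err)
		}
	}
	return nil
}
```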
Attempting it a bit more like kubebuilder
- Each cluster has its own manager
- There is a central multi-manager that starts all the cluster-specific managers
- Controller registration to the manager now handles cluster names
- The reconciler knows which cluster it is working on
- The multi-cluster management basically just watches all the cluster secrets and creates a manager + controller for each cluster secret (see the sketch after this list)
- Challenges: Network connectivity
- Solutions:
- Dynamic add/remove of clusters via go channels to avoid pod restarts (see the lifecycle sketch after the flowcharts below)
- Connectivity health checks -> on connectivity loss, recreating that cluster's manager is triggered
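A minimal sketch of that per-cluster-manager pattern, assuming each cluster secret carries a kubeconfig; ClusterReconciler and startClusterManager are illustrative names rather than the talk's actual code:

```go
package multimanager

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/clientcmd"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// ClusterReconciler reconciles objects in exactly one edge cluster and
// carries the cluster name so logs and metrics can be attributed correctly.
type ClusterReconciler struct {
	client.Client
	ClusterName string
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The reconciler always knows which cluster it is working on.
	ctrl.LoggerFrom(ctx).Info("reconciling", "cluster", r.ClusterName, "object", req.NamespacedName)
	return ctrl.Result{}, nil
}

// startClusterManager builds a manager + controller for one cluster secret
// (assumed to contain a kubeconfig) and starts it in its own goroutine,
// returning a cancel func that stops the manager again.
func startClusterManager(ctx context.Context, clusterName string, kubeconfig []byte) (context.CancelFunc, error) {
	cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	// Note: with several managers in one process, metrics/health endpoints
	// need unique bind addresses (or must be disabled) to avoid port clashes.
	mgr, err := manager.New(cfg, manager.Options{})
	if err != nil {
		return nil, err
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ClusterReconciler{Client: mgr.GetClient(), ClusterName: clusterName}); err != nil {
		return nil, err
	}
	mgrCtx, cancel := context.WithCancel(ctx)
	go func() {
		if err := mgr.Start(mgrCtx); err != nil {
			ctrl.Log.Error(err, "manager stopped", "cluster", clusterName)
		}
	}()
	return cancel, nil
}
```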
flowchart TD
    mcm --> m1
    mcm --> m2
    mcm --> m3
flowchart LR
    secrets --> ch(go channels)
    ch -->|CREATE| create(Create manager + Add controller + Start manager)
    ch -->|UPDATE| update(Stop manager + Create manager + Add controller + Start manager)
    ch -->|DELETE| delete(Stop manager)
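And a sketch of the dynamic lifecycle matching the CREATE/UPDATE/DELETE flow above; ClusterEvent and runLifecycle are hypothetical, and the start callback stands in for the "create manager + add controller + start manager" step from the previous sketch:

```go
package multimanager

import "context"

// ClusterEvent mirrors the events derived from watching the per-cluster
// kubeconfig secrets in the control plane cluster (hypothetical type).
type ClusterEvent struct {
	Type       string // "CREATE", "UPDATE" or "DELETE"
	Cluster    string
	Kubeconfig []byte
}

// runLifecycle consumes cluster events from a go channel and starts/stops
// per-cluster managers without restarting the operator pod.
func runLifecycle(ctx context.Context, events <-chan ClusterEvent,
	start func(ctx context.Context, cluster string, kubeconfig []byte) (context.CancelFunc, error)) {

	stops := map[string]context.CancelFunc{} // one stop handle per running manager

	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			switch ev.Type {
			case "CREATE":
				// Create manager + add controller + start manager.
				if cancel, err := start(ctx, ev.Cluster, ev.Kubeconfig); err == nil {
					stops[ev.Cluster] = cancel
				}
			case "UPDATE":
				// Stop manager, then create + add controller + start it again.
				if cancel, ok := stops[ev.Cluster]; ok {
					cancel()
				}
				if cancel, err := start(ctx, ev.Cluster, ev.Kubeconfig); err == nil {
					stops[ev.Cluster] = cancel
				}
			case "DELETE":
				// Stop manager.
				if cancel, ok := stops[ev.Cluster]; ok {
					cancel()
					delete(stops, ev.Cluster)
				}
			}
		}
	}
}
```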
Conclusion
- Acknowledge resource constraints on edge
- Embrace open source adoption instead of building your own
- Simplify deployment
- Recognize your own opinionated approach and its use cases