TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers
A talk by TikTok/ByteDance (duh) focused on running central controllers in the control plane instead of per-cluster controllers on the edge.
Background
Global here means non-China
- Edge platform team for CDN, livestreaming, uploads, real-time communication, etc.
- Around 250 clusters with 10-600 nodes each - mostly non-cloud, i.e. bare metal
- Architecture: control plane clusters (platform services) and data plane clusters (workloads run by other teams)
- Platform includes logs, metrics, configs, secrets, …
Challenges
Operators
- Operators are essential for platform features
- As the feature requests increase, more operators are needed
- The deployment of operators throughout many clusters is complex (namespace, deployments, policies, …)
Edge
- Limited resources
- Cost implication of platform features
- Real-time processing demands from platform features
- Balancing act between resources used by workloads vs. platform features (20-25%)
The classic flow
- New feature gets requested
- Use kubebuilder with the SDK to create the operator (see the sketch after this list)
- Create namespaces and configs in all clusters
- Deploy operator to all clusters
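For reference, a minimal sketch of what such a per-cluster operator scaffold typically looks like with kubebuilder/controller-runtime; the ConfigSyncReconciler name and the ConfigMap resource are placeholders, not from the talk:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ConfigSyncReconciler is a placeholder operator reconciling ConfigMaps in a single cluster.
type ConfigSyncReconciler struct {
	client.Client
}

func (r *ConfigSyncReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cm corev1.ConfigMap
	if err := r.Get(ctx, req.NamespacedName, &cm); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	ctrl.LoggerFrom(ctx).Info("reconciling", "configmap", req.NamespacedName)
	// ... apply the platform feature for this object ...
	return ctrl.Result{}, nil
}

func main() {
	// The operator is scoped to the one cluster its kubeconfig points at,
	// so this binary has to be deployed to every edge cluster separately.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ConfigSyncReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```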
Possible Solution
Centralized Control Plane
- Problem: The controller implementation is limited to a cluster boundary
- Idea: Why not create a single operator that can manage multiple edge clusters
- Implementation: Just modify kubebuilder to accept multiple clients (and caches) - see the sketch after this list
- Result: It works -> Simpler deployment and troubleshooting
- Concerns: High code complexity -> long familiarization time
- The balance between a “simple central operator” and operator complexity is hard to strike
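A minimal sketch of that idea, assuming the per-cluster kubeconfigs are available to the central operator (e.g. from Secrets); MultiClusterReconciler, EnsureNamespace and newClientFromKubeconfig are hypothetical names, not the talk's actual code:

```go
package central

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/tools/clientcmd"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// MultiClusterReconciler is a hypothetical central operator that holds one
// client (and cache) per edge cluster instead of being bound to a single cluster.
type MultiClusterReconciler struct {
	Clients map[string]client.Client // keyed by cluster name
}

// newClientFromKubeconfig builds a controller-runtime client from raw kubeconfig
// bytes, e.g. read from a per-cluster Secret in the control plane cluster.
func newClientFromKubeconfig(kubeconfig []byte) (client.Client, error) {
	cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	return client.New(cfg, client.Options{})
}

// EnsureNamespace applies the same platform namespace to every registered cluster,
// the kind of fan-out that previously required deploying the operator everywhere.
func (r *MultiClusterReconciler) EnsureNamespace(ctx context.Context, name string) error {
	for cluster, c := range r.Clients {
		ns := &corev1.Namespace{}
		ns.Name = name
		if err := c.Create(ctx, ns); err != nil && !apierrors.IsAlreadyExists(err) {
			return fmt.Errorf("cluster %s: %w", cluster, err)
		}
	}
	return nil
}
```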
Attempting it a bit more like kubebuilder
- Each cluster has its own manager
- There is a central multi-manager that starts all the cluster-specific managers
- Controller registration to the manager now handles cluster names
- The reconciler knows which cluster it is working on
- The multi-cluster management basically just watches all the cluster secrets and creates a manager + controller for each cluster secret (see the sketch after this list)
- Challenges: Network connectivity
- Solutions:
- Dynamic add/remove of clusters via go channels to avoid pod restarts (see the lifecycle sketch after the flowcharts below)
- Connectivity health checks -> on connectivity loss, recreating that cluster's manager is triggered
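A minimal sketch of that per-cluster-manager pattern, assuming each cluster secret carries a kubeconfig; ClusterReconciler and startClusterManager are illustrative names rather than the talk's actual code:

```go
package multimanager

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/clientcmd"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// ClusterReconciler reconciles objects in exactly one edge cluster and
// carries the cluster name so logs and metrics can be attributed correctly.
type ClusterReconciler struct {
	client.Client
	ClusterName string
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// The reconciler always knows which cluster it is working on.
	ctrl.LoggerFrom(ctx).Info("reconciling", "cluster", r.ClusterName, "object", req.NamespacedName)
	return ctrl.Result{}, nil
}

// startClusterManager builds a manager + controller for one cluster secret
// (assumed to contain a kubeconfig) and starts it in its own goroutine,
// returning a cancel func that stops the manager again.
func startClusterManager(ctx context.Context, clusterName string, kubeconfig []byte) (context.CancelFunc, error) {
	cfg, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	// Note: with several managers in one process, metrics/health endpoints
	// need unique bind addresses (or must be disabled) to avoid port clashes.
	mgr, err := manager.New(cfg, manager.Options{})
	if err != nil {
		return nil, err
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&corev1.ConfigMap{}).
		Complete(&ClusterReconciler{Client: mgr.GetClient(), ClusterName: clusterName}); err != nil {
		return nil, err
	}
	mgrCtx, cancel := context.WithCancel(ctx)
	go func() {
		if err := mgr.Start(mgrCtx); err != nil {
			ctrl.Log.Error(err, "manager stopped", "cluster", clusterName)
		}
	}()
	return cancel, nil
}
```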
flowchart TD
    mcm --> m1
    mcm --> m2
    mcm --> m3
flowchart LR
    secrets --> ch(go channels)
    ch -->|CREATE| create(Create manager + Add controller + Start manager)
    ch -->|UPDATE| update(Stop manager + Create manager + Add controller + Start manager)
    ch -->|DELETE| delete(Stop manager)
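And a sketch of the dynamic lifecycle matching the CREATE/UPDATE/DELETE flow above; ClusterEvent and runLifecycle are hypothetical, and the start callback stands in for the "create manager + add controller + start manager" step from the previous sketch:

```go
package multimanager

import "context"

// ClusterEvent mirrors the events derived from watching the per-cluster
// kubeconfig secrets in the control plane cluster (hypothetical type).
type ClusterEvent struct {
	Type       string // "CREATE", "UPDATE" or "DELETE"
	Cluster    string
	Kubeconfig []byte
}

// runLifecycle consumes cluster events from a go channel and starts/stops
// per-cluster managers without restarting the operator pod.
func runLifecycle(ctx context.Context, events <-chan ClusterEvent,
	start func(ctx context.Context, cluster string, kubeconfig []byte) (context.CancelFunc, error)) {

	stops := map[string]context.CancelFunc{} // one stop handle per running manager

	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-events:
			switch ev.Type {
			case "CREATE":
				// Create manager + add controller + start manager.
				if cancel, err := start(ctx, ev.Cluster, ev.Kubeconfig); err == nil {
					stops[ev.Cluster] = cancel
				}
			case "UPDATE":
				// Stop manager, then create + add controller + start it again.
				if cancel, ok := stops[ev.Cluster]; ok {
					cancel()
				}
				if cancel, err := start(ctx, ev.Cluster, ev.Kubeconfig); err == nil {
					stops[ev.Cluster] = cancel
				}
			case "DELETE":
				// Stop manager.
				if cancel, ok := stops[ev.Cluster]; ok {
					cancel()
					delete(stops, ev.Cluster)
				}
			}
		}
	}
}
```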
Conclusion
- Acknowledge resource constraints on edge
- Embrace open source adoption instead of building your own
- Simplify deployment
- Recognize your own opinionated approach and its use cases