TikTok’s Edge Symphony: Scaling Beyond Boundaries with Multi-Cluster Controllers

A talk by TikTok/ByteDance (duh) focused on running controllers centrally instead of on the edge.

Background

“Global” here means non-China

  • Edge platform team for CDN, livestreaming, uploads, real-time communication, etc.
  • Around 250 clusters with 10-600 nodes each - mostly non-cloud, aka bare metal
  • Architecture: Control plane clusters (platform services) - data plane clusters (workload by other teams)
  • Platform includes logs, metrics, configs, secrets, …

Challenges

Operators

  • Operators are essential for platform features
  • As the feature requests increase, more operators are needed
  • The deployment of operators throughout many clusters is complex (namespace, deployments, policies, …)

Edge

  • Limited resources
  • Cost implication of platform features
  • Real time processing demands by platform features
  • Balancing act between resources used by workload vs platform features (20-25%)

The classic flow

  1. New feature gets requested
  2. Use kubebuilder with the SDK to create the operator
  3. Create namespaces and configs in all clusters
  4. Deploy operator to all clusters

Possible Solution

Centralized Control Plane

  • Problem: The controller implementation is limited to a cluster boundary
  • Idea: Why not create a single operator that can manage multiple edge clusters?
  • Implementation: Just modify kubebuilder to accept multiple clients (and caches)
  • Result: It works -> Simpler deployment and troubleshooting
  • Concerns: High code complexity -> Long familiarization
  • Balance between “simple central operator” and operator-complexity is hard
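
The "single operator with multiple clients" idea can be sketched roughly as follows. This is not the code from the talk and does not use the real kubebuilder or controller-runtime API: `ClusterClient`, `fakeClient`, `Request` and `CentralReconciler` are hypothetical stand-ins that only illustrate how one reconciler could pick the right client based on a cluster name carried in the request.

```go
package main

import "fmt"

// ClusterClient is a hypothetical stand-in for a per-cluster API client
// (in a real operator this would be a Kubernetes client plus its cache).
type ClusterClient interface {
	Apply(name string) error
}

// fakeClient records applied object names in memory instead of calling an API server.
type fakeClient struct{ applied []string }

func (f *fakeClient) Apply(name string) error {
	f.applied = append(f.applied, name)
	return nil
}

// Request identifies an object together with the cluster it lives in (assumed shape).
type Request struct {
	Cluster string
	Name    string
}

// CentralReconciler holds one client per edge cluster, keyed by cluster name.
type CentralReconciler struct {
	clients map[string]ClusterClient
}

// Reconcile looks up the client for the request's cluster and applies the object there.
func (r *CentralReconciler) Reconcile(req Request) error {
	c, ok := r.clients[req.Cluster]
	if !ok {
		return fmt.Errorf("unknown cluster %q", req.Cluster)
	}
	return c.Apply(req.Name)
}

func main() {
	a, b := &fakeClient{}, &fakeClient{}
	r := &CentralReconciler{clients: map[string]ClusterClient{"edge-a": a, "edge-b": b}}

	for _, req := range []Request{
		{Cluster: "edge-a", Name: "log-config"},
		{Cluster: "edge-b", Name: "log-config"},
		{Cluster: "edge-c", Name: "log-config"}, // not registered, so this returns an error
	} {
		if err := r.Reconcile(req); err != nil {
			fmt.Println("reconcile failed:", err)
		}
	}
	fmt.Println("edge-a:", a.applied, "edge-b:", b.applied)
}
```

Even in this toy form, every code path has to carry the cluster name around, which hints at where the code complexity mentioned above comes from.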

Second attempt: a bit more like kubebuilder

  • Each cluster has its own manager
  • There is a central multimanager that starts all the cluster-specific managers
  • Controller registration to the manager now handles cluster names
  • The reconciler knows which cluster it is working on
  • The multi-cluster management basically just tests all the cluster secrets and creates a manager + controller for each cluster secret
  • Challenges: Network connectivity
  • Solutions:
    • Dynamic add/remove of clusters with go channels to prevent pod restarts
    • Connectivity health checks -> On connectivity loss, the manager gets recreated

flowchart TD
    mcm-->m1
    mcm-->m2
    mcm-->m3

flowchart LR
    secrets-->ch(go channels)
    ch-->|CREATE|create(Create manager + Add controller + Start manager)
    ch-->|UPDATE|update(Stop manager + Create manager + Add controller + Start manager)
    ch-->|DELETE|delete(Stop manager)

Conclusion

  • Acknowledge resource constraints on edge
  • Embrace open source adoption instead of building your own
  • Simplify deployment
  • Recognize your own opinionated approach and its use cases