Kubernetes resource topology-aware scheduling optimization

Tencent cloud native2022-06-23 20:19:16


Xingchen (Star) Computing team. The Xingchen computing platform is built on deep optimization of cloud-native unified access and multi-cloud scheduling, with hardened container runtime isolation, to extract incremental value from the technology. The platform hosts Tencent's internal CPU and heterogeneous computing services and is Tencent's unified, large-scale resource scheduling platform for offline jobs.


The origin of the problem

In recent years, with the continued growth of Tencent's internal self-developed cloud projects, more and more businesses host their workloads in a cloud-native way, so the scale of the container platform keeps increasing. Cloud-native technology built on Kubernetes has greatly advanced the field and become the de facto technical standard for major container platforms. In cloud-native scenarios, to maximize resource sharing, a single host often runs computing tasks from multiple different users. Without fine-grained resource isolation on the host, containers compete fiercely for resources during load peaks, which can cause a sharp decline in program performance, mainly seen as:

  1. Frequent context switches caused by resource scheduling
  2. CPU cache invalidation caused by frequent process switches

Therefore, container resource allocation in cloud-native scenarios needs to be fine-grained, keeping CPU utilization high without the fierce inter-container competition that degrades performance.

Scheduling scenarios

Tencent's Xingchen computing platform hosts the company's CPU and GPU computing services and manages a large pool of different types of computing resources. Most of the key workloads it carries today are offline jobs. As the business demand for computing power grows, the platform must provide a steady stream of low-cost resources while continuously improving availability, service quality, and scheduling capability to cover more scenarios. However, Kubernetes' native scheduling and resource-binding features can no longer satisfy these complex computing scenarios; finer-grained resource scheduling is urgently needed, mainly in the following respects:

  1. The native Kubernetes scheduler cannot perceive node resource topology, causing Pod startup failures

kube-scheduler is unaware of a node's resource topology during scheduling. After kube-scheduler places a Pod on a node, kubelet rejects the Pod if it finds that the node cannot satisfy the Pod's resource topology affinity requirements. When the Pod is deployed through an external controller (such as a Deployment), this leads to an endless loop of create → schedule → admission failure.

  2. The actually available CPU core count of a node changes, due to the colocation scheme based on offline virtual machines

Because the average utilization of virtual machines running online services is low, offline VMs can be colocated with online VMs to make full use of idle resources, serve the company's offline computing needs, and raise the average utilization of self-developed cloud resources. Backed by Tencent's self-developed kernel scheduler VMF, Xingchen can fully use a machine's idle time slices to run low-priority offline virtual machines without the offline services interfering with the online ones. Because of VMF's unfair scheduling policy, the number of cores actually available to an offline VM is affected by the online VMs and shrinks as the online business gets busier. Therefore, the CPU core count that kubelet collects via cAdvisor on the offline host is inaccurate, and the scheduling information deviates from reality.

  3. Efficient resource utilization requires finer scheduling granularity

kube-scheduler's job is to pick a suitable Node for a Pod and complete one round of scheduling. However, to use resources more efficiently, the native scheduler's capabilities are far from enough. We want the scheduler to schedule at a finer granularity, for example to be aware of CPU cores, GPU topology, and network topology, so that resource utilization is more reasonable.

Preliminaries

cgroups and the cpuset subsystem

cgroups is a mechanism provided by the Linux kernel to limit the resources used by a single process or a group of processes; it gives fine-grained control over CPU, memory, and other resources. Container technology on Linux relies mainly on cgroups for resource control.

In cgroups, the cpuset subsystem assigns independent CPUs and memory nodes to the processes in a cgroup. By writing CPU core numbers into the subsystem's cpuset.cpus file, or NUMA memory node numbers into cpuset.mems, a process or group of processes can be restricted to specific CPUs or memory nodes.

Fortunately, when constraining container resources we do not need to manipulate the cpuset subsystem by hand. The Container Runtime Interface (CRI) exposes a call that updates a container's resource limits directly.

```go
// ContainerManager contains methods to manipulate containers managed by a
// container runtime. The methods are thread-safe.
type ContainerManager interface {
	// ......
	// UpdateContainerResources updates the cgroup resources for the container.
	UpdateContainerResources(containerID string, resources *runtimeapi.LinuxContainerResources) error
	// ......
}
```

The NUMA architecture

Non-uniform memory access (NUMA) is a memory architecture for multiprocessor computers in which memory access time depends on the location of the memory relative to the processor. Under NUMA, a processor accesses its own local memory faster than non-local memory (memory attached to another processor, or memory shared between processors). Most modern multi-core servers use the NUMA architecture to improve hardware scalability.

As the diagram shows, each NUMA node has its own CPU cores, L3 cache, and memory, and the NUMA nodes are interconnected. CPUs on the same NUMA node share the L3 cache and access that node's memory faster; accessing memory across NUMA nodes is slower. Therefore, CPU-intensive applications should be assigned cores from the same NUMA node so that the program gets the full benefit of locality.
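The locality argument above can be sketched as a tiny placement helper: given the free cores per NUMA node, prefer a node that can hold the whole request (the tightest fit, so larger holes are kept for larger tasks), and span nodes only when nothing fits. This is an illustrative sketch, not cassini's allocator:

```go
package main

import "fmt"

// pickNUMANode returns the index of a NUMA node whose free cores can hold
// the whole request, preferring the tightest fit. It returns -1 when the
// request cannot avoid spanning NUMA nodes.
func pickNUMANode(freeCores []int, request int) int {
	best, bestFree := -1, int(^uint(0)>>1) // bestFree starts at max int
	for node, free := range freeCores {
		if free >= request && free < bestFree {
			best, bestFree = node, free
		}
	}
	return best
}

func main() {
	free := []int{3, 6, 8}             // free cores per NUMA node
	fmt.Println(pickNUMANode(free, 4)) // 1: node 1 is the tightest fit
	fmt.Println(pickNUMANode(free, 9)) // -1: must span NUMA nodes
}
```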

The Kubernetes scheduling framework

Kubernetes has officially supported the scheduling framework as stable since v1.19. The scheduling framework is a pluggable architecture for the Kubernetes scheduler: it adds a new set of "plugin" APIs to the existing scheduler, and plugins are compiled into the scheduler. This is good news for custom schedulers. Without modifying the kube-scheduler source code, we can implement different scheduling plugins and compile the plugin code together with kube-scheduler into one executable, producing a custom extended scheduler. This flexible extension mechanism simplifies developing and configuring scheduler plugins, and because no kube-scheduler source changes are required, the extended scheduler can quickly update its dependency and follow the latest community version.

The main extension points of the scheduler are shown in the figure above. Our extended scheduler is mainly concerned with the following ones:

  1. PreFilter and Filter

These two plugins filter out nodes that cannot run the Pod. If any Filter plugin marks a node as infeasible, the node does not enter the candidate set and drops out of the rest of the scheduling process.

  2. PreScore, Score, and NormalizeScore

These three plugins rank the nodes that survived the filter phase. The scheduler calls every scoring plugin for each node, and the node with the highest final score is chosen as the scheduling result.

  3. Reserve and Unreserve

The Reserve plugin reserves resources before the Pod is actually bound to the node, keeping the scheduler's view consistent. If binding fails, Unreserve releases the reserved resources.

  4. Bind

The Bind plugin binds the Pod to a node. The default bind plugin only sets spec.nodeName to finish scheduling; if we want to extend the scheduler and attach extra scheduling result information, we must disable the default Bind plugin and replace it with a custom one.
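The Filter → Score flow above can be modeled with a stdlib-only sketch. The real plugin interfaces live in k8s.io/kubernetes/pkg/scheduler/framework; the node and pod fields here are hypothetical stand-ins for real node state:

```go
package main

import "fmt"

type node struct {
	name      string
	freeCores int
}

type pod struct{ requestCores int }

// filter drops nodes that cannot host the Pod, as Filter plugins do.
func filter(p pod, nodes []node) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCores >= p.requestCores {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score ranks a feasible node; here simply "more free cores is better".
func score(n node) int { return n.freeCores }

// schedule runs the two phases and returns the best node, mirroring what
// the framework does before Reserve and Bind take over.
func schedule(p pod, nodes []node) (string, bool) {
	best, bestScore := "", -1
	for _, n := range filter(p, nodes) {
		if s := score(n); s > bestScore {
			best, bestScore = n.name, s
		}
	}
	return best, best != ""
}

func main() {
	nodes := []node{{"node-a", 2}, {"node-b", 8}, {"node-c", 6}}
	name, ok := schedule(pod{requestCores: 4}, nodes)
	fmt.Println(name, ok) // node-b true: node-a is filtered out
}
```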

Current state of existing solutions

At present, both the Kubernetes community and the Volcano open-source community have topology-aware scheduling solutions. The schemes share some similarities, but each has its limitations and cannot satisfy the complex scenarios of Xingchen computing.

Kubernetes Community

The Kubernetes community's scheduling SIG also has a topology-aware scheduling solution, driven mainly by Red Hat. It implements topology-aware node scheduling through scheduler-plugins and node-feature-discovery. The community approach only checks whether a node can satisfy the kubelet configuration requirements when filtering and scoring nodes; it does not perform the core binding, which is still left to kubelet. The related proposal is here. The implementation works as follows:

  1. nfd-topology-updater on each node reports the node topology to nfd-master over gRPC (every 60s).

  2. nfd-master writes the node topology and allocations into a CR (NodeResourceTopology).

  3. kube-scheduler is extended to consider the NodeTopology when scheduling.

  4. kubelet on the node completes the core binding.

This scheme has many problems and cannot meet the needs of production practice:

  1. The concrete core allocation is left to kubelet, so the scheduler only considers the resource topology information without selecting a topology, and reserves no resources. Node scheduling and topology scheduling therefore do not happen in the same link, which causes data inconsistency.

  2. Because the concrete core allocation is done by kubelet, the topology information of already-scheduled Pods depends on nfd-worker reporting every 60s. Topology discovery is therefore too slow, which makes the data inconsistency worse; see here.

  3. Pods that require topology affinity are not distinguished from ordinary Pods, so it is easy to waste the high-quality resources of nodes with the topology feature enabled.

Volcano Community

Volcano is a container batch computing engine for running high-performance workloads on Kubernetes and a CNCF incubating project. The v1.4.0-Beta release added an enhancement for NUMA awareness. Like the Kubernetes scheduling SIG approach, it does not implement the core binding itself but relies directly on kubelet's built-in feature. The implementation works as follows:

  1. resource-exporter is a DaemonSet deployed on every node; it collects the node topology information and writes it into a CR (Numatopology).
  2. Volcano performs NUMA-aware scheduling of Pods according to each node's Numatopology.
  3. kubelet on the node completes the core binding.

The problems are basically the same as in the Kubernetes scheduling SIG approach: the concrete core allocation depends on kubelet. Although the scheduler tries to stay consistent with kubelet, it cannot reserve resources, so inconsistency still occurs, especially under high concurrency.


From this survey, the open-source communities still prefer to hand node resource binding to kubelet, with the scheduler trying to stay consistent with kubelet; this is understandable and fits the community's direction. But none of the typical implementations is complete enough to meet the requirements of Tencent Xingchen computing: a complex production environment needs a more robust, more extensible solution. We therefore decided to build on the architectural strengths of each scheme and explore a more powerful, enhanced fine-grained resource scheduling scheme that fits Xingchen's real scenarios.

Problem analysis

The actually available CPU core count of offline VM nodes changes

From Section 1.2 we know that Tencent Xingchen computing uses a colocation scheme based on offline virtual machines, so the number of actually available CPU cores of a node changes with the peaks of the online business. The CPU core count that kubelet collects via cAdvisor on the offline host is therefore inaccurate: it is a fixed value. For offline resources we need the scheduler to obtain the node's actual computing power by other means.

At present, neither scheduling nor binding can see the actual computing power of the offline VM. As a result, tasks may be bound to NUMA nodes with severe online interference, where fierce resource competition degrades task performance.

Fortunately, on the physical machine we can collect the proportion of CPU resources actually available to the offline VM on each NUMA node, and compute the offline VM's actual computing power through a loss formula. The scheduler then only needs to be aware of the CPU topology and the actual computing power to allocate accordingly.

Fine-grained scheduling requires greater flexibility

Binding cores through kubelet's built-in cpumanager always applies to every Pod on the node: any Pod that meets the Guaranteed QoS condition and whose CPU request is an integer gets bound cores. However, some Pods are not high-load yet end up with exclusive CPUs, which easily wastes the high-quality resources of nodes with the topology feature enabled.

Meanwhile, nodes of different resource types have different topology-awareness requirements. For example, Xingchen's resource pool also contains many small, fragmented virtual machines. These nodes are not produced by colocation and their resources are comparatively stable, but they are very small (e.g. an 8-core CVM with 4 cores per NUMA node). Since most task specifications exceed 4 cores, such resources must be allowed to allocate across NUMA nodes, otherwise they are hard to match.

Topology-aware scheduling therefore needs greater flexibility, adapting to the various core-allocation and topology-awareness scenarios.

The scheduling scheme needs to be more extensible

When the scheduler abstracts topology resources, extensibility must be considered. Extended resources that may need scheduling in the future, such as various heterogeneous resources, should also fit easily into the scheme, not just the resources covered by the cgroups subsystems.

Avoiding the CPU competition caused by hyper-threading

When competition for CPU cores is fierce, hyper-threading can make performance worse. A more ideal allocation gives one logical core of a physical core to a high-load application and the other logical core to a non-busy application, or places two applications with opposite peak/valley patterns on the same physical core. At the same time, we avoid assigning two logical cores of the same physical core to one application, since that is likely to cause CPU competition.


To fully solve the above problems, and with future extensibility in mind, we designed a fine-grained scheduling scheme and named it cassini. The complete solution consists of three components and one CRD that cooperate to deliver fine-grained resource scheduling.

Note: the name cassini comes from the famous Saturn probe Cassini–Huygens, which made precise observations of Saturn; the name symbolizes more accurate topology discovery and scheduling.

Overall framework

The responsibilities of each module are as follows :

  • cassini-worker: collects the node resource topology and performs the resource binding; runs as a DaemonSet on every node.

  • cassini-master: collects node characteristics from external systems (such as the node's offline_capacity or power conditions); runs as a controller in a Deployment.

  • scheduler-plugins: an extended scheduler with additional scheduling plugins that replaces the native scheduler; it allocates the topology scheduling result while binding the node, and runs as a static Pod on every master node.

The overall scheduling process is as follows :

  1. cassini-worker starts and collects the topology resource information on the node.

  2. It creates or updates a CR of type NodeResourceTopology (NRT), which records the node's topology information.

  3. It reads kubelet's cpu_manager_state file and patches the kubelet core-binding results of existing containers into the Pod annotations.

  4. cassini-master updates the corresponding node's NRT object with the information obtained from external systems.

  5. The extension scheduler scheduler-plugins schedules the Pod: it perceives the node's topology from the NRT object and writes the topology scheduling result into the Pod annotations.

  6. kubelet on the node watches the Pod and prepares to start it.

  7. kubelet calls the container runtime interface to start the container.

  8. cassini-worker periodically queries kubelet's port 10250 to list the Pods on the node and reads the scheduler's topology scheduling result from each Pod's annotations.

  9. cassini-worker calls the container runtime interface to update the container's core-binding result.

Overall, cassini-worker collects detailed resource topology information on the node, and cassini-master obtains additional node resource information from external systems. scheduler-plugins extends the native scheduler, uses this extra information as the basis for finer-grained scheduling decisions, and writes the results into the Pod annotations. Finally, cassini-worker takes on the role of executor and enforces the scheduler's resource topology decisions.

API Design

NodeResourceTopology (NRT) is a Kubernetes CRD that abstractly describes node resource topology information; it mainly follows the design of the Kubernetes scheduling SIG. Each Zone describes an abstract topology region, ZoneType describes its type, and ResourceInfo describes the total resources in the Zone.

```go
// Zone represents a resource topology zone, e.g. socket, node, die or core.
type Zone struct {
	// Name represents the zone name.
	// +required
	Name string `json:"name" protobuf:"bytes,1,opt,name=name"`
	// Type represents the zone type.
	// +kubebuilder:validation:Enum=Node;Socket;Core
	// +required
	Type ZoneType `json:"type" protobuf:"bytes,2,opt,name=type"`
	// Parent represents the name of parent zone.
	// +optional
	Parent string `json:"parent,omitempty" protobuf:"bytes,3,opt,name=parent"`
	// Costs represents the cost between different zones.
	// +optional
	Costs CostList `json:"costs,omitempty" protobuf:"bytes,4,rep,name=costs"`
	// Attributes represents zone attributes if any.
	// +optional
	Attributes map[string]string `json:"attributes,omitempty" protobuf:"bytes,5,rep,name=attributes"`
	// Resources represents the resource info of the zone.
	// +optional
	Resources *ResourceInfo `json:"resources,omitempty" protobuf:"bytes,6,rep,name=resources"`
}
```

Note that, for greater extensibility, every Zone carries an Attributes map describing custom properties of the Zone. As described in Section 4.1, we write the collected actual computing power of the offline VM into the Attributes field to describe the actually available computing power of each NUMA node.
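Reading the actual offline computing power back out of a Zone could look like the sketch below. The "offline_capacity" key and its millicore encoding ("5600m") are assumptions for illustration; the article does not specify the exact format:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// offlineCapacityOf reads the actual offline computing power from a Zone's
// Attributes map. The "offline_capacity" key and the "5600m" millicore
// encoding are hypothetical, chosen only to illustrate the idea.
func offlineCapacityOf(attrs map[string]string) (int64, bool) {
	raw, ok := attrs["offline_capacity"]
	if !ok || !strings.HasSuffix(raw, "m") {
		return 0, false
	}
	v, err := strconv.ParseInt(strings.TrimSuffix(raw, "m"), 10, 64)
	if err != nil {
		return 0, false
	}
	return v, true // millicores actually available to offline tasks
}

func main() {
	zoneAttrs := map[string]string{"offline_capacity": "5600m"}
	if mc, ok := offlineCapacityOf(zoneAttrs); ok {
		fmt.Println(mc) // 5600
	}
}
```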

Scheduler design

The extension scheduler adds a new plugin on top of the native scheduler, roughly as follows:

  1. Filter: read the NRT resource, then filter nodes and select a topology according to the actually available computing power of each topology zone and the Pod's topology-awareness requirements.

  2. Score: score by the number of Zones used; the more Zones, the lower the score (spanning Zones carries a performance penalty).

  3. Reserve: reserve resources before binding to avoid data inconsistency; kube-scheduler's cache has a similar assume mechanism.

  4. Bind: disable the default Bind plugin; at Bind time, attach the chosen Zone to the annotations.

The TopologyMatch plugin lets the scheduler consider node topology information and allocate topology when scheduling, and the Bind plugin appends the result to the annotations.
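The Score step can be sketched as a simple monotone function of the number of Zones an allocation would span; the 0-100 scale and the maxZones parameter are illustrative choices, not cassini's actual scoring:

```go
package main

import "fmt"

// zoneScore scores an allocation: spanning fewer zones scores higher,
// since crossing zones carries a performance penalty. The 0-100 scale
// and maxZones bound are illustrative, not cassini's real function.
func zoneScore(zonesNeeded, maxZones int) int {
	if zonesNeeded < 1 || zonesNeeded > maxZones {
		return 0 // infeasible or out of range
	}
	return 100 * (maxZones - zonesNeeded + 1) / maxZones
}

func main() {
	for z := 1; z <= 4; z++ {
		fmt.Println(z, zoneScore(z, 4)) // 1→100, 2→75, 3→50, 4→25
	}
}
```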

It is worth mentioning that we also implemented scheduler plugins for additional dimensions, such as node power scheduling.

master Design

cassini-master is the central control component; it collects from external systems the resource information that cannot be gathered on the node itself. We collect the actual available computing power of offline VMs on the physical machine, and cassini-master attaches these results to the NRT resource of the corresponding node. This component splits out the unified resource-collection function, making it easy to update and extend.

worker Design

cassini-worker is the more complex component and runs on every node as a DaemonSet. Its duties fall into two parts:

  1. Collect the topology resources on the node.
  2. Execute the scheduler's topology scheduling results.

Resource collection

The resource topology is collected mainly from /sys/devices, gathering the system's hardware information and creating or updating the NRT resource. The component also watches the node's kubelet configuration and reports it, so that the scheduler is aware of the node's kubelet core-binding policy, reserved resources, and other information.

Because hardware information hardly ever changes, the default collection and update interval is long. But watch events on the kubelet configuration are real-time: a kubelet configuration change is noticed immediately, which lets the scheduler make different decisions according to the node's configuration.

Topology scheduling result execution

The topology scheduling results are executed by a periodic reconcile, which completes the topology allocation for the containers.

  1. Obtain Pod information

To prevent cassini-worker on every node from watching kube-apiserver and putting pressure on it, cassini-worker instead periodically queries kubelet's port 10250 to list the Pods on the node and reads the scheduler's topology scheduling results from the Pod annotations. The Pod status also provides each container's ID and state, preparing for the allocation of topology resources.

  2. Record kubelet's CPU binding information

When kubelet has CPU core binding enabled, the extension scheduler skips all TopologyMatch plugins, so the Pod annotations contain no topology scheduling result. After kubelet completes the CPU core binding for a Pod, it records the result in the cpu_manager_state file; cassini-worker reads this file and patches kubelet's binding result into the Pod annotations for later scheduling decisions.

  3. Record the CPU binding information

Based on the cpu_manager_state file and the topology scheduling results from the Pod annotations, it generates its own cassini_cpu_manager_state file, which records the core-binding results of all Pods on the node.

  4. Perform the topology allocation

Based on the cassini_cpu_manager_state file, it calls the container runtime interface to complete the final container core binding.
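Step 3 above can be sketched as a merge of kubelet's cpu_manager_state entries with the scheduler-decided bindings taken from Pod annotations. The real kubelet checkpoint has more fields, and the shape of cassini_cpu_manager_state is not public, so the merged map here is purely illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// mergeBindings combines kubelet's cpu_manager_state entries with the
// extension scheduler's bindings from Pod annotations into one view.
// Scheduler-decided bindings win on conflict, matching the idea that
// cassini overrides the default binding. Illustrative sketch only.
func mergeBindings(kubeletStateJSON string, annotated map[string]string) (map[string]string, error) {
	var state struct {
		Entries map[string]map[string]string `json:"entries"` // podUID -> container -> cpuset
	}
	if err := json.Unmarshal([]byte(kubeletStateJSON), &state); err != nil {
		return nil, err
	}
	merged := map[string]string{}
	for pod, containers := range state.Entries {
		for container, cpus := range containers {
			merged[pod+"/"+container] = cpus
		}
	}
	for key, cpus := range annotated { // scheduler-decided bindings override
		merged[key] = cpus
	}
	return merged, nil
}

func main() {
	kubeletState := `{"policyName":"static","defaultCpuSet":"0-7","entries":{"pod-a":{"main":"2-3"}}}`
	annotated := map[string]string{"pod-b/main": "4-5"}
	merged, err := mergeBindings(kubeletState, annotated)
	if err != nil {
		panic(err)
	}
	fmt.Println(merged["pod-a/main"], merged["pod-b/main"]) // 2-3 4-5
}
```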

Optimization results

Following the fine-grained scheduling scheme above, we tested some online tasks. Previously, users had reported poor computing performance after tasks were scheduled to certain nodes, and frequent evictions caused by elevated steal_time. After switching to topology-aware scheduling, which perceives the offline actual computing power (offline_capacity) of each NUMA node, tasks are scheduled to suitable NUMA nodes. The training speed of the test task increased to 3x the original, comparable to running the business on high-priority CVMs, and the training speed became stable, using resources more rationally.

Meanwhile, with the native scheduler, which cannot perceive the actual computing power of the offline VM, a task scheduled onto a node raises that node's steal_time; the task cannot tolerate such a busy node and gets evicted by the Pod eviction component. With the native scheduler this causes repeated eviction and rescheduling, which badly hurts the SLA. With the scheme described in this article, the eviction rate caused by CPU preemption dropped to the level of physical machines.

Summary and prospect

Starting from real business pain points, this article first briefly introduced the business scenario of Tencent Xingchen computing and the background knowledge related to fine-grained scheduling, then surveyed the existing solutions and found that each has limitations. Finally, through analysis of the pain points, it presented the corresponding solutions. After optimization, resources are used more rationally, the training speed of the test task increased to 3x the original, and the eviction rate caused by CPU preemption dropped greatly, to the level of physical machines.

In the future, fine-grained scheduling will cover more scenarios, including whole-GPU and fractional-GPU scheduling based on GPU virtualization technology, scheduling for high-performance network architectures, power resource scheduling, support for overcommit scenarios, cooperation with the kernel scheduler, and more.

This article was created by [Tencent cloud native]. Please include a link to the original when reposting. Thanks.