What the hell is "PLEG is not healthy" in Kubelet?

>Original link: Deep understanding of PLEG is not healthy in Kubelet

In the Kubernetes community, "PLEG is not healthy" has been notorious for a long time. Whenever this error shows up, there is a high probability that the Node will go NotReady, and there are plenty of related issues in the community.

In this article, I will try to explain how PLEG works. Once you understand the mechanism, you will have a starting point for troubleshooting similar problems.

What is PLEG?

PLEG is short for Pod Lifecycle Event Generator. It is a module inside Kubelet whose main responsibility is to turn container state changes observed from the container runtime into Pod-level events and to write the resulting state into a cache, keeping the Pod cache up to date. Let's start with the background behind PLEG.

In Kubernetes, a daemon called Kubelet runs on each node to manage the containers on that node, reconciling their actual state with the state defined in the spec. Specifically, Kubelet needs to respond promptly to changes from two places:

  1. State defined in Pod spec
  2. State of the container at run time

For Pods, Kubelet watches for spec changes from multiple data sources. For containers, Kubelet used to poll the container runtime periodically (for example, every 10s) for the latest status of all containers.

As the number of Pods and containers grows, polling incurs overhead that can no longer be ignored, and the overhead is made worse by Kubelet's parallelism (each Pod gets its own goroutine to fetch container status). The periodic, highly concurrent requests caused by polling lead to spikes in CPU usage (even when neither the Pod definitions nor the container states have changed) and hurt performance. In the worst case the container runtime ends up under heavy load, which reduces the reliability of the system and limits Kubelet's scalability.

In order to reduce the management overhead of Pods and improve Kubelet's performance and scalability, PLEG was introduced to improve on the old model by:

  • Reducing unnecessary work while idle (for example, when Pod definitions and container states have not changed).
  • Reducing the number of concurrent requests for container state.

The overall workflow is shown in the figure below; the dotted part is what PLEG does.

How does "PLEG is not healthy" happen?

The Healthy() function is registered into runtimeState as a health check named "PLEG". Kubelet calls it regularly (every 10s by default) from its sync loop (the syncLoop() function). Healthy() checks whether relist (PLEG's key task) has completed within the last 3 minutes. If more than 3 minutes have passed since the last relist, "PLEG is not healthy" is reported (the message is assembled from the format strings shown below, e.g. "PLEG is not healthy: pleg was last seen active 3m0s ago; threshold is 3m0s"), and Kubelet skips pod synchronization until the check passes again.

I will walk through the source code for each step of the process. The code here is based on Kubernetes 1.11 (OpenShift 3.11). If you are not familiar with Go syntax, don't worry: the comments in the code are enough to follow the idea. I have also trimmed the less important parts of the source to keep it readable. Here is the code around the Healthy() function:

//// pkg/kubelet/pleg/generic.go - Healthy()

// The threshold needs to be greater than the relisting period + the
// relisting time, which can vary significantly. Set a conservative
// threshold to avoid flipping between healthy and unhealthy.
relistThreshold = 3 * time.Minute
:
func (g *GenericPLEG) Healthy() (bool, error) {
  relistTime := g.getRelistTime()
  elapsed := g.clock.Since(relistTime)
  if elapsed > relistThreshold {
    return false, fmt.Errorf("pleg was last seen active %v ago; threshold is %v", elapsed, relistThreshold)
  }
  return true, nil
}

//// pkg/kubelet/kubelet.go - NewMainKubelet()
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration, ...
:
  klet.runtimeState.addHealthCheck("PLEG", klet.pleg.Healthy)

//// pkg/kubelet/kubelet.go - syncLoop()
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
:
// The resyncTicker wakes up kubelet to checks if there are any pod workers
// that need to be sync'd. A one-second period is sufficient because the
// sync interval is defaulted to 10s.
:
  const (
    base   = 100 * time.Millisecond
    max    = 5 * time.Second
    factor = 2
  )
  duration := base
  for {
      if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
          glog.Infof("skipping pod synchronization - %v", rs)
          // exponential backoff
          time.Sleep(duration)
          duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
          continue
      }
    :
  }
:
}

//// pkg/kubelet/runtime.go - runtimeErrors()
func (s *runtimeState) runtimeErrors() []string {
:
    for _, hc := range s.healthChecks {
        if ok, err := hc.fn(); !ok {
            ret = append(ret, fmt.Sprintf("%s is not healthy: %v", hc.name, err))
        }
    }
:
}

A deep dive into the relist function

As mentioned above, the Healthy() function checks when relist last finished, but what is relist for? Before explaining relist, let's look at Pod lifecycle events. A Pod lifecycle event is a Pod-level abstraction of state changes in the underlying containers; it is independent of the underlying container runtime, so Kubelet is not coupled to any particular runtime.

type PodLifeCycleEventType string

const (
    ContainerStarted      PodLifeCycleEventType = "ContainerStarted"
    ContainerStopped      PodLifeCycleEventType = "ContainerStopped"
    NetworkSetupCompleted PodLifeCycleEventType = "NetworkSetupCompleted"
    NetworkFailed         PodLifeCycleEventType = "NetworkFailed"
)

// PodLifecycleEvent is an event reflects the change of the pod state.
type PodLifecycleEvent struct {
    // The pod ID.
    ID types.UID
    // The type of the event.
    Type PodLifeCycleEventType
    // The accompanied data which varies based on the event type.
    Data interface{}
}

Taking Docker as an example, starting the infra (sandbox) container of a Pod registers a NetworkSetupCompleted Pod lifecycle event in Kubelet.

So how does PLEG know that an infra container has started? It periodically relists all containers on the node (think of docker ps) and compares the result with the previous container list to detect state changes. This is exactly what the relist() function does. Although this looks similar to Kubelet's old polling, there is now only a single thread, PLEG, querying the runtime; the other threads no longer fetch container state concurrently and are only woken up when state they care about has actually changed. Moreover, relist is agnostic of the container runtime and needs no external event source, which makes it a clean design.
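To make the idea concrete, here is a small, self-contained Go sketch (not Kubelet source; the pod UID, container IDs, and the simplified state-to-event mapping are made up for illustration) of what relist conceptually does: diff the previous and current container states and emit a Pod-level event for every container whose state changed, while unchanged containers generate no work. The real mapping of states to events is shown later in generateEvents().

package main

import "fmt"

type PodLifeCycleEventType string

const (
	ContainerStarted PodLifeCycleEventType = "ContainerStarted"
	ContainerDied    PodLifeCycleEventType = "ContainerDied"
)

type PodLifecycleEvent struct {
	ID   string // pod UID (types.UID in Kubelet)
	Type PodLifeCycleEventType
	Data interface{} // container ID
}

type containerState string

const (
	running     containerState = "running"
	exited      containerState = "exited"
	nonExistent containerState = "non-existent"
)

// diff compares the container states seen by the previous and the current
// listing and generates one event per state change for the given pod.
func diff(podUID string, old, cur map[string]containerState) []*PodLifecycleEvent {
	var events []*PodLifecycleEvent
	seen := map[string]bool{}
	for id := range old {
		seen[id] = true
	}
	for id := range cur {
		seen[id] = true
	}
	for id := range seen {
		oldState, ok := old[id]
		if !ok {
			oldState = nonExistent
		}
		newState, ok := cur[id]
		if !ok {
			newState = nonExistent
		}
		if oldState == newState {
			continue // unchanged container: no event, no extra work
		}
		switch newState {
		case running:
			events = append(events, &PodLifecycleEvent{ID: podUID, Type: ContainerStarted, Data: id})
		default:
			events = append(events, &PodLifecycleEvent{ID: podUID, Type: ContainerDied, Data: id})
		}
	}
	return events
}

func main() {
	old := map[string]containerState{"c1": running, "c2": running}
	cur := map[string]containerState{"c1": running, "c2": exited, "c3": running}
	for _, e := range diff("pod-abc", old, cur) {
		fmt.Printf("pod %s: %s (%v)\n", e.ID, e.Type, e.Data)
	}
}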

Let's take a look at the internal implementation of the relist() function. The complete process is as follows:

Pay attention to the RPC calls in the figure; they will be explained in detail later. The complete source code can be found Here.

Although relist is scheduled every second, it can take longer than 1s to complete, because the next relist cannot start until the previous one has finished. Imagine the container runtime responding slowly, or a large number of container states changing within one cycle: the completion time of relist can no longer be ignored. If it takes 5s, the next relist starts 6s after the previous one did.
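The scheduling behavior is easy to reproduce with a standalone sketch that mimics wait.Until (this is an illustration, not the Kubernetes wait package): the next iteration is scheduled only after the previous one returns, so a relist that takes 5s with a 1s period means consecutive relists start about 6s apart.

package main

import (
	"fmt"
	"time"
)

// until mimics the sliding behavior of wait.Until: run f, then wait for
// the period, then run f again.
func until(f func(), period time.Duration, iterations int) {
	for i := 0; i < iterations; i++ {
		f()
		time.Sleep(period)
	}
}

func main() {
	relist := func() {
		time.Sleep(5 * time.Second) // simulate a slow relist
		fmt.Println("relist finished at", time.Now().Format("15:04:05"))
	}
	// With a 1s period, each relist starts ~6s after the previous one.
	until(relist, 1*time.Second, 3)
}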

The relevant source code is as follows:

//// pkg/kubelet/kubelet.go - NewMainKubelet()

// Generic PLEG relies on relisting for discovering container events.
// A longer period means that kubelet will take longer to detect container
// changes and to update pod status. On the other hand, a shorter period
// will cause more frequent relisting (e.g., container runtime operations),
// leading to higher cpu usage.
// Note that even though we set the period to 1s, the relisting itself can
// take more than 1s to finish if the container runtime responds slowly
// and/or when there are many container changes in one cycle.
plegRelistPeriod = time.Second * 1

// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration, ...
:
  klet.pleg = pleg.NewGenericPLEG(klet.containerRuntime, plegChannelCapacity, plegRelistPeriod, klet.podCache, clock.RealClock{})

//// pkg/kubelet/pleg/generic.go - Start()

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
  go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

//// pkg/kubelet/pleg/generic.go - relist()
func (g *GenericPLEG) relist() {
... WE WILL REVIEW HERE ...
}

Returning to the figure above, the first step of relist is to record Kubelet metrics (for example, kubelet_pleg_relist_latency_microseconds) and take the current timestamp, and then fetch the current list of Pods and their containers from the runtime via GetPods(). This Pod list is compared with the previous one to work out which states have changed, and the corresponding Pod lifecycle events and changed states are generated at the same time.

//// pkg/kubelet/pleg/generic.go - relist()
  :
  // get a current timestamp
  timestamp := g.clock.Now()

  // kubelet_pleg_relist_latency_microseconds for prometheus metrics
    defer func() {
        metrics.PLEGRelistLatency.Observe(metrics.SinceInMicroseconds(timestamp))
    }()

  // Get all the pods.
    podList, err := g.runtime.GetPods(true)
  :

The call stack of the GetPods() function is as follows:

The relevant source code is as follows:

//// pkg/kubelet/kuberuntime/kuberuntime_manager.go - GetPods()

// GetPods returns a list of containers grouped by pods. The boolean parameter
// specifies whether the runtime returns all containers including those already
// exited and dead containers (used for garbage collection).
func (m *kubeGenericRuntimeManager) GetPods(all bool) ([]*kubecontainer.Pod, error) {
    pods := make(map[kubetypes.UID]*kubecontainer.Pod)
    sandboxes, err := m.getKubeletSandboxes(all)
:
}

//// pkg/kubelet/kuberuntime/kuberuntime_sandbox.go - getKubeletSandboxes()

// getKubeletSandboxes lists all (or just the running) sandboxes managed by kubelet.
func (m *kubeGenericRuntimeManager) getKubeletSandboxes(all bool) ([]*runtimeapi.PodSandbox, error) {
:
    resp, err := m.runtimeService.ListPodSandbox(filter)
:
}

//// pkg/kubelet/remote/remote_runtime.go - ListPodSandbox()

// ListPodSandbox returns a list of PodSandboxes.
func (r *RemoteRuntimeService) ListPodSandbox(filter *runtimeapi.PodSandboxFilter) ([]*runtimeapi.PodSandbox, error) {
:
    resp, err := r.runtimeClient.ListPodSandbox(ctx, &runtimeapi.ListPodSandboxRequest{
:
    return resp.Items, nil
}

After the Pod list has been fetched, the relist completion time is updated to the current timestamp. The Healthy() function then uses this timestamp to judge whether more than 3 minutes have passed since the last relist.

//// pkg/kubelet/pleg/generic.go - relist()

  // update as a current timestamp
  g.updateRelistTime(timestamp)

The current Pod list is then compared with the one from the previous relist, and the corresponding Pod-level events are generated for each change. The relevant source code is as follows:

//// pkg/kubelet/pleg/generic.go - relist()

  pods := kubecontainer.Pods(podList)
  g.podRecords.setCurrent(pods)

  // Compare the old and the current pods, and generate events.
  eventsByPodID := map[types.UID][]*PodLifecycleEvent{}
  for pid := range g.podRecords {
    oldPod := g.podRecords.getOld(pid)
    pod := g.podRecords.getCurrent(pid)

    // Get all containers in the old and the new pod.
    allContainers := getContainersFromPods(oldPod, pod)
    for _, container := range allContainers {
          events := computeEvents(oldPod, pod, &container.ID)

          for _, e := range events {
                updateEvents(eventsByPodID, e)
          }
        }
  }

The generateEvents() function (called by computeEvents()) produces the corresponding Pod-level events (such as ContainerStarted, ContainerDied, and so on), which are then recorded via the updateEvents() function.
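updateEvents() itself is not quoted in this article; judging only from how it is called, a plausible minimal version (an assumption, not verified Kubelet source) would simply group each generated event under its pod ID, reusing the types shown above:

// Hypothetical sketch: append the event to its pod's slice, skipping nil events.
func updateEvents(eventsByPodID map[types.UID][]*PodLifecycleEvent, e *PodLifecycleEvent) {
	if e == nil {
		return
	}
	eventsByPodID[e.ID] = append(eventsByPodID[e.ID], e)
}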

The contents of the computeEvents() function are as follows:

//// pkg/kubelet/pleg/generic.go - computeEvents()

func computeEvents(oldPod, newPod *kubecontainer.Pod, cid *kubecontainer.ContainerID) []*PodLifecycleEvent {
:
    return generateEvents(pid, cid.ID, oldState, newState)
}

//// pkg/kubelet/pleg/generic.go - generateEvents()

func generateEvents(podID types.UID, cid string, oldState, newState plegContainerState) []*PodLifecycleEvent {
:
    glog.V(4).Infof("GenericPLEG: %v/%v: %v -> %v", podID, cid, oldState, newState)
    switch newState {
    case plegContainerRunning:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerStarted, Data: cid}}
    case plegContainerExited:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}}
    case plegContainerUnknown:
      return []*PodLifecycleEvent{{ID: podID, Type: ContainerChanged, Data: cid}}
    case plegContainerNonExistent:
      switch oldState {
      case plegContainerExited:
        // We already reported that the container died before.
        return []*PodLifecycleEvent{{ID: podID, Type: ContainerRemoved, Data: cid}}
      default:
        return []*PodLifecycleEvent{{ID: podID, Type: ContainerDied, Data: cid}, {ID: podID, Type: ContainerRemoved, Data: cid}}
      }
    default:
      panic(fmt.Sprintf("unrecognized container state: %v", newState))
  }
}

The last task of relist is to check whether any events are associated with a Pod and to update the podCache accordingly:

//// pkg/kubelet/pleg/generic.go - relist()

  // If there are events associated with a pod, we should update the
  // podCache.
  for pid, events := range eventsByPodID {
    pod := g.podRecords.getCurrent(pid)
    if g.cacheEnabled() {
      // updateCache() will inspect the pod and update the cache. If an
      // error occurs during the inspection, we want PLEG to retry again
      // in the next relist. To achieve this, we do not update the
      // associated podRecord of the pod, so that the change will be
      // detect again in the next relist.
      // TODO: If many pods changed during the same relist period,
      // inspecting the pod and getting the PodStatus to update the cache
      // serially may take a while. We should be aware of this and
      // parallelize if needed.
      if err := g.updateCache(pod, pid); err != nil {
        glog.Errorf("PLEG: Ignoring events for pod %s/%s: %v", pod.Name, pod.Namespace, err)
        :
      }
      :
    }
    // Update the internal storage and send out the events.
    g.podRecords.update(pid)
    for i := range events {
      // Filter out events that are not reliable and no other components use yet.
      if events[i].Type == ContainerChanged {
           continue
      }
      g.eventChannel <- events[i]
     }
  }

updateCache() inspects each Pod and updates the cache one Pod at a time in a single loop, so if many Pods changed during the same relist period, updateCache becomes a bottleneck. Finally, the updated Pod lifecycle events are sent to eventChannel.

Each Pod also requires several remote (CRI) calls to fetch its status, so the more Pods change, the more events are generated, the more RPC calls are issued, and the higher the latency can become.
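A rough, standalone sketch (not Kubelet source; the 100ms round trip and the pod count are made-up numbers) shows why the serial loop hurts: with a simulated 100ms status query per Pod, 50 changed Pods already cost about 5 seconds of relist time.

package main

import (
	"fmt"
	"time"
)

// getPodStatus simulates the remote (CRI) round trips that updateCache
// performs for a single pod.
func getPodStatus(podID int) string {
	time.Sleep(100 * time.Millisecond)
	return fmt.Sprintf("status-of-pod-%d", podID)
}

func main() {
	const changedPods = 50
	start := time.Now()
	for i := 0; i < changedPods; i++ {
		_ = getPodStatus(i) // handled one by one, as in relist's updateCache loop
	}
	fmt.Printf("serial status refresh of %d pods took %v\n", changedPods, time.Since(start))
}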

The detailed call stack of updateCache() is shown in the following figure, where GetPodStatus() is used to fetch the Pod's status from the runtime:

The complete code is as follows:

//// pkg/kubelet/pleg/generic.go - updateCache()

func (g *GenericPLEG) updateCache(pod *kubecontainer.Pod, pid types.UID) error {
:
    timestamp := g.clock.Now()
    // TODO: Consider adding a new runtime method
    // GetPodStatus(pod *kubecontainer.Pod) so that Docker can avoid listing
    // all containers again.
    status, err := g.runtime.GetPodStatus(pod.ID, pod.Name, pod.Namespace)
  :
    g.cache.Set(pod.ID, status, err, timestamp)
    return err
}

//// pkg/kubelet/kuberuntime/kuberuntime_manager.go - GetPodStatus()

// GetPodStatus retrieves the status of the pod, including the
// information of all containers in the pod that are visible in Runtime.
func (m *kubeGenericRuntimeManager) GetPodStatus(uid kubetypes.UID, name, namespace string) (*kubecontainer.PodStatus, error) {
  podSandboxIDs, err := m.getSandboxIDByPodUID(uid, nil)
  :
    for idx, podSandboxID := range podSandboxIDs {
        podSandboxStatus, err := m.runtimeService.PodSandboxStatus(podSandboxID)
    :
    }

    // Get statuses of all containers visible in the pod.
    containerStatuses, err := m.getPodContainerStatuses(uid, name, namespace)
  :
}

//// pkg/kubelet/kuberuntime/kuberuntime_sandbox.go - getSandboxIDByPodUID()

// getPodSandboxID gets the sandbox id by podUID and returns ([]sandboxID, error).
// Param state could be nil in order to get all sandboxes belonging to same pod.
func (m *kubeGenericRuntimeManager) getSandboxIDByPodUID(podUID kubetypes.UID, state *runtimeapi.PodSandboxState) ([]string, error) {
  :
  sandboxes, err := m.runtimeService.ListPodSandbox(filter)
  :  
  return sandboxIDs, nil
}

//// pkg/kubelet/remote/remote_runtime.go - PodSandboxStatus()

// PodSandboxStatus returns the status of the PodSandbox.
func (r *RemoteRuntimeService) PodSandboxStatus(podSandBoxID string) (*runtimeapi.PodSandboxStatus, error) {
    ctx, cancel := getContextWithTimeout(r.timeout)
    defer cancel()

    resp, err := r.runtimeClient.PodSandboxStatus(ctx, &runtimeapi.PodSandboxStatusRequest{
        PodSandboxId: podSandBoxID,
    })
  :
    return resp.Status, nil
}

//// pkg/kubelet/kuberuntime/kuberuntime_container.go - getPodContainerStatuses()

// getPodContainerStatuses gets all containers' statuses for the pod.
func (m *kubeGenericRuntimeManager) getPodContainerStatuses(uid kubetypes.UID, name, namespace string) ([]*kubecontainer.ContainerStatus, error) {
  // Select all containers of the given pod.
  containers, err := m.runtimeService.ListContainers(&runtimeapi.ContainerFilter{
    LabelSelector: map[string]string{types.KubernetesPodUIDLabel: string(uid)},
  })
  :
  // TODO: optimization: set maximum number of containers per container name to examine.
  for i, c := range containers {
    status, err := m.runtimeService.ContainerStatus(c.Id)
    :
  }
  :
  return statuses, nil
}

That is the complete call stack of the relist() function, with the relevant source code woven in along the way; I hope it gives you a clearer picture of PLEG. To know the health of PLEG in real time, the best approach is to monitor relist.

< span id = "inline TOC" > 4. < / span > monitor relist

We can observe relist latency through Kubelet's metrics. With a relist period of 1s, the relist completion time plus 1s equals the value of the kubelet_pleg_relist_interval_microseconds metric (in the output below, the p50 interval of about 1.054s indeed matches the p50 latency of about 0.053s plus the 1s period). You can also monitor the latency of each container runtime operation. These metrics provide useful clues when troubleshooting.

You can fetch Kubelet's metrics on each node from the URL https://127.0.0.1:10250/metrics.

# HELP kubelet_pleg_relist_interval_microseconds Interval in microseconds between relisting in PLEG.
# TYPE kubelet_pleg_relist_interval_microseconds summary
kubelet_pleg_relist_interval_microseconds{quantile="0.5"} 1.054052e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.9"} 1.074873e+06
kubelet_pleg_relist_interval_microseconds{quantile="0.99"} 1.126039e+06
kubelet_pleg_relist_interval_microseconds_count 5146

# HELP kubelet_pleg_relist_latency_microseconds Latency in microseconds for relisting pods in PLEG.
# TYPE kubelet_pleg_relist_latency_microseconds summary
kubelet_pleg_relist_latency_microseconds{quantile="0.5"} 53438
kubelet_pleg_relist_latency_microseconds{quantile="0.9"} 74396
kubelet_pleg_relist_latency_microseconds{quantile="0.99"} 115232
kubelet_pleg_relist_latency_microseconds_count 5106

# HELP kubelet_runtime_operations Cumulative number of runtime operations by operation type.
# TYPE kubelet_runtime_operations counter
kubelet_runtime_operations{operation_type="container_status"} 472
kubelet_runtime_operations{operation_type="create_container"} 93
kubelet_runtime_operations{operation_type="exec"} 1
kubelet_runtime_operations{operation_type="exec_sync"} 533
kubelet_runtime_operations{operation_type="image_status"} 579
kubelet_runtime_operations{operation_type="list_containers"} 10249
kubelet_runtime_operations{operation_type="list_images"} 782
kubelet_runtime_operations{operation_type="list_podsandbox"} 10154
kubelet_runtime_operations{operation_type="podsandbox_status"} 315
kubelet_runtime_operations{operation_type="pull_image"} 57
kubelet_runtime_operations{operation_type="remove_container"} 49
kubelet_runtime_operations{operation_type="run_podsandbox"} 28
kubelet_runtime_operations{operation_type="start_container"} 93
kubelet_runtime_operations{operation_type="status"} 1116
kubelet_runtime_operations{operation_type="stop_container"} 9
kubelet_runtime_operations{operation_type="stop_podsandbox"} 33
kubelet_runtime_operations{operation_type="version"} 564

# HELP kubelet_runtime_operations_latency_microseconds Latency in microseconds of runtime operations. Broken down by operation type.
# TYPE kubelet_runtime_operations_latency_microseconds summary
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.5"} 12117
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.9"} 26607
kubelet_runtime_operations_latency_microseconds{operation_type="container_status",quantile="0.99"} 27598
kubelet_runtime_operations_latency_microseconds_count{operation_type="container_status"} 486
kubelet_runtime_operations_latency_microseconds{operation_type="list_containers",quantile="0.5"} 29972
kubelet_runtime_operations_latency_microseconds{operation_type="list_containers",quantile="0.9"} 47907
kubelet_runtime_operations_latency_microseconds{operation_type="list_containers",quantile="0.99"} 80982
kubelet_runtime_operations_latency_microseconds_count{operation_type="list_containers"} 10812
kubelet_runtime_operations_latency_microseconds{operation_type="list_podsandbox",quantile="0.5"} 18053
kubelet_runtime_operations_latency_microseconds{operation_type="list_podsandbox",quantile="0.9"} 28116
kubelet_runtime_operations_latency_microseconds{operation_type="list_podsandbox",quantile="0.99"} 68748
kubelet_runtime_operations_latency_microseconds_count{operation_type="list_podsandbox"} 10712
kubelet_runtime_operations_latency_microseconds{operation_type="podsandbox_status",quantile="0.5"} 4918
kubelet_runtime_operations_latency_microseconds{operation_type="podsandbox_status",quantile="0.9"} 15671
kubelet_runtime_operations_latency_microseconds{operation_type="podsandbox_status",quantile="0.99"} 18398
kubelet_runtime_operations_latency_microseconds_count{operation_type="podsandbox_status"} 323
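If you prefer to collect these programmatically rather than open the URL in a browser, here is a small standalone Go sketch that fetches the endpoint and prints only the PLEG series. The service-account token path, the skipped TLS verification, and whether the endpoint requires authentication at all are assumptions that depend on how your cluster and Kubelet are configured.

package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Assumed token location; adjust (or drop auth) for your environment.
	token, _ := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // sketch only
	}}
	req, err := http.NewRequest("GET", "https://127.0.0.1:10250/metrics", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+strings.TrimSpace(string(token)))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Print only the PLEG relist series.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "kubelet_pleg_relist") {
			fmt.Println(line)
		}
	}
}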

These metrics can also be collected and visualized with Prometheus.

< span id = "inline TOC" > 5. < / span > summary

In my experience, many factors can cause "PLEG is not healthy", and there are surely more that we have not yet run into. Here are a few causes I can think of:

  • The container runtime does not respond in time to an RPC call (possibly due to performance degradation, a deadlock, or a bug).
  • There are too many Pods on the node for relist to complete within 3 minutes; the number of events and the latency grow with the number of Pods, regardless of node resources.
  • relist hits a deadlock, which was fixed in Kubernetes 1.14.
  • CNI hits a bug while obtaining the Pod's network information.

< span id = "inline TOC" > 6. < / span > references

WeChat public address

Sweep the following two-dimensional code, pay attention to the WeChat public number, reply to the public number in the public address, join the group of cloud native communication, and discuss with Sun Hongliang, Zhang Guanchang, Yangming and other big guys to explore the cloud original technology.

Tags: kubelet Kubernetes github Docker

Posted on Mon, 10 Feb 2020 23:03:04 -0800 by Harry57