Investigating and fixing "StopPodSandbox from runtime service failed" Kubelet errors
For months now (maybe years, who knows 🤷) I've had the following error sprinkled throughout my Kubelet logs across several of my worker nodes:
StopPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to destroy network for sandbox "055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708": failed to get network "cbr0" cached result: decoding version from network config: unexpected end of JSON input
As there was never any impact associated with this that I could see, I just ignored it... for a long time!
Well, fast forward to last Friday night, when a sudden hyper-focus hit and I decided I wanted to know what was going on and how to stop it - here's how it went.
A little context
Before we dive into things I think it's worth providing a bit of context first so we all know what we're working with. The Kubernetes cluster in question is my "homelab" cluster, which runs Talos Linux within a Proxmox cluster. The cluster has a single control plane node and 5 worker nodes; the worker nodes are where we'll be focussing. Beyond that there's nothing too special about this cluster. CNI is provided by Flannel with default configuration. The only thing that is kinda unusual is my use of SMB for my CSI driver - I only mention this because it's possible that a failure here might have been the thing that triggered the initial issue, but I don't know for sure.
Identifying the issue
As I said, I had ignored this for months and not thought much of it until Friday. I was cleaning up some unrelated stuff in my cluster and was checking my Kubelet logs stored in Loki to see if my cleaning up had been applied. This is how I (re-)noticed the high number of sandbox errors from various Kubelets. So I took a closer look with:
{service="kubelet"} |= "failed to get network \"cbr0\" cached result"
There were a LOT of these errors in the logs, almost 1000 per hour! 😳
Even though there were a lot of errors, there were actually only a handful of sandbox IDs causing the problem, so it wasn't widespread, just frequent.
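To get a feel for how noisy this was, and which nodes were affected, a count over the same filter does the trick. A rough sketch - node here is a stand-in for whatever label your log shipper attaches to identify the host:
sum by (node) (
  count_over_time({service="kubelet"} |= "failed to get network \"cbr0\" cached result" [1h])
)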
Figuring out what was wrong
Because I'm a Professional Kubernetes Platform Engineer™ I knew exactly how to tackle this... ask the internet! So I fired up DuckDuckGo, pasted in my error message and started wading through a LOT of useless results.
- I found a couple of similar issues that people had experienced with Calico, but they didn't seem to fit what I was seeing.
- There was a closed issue against containerd where this comment gave me some hope of solving it with a loopback CNI, but that turned out to be unrelated too.
These did give me enough info to start looking in the right places though. There was just one problem - Talos Linux doesn't have a shell you can SSH into, it's all API driven.
kubectl debug to the rescue!!!
Thankfully, it's possible to mimic an SSH session using kubectl debug against the node. As I knew I'd need a bunch of container-related tools to help with debugging, I launched a debug container using the alpine-containertools image from Rory McCune. Due to how the debug command mounts the host filesystem I also had to override the CONTAINER_RUNTIME_ENDPOINT environment variable to point to the correct location.
kubectl debug -it --image raesene/alpine-containertools \
--env CONTAINER_RUNTIME_ENDPOINT=unix:///host/run/containerd/containerd.sock \
node/talos-192-168-1-15 \
-- sh
Now we can start poking around on the node. First, let's see if we can force delete the sandbox pod ourselves.
crictl rmp --force 055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708
Failed to stop sandbox before removing: rpc error: code = Unknown desc = failed to destroy network for sandbox "055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708": failed to get network "cbr0" cached result: decoding version from network config: unexpected end of JSON input
No luck! But we've been able to reproduce what Kubelet is hitting at least.
I spent a bit more time poking at crictl to get an understanding of the state of things. Doing so, I was able to identify which pods were broken on each node.
# Get the full details about the pod in containerd
crictl inspectp --output=json 055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708
# List all pods known to containerd, and filter to those in a NotReady state
crictl pods | grep NotReady
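If your crictl is new enough, the state filter and quiet flag get you just the NotReady sandbox IDs without the grep - something like:
# Only print the IDs of sandboxes in the NotReady state
crictl pods --state NotReady --quiet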
Eventually, after wading through all the unrelated search results, one did seem relevant: a GitHub issue on containerd, "Empty cache files causes "KillPodSandbox" errors when deleting pods".
Cleaning things up
This final GitHub issue contained one important piece of information I was missing - the location of the network cache: /var/lib/cni/results. This is the directory where the CNI stores a JSON file for each network it creates. So let's go take a look:
> ls -la /host/var/lib/cni/results
total 148
drwx------ 2 root root 4096 Sep 27 05:48 .
drwx------ 5 root root 52 Jun 23 2022 ..
-rw------- 1 root root 1647 Sep 27 05:47 cbr0-01d576d2f808a969bb7cb8da11d1ee3117ec4e9792d4aea33abec55318b74e01-eth0
-rw------- 1 root root 0 Jun 15 21:28 cbr0-055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708-eth0
-rw------- 1 root root 1608 Sep 27 05:48 cbr0-225486a2df0c2c8be5038245d099b4c48dc03ede648735a2b4676154650e7741-eth0
-rw------- 1 root root 2045 Sep 27 05:46 cbr0-2e848e89543e7b6491d0b47eb2be7ec620865024d942012303ede16842aa1108-eth0
-rw------- 1 root root 1638 Sep 27 05:48 cbr0-415cd18f24ec2740486a7bf254054f871bd8abe4c26bc6e27dbf1f5a10c29f69-eth0
-rw------- 1 root root 1638 Sep 27 05:46 cbr0-47d021d26d2d1935025cee58941ff68ad28e09f1337b867e0865be56dde64a2a-eth0
-rw------- 1 root root 1616 Sep 27 05:46 cbr0-5d40d2f86c6f4be7a85969f24e844ea049b41af14b97cfa3f5a655f52d4fc695-eth0
-rw------- 1 root root 1616 Sep 27 05:48 cbr0-68f252ab057496036e8852a4f6103c1b6c564f6976d04676010858a5588a9a10-eth0
-rw------- 1 root root 1596 Sep 27 05:48 cbr0-91d8805fe4b95c92f4c53e1321813b5fe9f71467c5692328bb7646649db57b22-eth0
-rw------- 1 root root 1764 Sep 27 05:48 cbr0-a0eaa73bd701c9b1b1167efd49487b48074936211de1c6be37beec5184e42691-eth0
Notice anything suspicious? The size of the file associated with our failing sandbox ID is 0. 🤔 That explains why the CNI is unable to load from the cache then - it's empty. Taking a quick look at one of the other files shows a valid JSON object with various network configuration stored within.
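Before deleting anything, it's worth double checking the suspect file really is empty, and whether any others on this node are in the same state:
# Confirm the suspect cache file is zero bytes
wc -c /host/var/lib/cni/results/cbr0-055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708-eth0
# List any other empty cache files on this node
find /host/var/lib/cni/results -type f -size 0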
So let's delete this file - rm -rf /host/var/lib/cni/results/cbr0-055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708-eth0
With that gone, let's try again to delete the pod:
crictl rmp 055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708
Removed sandbox 055b221e44a28ce8d9120f771d5e1ef201f2457ce49c58999a0147104cca2708
SUCCESS! 🎉
Now that we know what to clean up, it's time to repeat the process for each failing sandbox ID on each node. Once that is done the kubelet logs are looking muuuucccchhhhhhh better!
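To save some typing when repeating this on each node, a rough loop like the one below (run from the same debug session, and assuming every broken file follows the cbr0-[SANDBOX_ID]-eth0 naming seen above) does the find, delete and remove in one go:
# For every empty CNI cache file: delete it, then force remove the matching sandbox
for f in /host/var/lib/cni/results/cbr0-*-eth0; do
  [ -s "$f" ] && continue                    # skip files that still have content
  id=$(basename "$f"); id=${id#cbr0-}; id=${id%-eth0}
  echo "cleaning up sandbox $id"
  rm -f "$f"
  crictl rmp --force "$id"
done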
Preventing this happening again
As I'm still unclear why or how these pods first failed, I can't be sure it won't happen again. My current leading theories are that it was either caused by a previous power failure that left the nodes in a bad state, or that it was related to my NAS crashing, as the few pods I took a close look at all had SMB volumes associated with them.
Regardless of what caused it, I want to make sure I'm being more proactive going forward and leaving my cluster in a better state than it was before. I could set up a dashboard that includes the Loki log query we used at the start of this post, but that requires me to check it often and... I'm lazy. What I really want is some alerts to annoy me if/when this happens again.
I already have quite a decent monitoring and alerting setup in my homelab, made up of VictoriaMetrics, AlertManager and various metric exporters, so I'm starting from a good place. What I'm lacking, though, is containerd metrics.
A quick search led me to this great post by Abraham Dahunsi covering a lot of details about monitoring containerd. It gave me the information on which metrics to work with and what I can do with them, but I was still missing the actual metrics. In Talos Linux the containerd metrics aren't enabled by default, but thankfully they're quite simple to enable - I just needed to add the following to the machine config for each of my nodes:
machine:
  files:
    - content: |
        [metrics]
        address = "0.0.0.0:11234"
      path: /etc/cri/conf.d/20-customization.part
      op: create
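One way to roll that out is as a machine config patch with talosctl, then sanity check that containerd is actually serving metrics on the new port. A sketch - the patch file name is a placeholder, and the curl assumes port 11234 is reachable from wherever you run it:
# Apply the machine config patch to a worker node (repeat for each node)
talosctl patch machineconfig --nodes 192.168.1.15 --patch @containerd-metrics.yaml
# containerd serves its metrics at /v1/metrics on the configured address
curl -s http://192.168.1.15:11234/v1/metrics | head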
I then just needed to instruct vmagent to scrape this new endpoint:
scrape_configs:
  - job_name: containerd
    scrape_interval: 30s
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}:11234/proxy/v1/metrics
      - source_labels: [__meta_kubernetes_node_name]
        action: replace
        target_label: kubernetes_node
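Before building anything on top of it, a quick check that every node's containerd target is actually being scraped (each one should report 1):
up{job="containerd"}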
Now that I have the metrics coming in, we can put them to use identifying issues.
Identify failures starting a pod sandbox:
sum by(grpc_code, instance) (
rate(grpc_server_handled_total{job="containerd", grpc_code!="OK", grpc_method="RunPodSandbox"})
) > 0
Identify failures stopping or removing a pod sandbox (this is the issue we handled in this post):
sum by(grpc_code, grpc_method, instance) (
rate(grpc_server_handled_total{job="containerd", grpc_code!="OK", grpc_method=~"StopPodSandbox|RemovePodSandbox"})
) > 0
Find all pods known to containerd that aren't known to Kubernetes:
(
  container_pids_current{namespace="k8s.io"}
  unless on (container_id)
  label_replace(
    kube_pod_container_info,
    "container_id", "$1", "container_id", "containerd://(.+)"
  )
  unless on (container_id)
  label_replace(
    kube_pod_init_container_info,
    "container_id", "$1", "container_id", "containerd://(.+)"
  )
)
unless on (container_id)
(
  container_memory_swap_limit_bytes > 0
)
This last one has a lot going on but it breaks down to:
- List the pids counts for all containers in the k8s.io namespace in containerd
- Unless the container_id is also found on the kube_pod_container_info metric (where we need to strip the containerd:// prefix)
- Or the container_id is also found on the kube_pod_init_container_info metric
- And finally, drop anything whose container_memory_swap_limit_bytes is greater than 0, which filters out the sandbox containers (pause containers) that aren't exposed to Kubernetes anyway.
These three then become new alerts that can notify me if any match.
- name: containerd
  rules:
    - alert: ContainerdPodStartFailed
      annotations:
        description: |
          Containerd on **{{ .Labels.instance }}** has failed to start a pod sandbox due to **{{ .Labels.grpc_code }}**.
      expr: |
        sum by(grpc_code, instance) (rate(grpc_server_handled_total{job="containerd", grpc_code!="OK", grpc_method="RunPodSandbox"})) > 0
      for: 5m
      labels:
        severity: notify
    - alert: ContainerdPodRemoveFailed
      annotations:
        description: |
          Containerd on **{{ .Labels.instance }}** has failed to {{ .Labels.grpc_method }} due to **{{ .Labels.grpc_code }}**.
      expr: |
        sum by(grpc_code, grpc_method, instance) (rate(grpc_server_handled_total{job="containerd", grpc_code!="OK", grpc_method=~"StopPodSandbox|RemovePodSandbox"})) > 0
      for: 5m
      labels:
        severity: notify
    - alert: ContainerdHasLeftoverContainersRunning
      annotations:
        description: |
          Containerd on **{{ .Labels.instance }}** has extra containers that aren't known to Kubernetes.
      expr: |
        (
          container_pids_current{namespace="k8s.io"}
          unless on (container_id)
          label_replace(
            kube_pod_container_info,
            "container_id", "$1", "container_id", "containerd://(.+)"
          )
          unless on (container_id)
          label_replace(
            kube_pod_init_container_info,
            "container_id", "$1", "container_id", "containerd://(.+)"
          )
        )
        unless on (container_id)
        (
          container_memory_swap_limit_bytes > 0
        )
      for: 15m
      labels:
        severity: notify
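How the rules get loaded depends on how you run vmalert. For completeness, if you happen to manage things with the VictoriaMetrics operator, the group above can be wrapped in a VMRule resource - a rough sketch, with made-up metadata:
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: containerd
  namespace: monitoring
spec:
  groups:
    - name: containerd
      rules:
        - alert: ContainerdPodStartFailed
          annotations:
            description: |
              Containerd on **{{ .Labels.instance }}** has failed to start a pod sandbox due to **{{ .Labels.grpc_code }}**.
          expr: |
            sum by(grpc_code, instance) (rate(grpc_server_handled_total{job="containerd", grpc_code!="OK", grpc_method="RunPodSandbox"})) > 0
          for: 5m
          labels:
            severity: notify
        # ContainerdPodRemoveFailed and ContainerdHasLeftoverContainersRunning
        # from the group above slot in here in exactly the same way.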
Things I learnt
I'd never really used crictl much before this, so it was good to be able to better understand it. One thing I didn't realise was that it also has the concept of "pods" and "containers", which are different from, but map to, those found in Kubernetes. A "pod" seems to refer to the sandbox container - that is, the container that is created to set up all the networking etc. that is needed for the workloads (this is the "pause container"). Pods can be listed with crictl pods and inspected with crictl inspectp [ID]. Containers can be listed with crictl ps and inspected with crictl inspect.
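For future me, the way the two halves line up looks roughly like this - the --pod filter on crictl ps is the handy bit that ties a sandbox to the containers running inside it:
crictl pods                     # list pod sandboxes
crictl inspectp [POD_ID]        # inspect a single sandbox
crictl ps --pod [POD_ID]        # list the containers running inside that sandbox
crictl inspect [CONTAINER_ID]   # inspect a single container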
Before this I didn't really know what a CNI actually did in terms of the things it creates, and I still mostly don't, but I've at least seen some hints of it in the form of the network cache files it creates.
Debugging Talos when it goes wrong at the host level is tricky as you have limited access to the node itself. This is often good, as it prevents you from making live changes that could be forgotten or accidentally reverted, but when you actually need to fix something like this it just gets in the way. I'm very thankful for kubectl debug (and Rory's wonderful alpine-containertools image 🙏) for making this debugging possible and relatively painless.