OKD worker has node lost status

I just checked the OKD cluster in our development environment and saw that one OKD worker node has a node lost status.

Root cause analysis
First, check the status of the origin-node service on the affected worker.
# systemctl status origin-node

And here's the output:
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
   Active: activating (start) since Mon 2022-02-07 12:15:05 UTC; 1min 54s ago
     Docs: https://github.com/openshift/origin
 Main PID: 104753 (hyperkube)
   Memory: 22.9M
   CGroup: /system.slice/origin-node.service
           └─104753 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-t...

Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904780  104753 server.go:418] Version: v1.11.0+d4cacc0
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904830  104753 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904910  104753 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.905038  104753 plugins.go:97] No cloud provider specified.
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.905054  104753 server.go:534] No cloud provider specified: "" from the config file: ""
Feb 07 12:15:05 nod03 origin-node[104753]: E0207 12:15:05.919875  104753 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2022-02-02 07:40:00 +0000 UTC
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.919901  104753 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.920585  104753 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.935057  104753 csr.go:105] csr for this node already exists, reusing
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.937670  104753 csr.go:113] csr for this node is still valid
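
systemctl status only shows the last few log lines. To search the whole service log, the journal can be filtered for error-level entries. The following is a minimal sketch: the function name kubelet_errors is made up for illustration, and it assumes the log lines keep the format shown above, where error entries start with "E" followed by the month and day.

```shell
# kubelet_errors: read journal lines on stdin and keep only the error-level
# entries, i.e. those whose message starts with "E" plus a 4-digit date
# (for example "E0207"). Hypothetical helper, not part of OKD.
kubelet_errors() {
  grep -E ': E[0-9]{4} '
}

# Example, run on the affected worker:
# journalctl -u origin-node --no-pager | kubelet_errors
```

On the log above, this would leave only the bootstrap.go line about the expired certificate.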

Also check the status of the nodes:
# oc get nodes
NAME      STATUS     ROLES          AGE
mst01     Ready      infra,master   2y
nod01     Ready      compute        2y
nod02     Ready      compute        2y
nod03     NotReady   compute        1y

As we can see above, the origin-node service log reports an expired client certificate, and the worker node nod03 is in the NotReady state.
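
To confirm the expiry directly on the node, the certificate file named in the log can be inspected with openssl. This is only a sketch: the helper name cert_check and its optional time window argument are illustrative and not part of OKD, and it assumes openssl is installed on the worker.

```shell
# cert_check FILE [SECONDS]: print the certificate's expiry date and report
# whether it is still valid SECONDS from now (default 0, meaning right now).
# Hypothetical helper; it relies only on standard `openssl x509` options.
cert_check() {
  cert="$1"
  window="${2:-0}"
  # Print the notAfter date; return 2 if the file cannot be read or parsed.
  openssl x509 -noout -enddate -in "$cert" || return 2
  # -checkend exits 0 if the certificate is still valid after $window seconds.
  if openssl x509 -noout -checkend "$window" -in "$cert" >/dev/null; then
    echo "OK: still valid in ${window}s"
  else
    echo "WARN: expired or expiring within ${window}s"
    return 1
  fi
}

# Example (path taken from the log above):
# cert_check /etc/origin/node/certificates/kubelet-client-current.pem
```

Passing a window such as 2592000 (30 days) turns the same check into an early warning before the node drops out again.
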
Steps to fix
Check the CSRs (Certificate Signing Requests) with the following command:
# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU   19m       system:serviceaccount:openshift-infra:node-bootstrapper   Pending

The CSR is in the Pending state, so we have to approve it.
# oc get csr -o name | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU approved

And check it with oc get csr again:
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-q2r52                                              7s        system:node:nod03                                         Pending
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU   21m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued

After the first approval, the node itself (system:node:nod03) submits a new CSR. If any CSR is still in the Pending state, approve it the same way as before with oc get csr -o name | xargs oc adm certificate approve. Then check again to make sure the CSR has been approved.
# oc get csr
NAME                                                   AGE       REQUESTOR                                                 CONDITION
csr-q2r52                                              20s       system:node:nod03                                         Approved,Issued
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU   22m       system:serviceaccount:openshift-infra:node-bootstrapper   Approved,Issued
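
The approve command used above passes every CSR to oc adm certificate approve, including requests that are already issued. A minimal sketch that selects only the Pending ones from the table output follows; the function name pending_csrs is hypothetical, and it assumes CONDITION is the last column, as in the output above.

```shell
# pending_csrs: read `oc get csr` table output on stdin and print the NAME
# of each request whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example, run where oc is logged in with cluster-admin rights.
# -r is a GNU xargs option that skips the call when nothing is pending.
# oc get csr | pending_csrs | xargs -r oc adm certificate approve
```

Printing the list first and reading it before approving is a reasonable habit, since an approved CSR gives the requester a valid cluster certificate.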

Rechecking nodes
Make sure the origin-node service is running and the node status is Ready.
# systemctl status origin-node
● origin-node.service - OpenShift Node
   Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2022-02-07 12:24:02 UTC; 40s ago
     Docs: https://github.com/openshift/origin
 Main PID: 572 (hyperkube)
   Memory: 133.4M
   CGroup: /system.slice/origin-node.service
           └─572 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=...

# oc get nodes
NAME      STATUS     ROLES          AGE
mst01     Ready      infra,master   2y
nod01     Ready      compute        2y
nod02     Ready      compute        2y
nod03     Ready      compute        1y
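
For a scripted recheck, the node table can be filtered for anything that is not Ready. This is a rough sketch, assuming STATUS is the second column as in the output above; not_ready_nodes is an illustrative name, not an oc subcommand.

```shell
# not_ready_nodes: read `oc get nodes` table output on stdin and print the
# NAME of each node whose STATUS is not "Ready".
# "Ready,SchedulingDisabled" counts as ready here; "NotReady" does not.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready(,|$)/ { print $1 }'
}

# Example: prints nothing once every node has recovered.
# oc get nodes | not_ready_nodes
```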

If everything is running well, finally check the node status in the OKD web console too.
