While checking the OKD cluster in our development environment, I saw that one of the worker nodes had a node lost status.
Root cause analysis
First, check the status of the origin-node service on the affected worker.
# systemctl status origin-node
And here's the output
● origin-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Active: activating (start) since Mon 2022-02-07 12:15:05 UTC; 1min 54s ago
Docs: https://github.com/openshift/origin
Main PID: 104753 (hyperkube)
Memory: 22.9M
CGroup: /system.slice/origin-node.service
└─104753 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-t...
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904780 104753 server.go:418] Version: v1.11.0+d4cacc0
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904830 104753 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.904910 104753 feature_gate.go:230] feature gates: &{map[RotateKubeletServerCertificate:true RotateKubeletClientCertificate:true]}
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.905038 104753 plugins.go:97] No cloud provider specified.
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.905054 104753 server.go:534] No cloud provider specified: "" from the config file: ""
Feb 07 12:15:05 nod03 origin-node[104753]: E0207 12:15:05.919875 104753 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2022-02-02 07:40:00 +0000 UTC
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.919901 104753 bootstrap.go:56] Using bootstrap kubeconfig to generate TLS client cert, key and kubeconfig file
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.920585 104753 certificate_store.go:131] Loading cert/key pair from "/etc/origin/node/certificates/kubelet-client-current.pem".
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.935057 104753 csr.go:105] csr for this node already exists, reusing
Feb 07 12:15:05 nod03 origin-node[104753]: I0207 12:15:05.937670 104753 csr.go:113] csr for this node is still valid
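Most of the log above is informational, and the one line that matters is easy to miss. As a rough sketch, a filter like the following keeps only error and fatal lines. It assumes the log prefix seen above, where a severity letter (I, W, E, F) is followed by the month and day, as in E0207. The function name klog_errors is made up for this example.

```shell
# Keep only error (E) and fatal (F) lines from kubelet-style logs read on stdin.
# Assumes the severity letter is followed by MMDD and a timestamp,
# e.g. "E0207 12:15:05".
klog_errors() {
  grep -E '(^|[[:space:]])[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Example usage on the node (not run here):
#   journalctl -u origin-node --no-pager | klog_errors
```

Run against the log above, this should leave only the bootstrap.go line about the expired bootstrap client certificate.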
Also check the status of the nodes:
# oc get nodes
NAME STATUS ROLES AGE
mst01 Ready infra,master 2y
nod01 Ready compute 2y
nod02 Ready compute 2y
nod03 NotReady compute 1y
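As the output above shows, the origin-node log reports a certificate problem: part of the bootstrap client certificate expired on 2022-02-02, five days before this check. As a result, the worker node nod03 is NotReady.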
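On a cluster with more nodes, spotting the unhealthy one by eye gets harder. A minimal sketch, assuming the table layout shown above (node name in the first column, status in the second); the function name not_ready_nodes is hypothetical:

```shell
# Print the names of nodes whose STATUS does not start with "Ready".
# Reads `oc get nodes` table output on stdin; the header row is skipped.
not_ready_nodes() {
  awk '$1 != "NAME" {
    # STATUS can carry extra flags, e.g. "Ready,SchedulingDisabled",
    # so compare only the part before the first comma.
    split($2, status, ",")
    if (status[1] != "Ready") print $1
  }'
}

# Example usage (not run here):
#   oc get nodes | not_ready_nodes
```

For the node list above, this would print only nod03.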
Steps to fix
Check the CSRs (Certificate Signing Requests) with the following command:
# oc get csr
NAME AGE REQUESTOR CONDITION
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU 19m system:serviceaccount:openshift-infra:node-bootstrapper Pending
The CSR is in the Pending state, so we have to approve it.
# oc get csr -o name | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU approved
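Note that the command above passes every CSR name to the approve command, not just the pending ones. If you would rather approve only what is actually pending, one possible sketch is to filter on the CONDITION column first. The function name pending_csrs is made up here, and it assumes CONDITION is the last column, as in the output shown earlier.

```shell
# Print the names of CSRs whose CONDITION column is exactly "Pending".
# Reads `oc get csr` table output on stdin; the header row is skipped.
pending_csrs() {
  awk '$1 != "NAME" && $NF == "Pending" { print $1 }'
}

# Example usage (not run here):
#   oc get csr | pending_csrs | xargs oc adm certificate approve
```

Review the filtered names before approving them, since approving a CSR grants the requester a signed certificate.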
And check it with oc get csr again:
NAME AGE REQUESTOR CONDITION
csr-q2r52 7s system:node:nod03 Pending
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU 21m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
If a CSR is still pending, as with csr-q2r52 above, approve it the same way as before with oc get csr -o name | xargs oc adm certificate approve. Then check again to make sure every CSR has been approved.
# oc get csr
NAME AGE REQUESTOR CONDITION
csr-q2r52 20s system:node:nod03 Approved,Issued
node-csr-iu4kZ9NYHiltJkndh4k5E9kjQq7jQv6TewLOyOuYeCU 22m system:serviceaccount:openshift-infra:node-bootstrapper Approved,Issued
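As seen above, approving the bootstrap CSR caused the node to submit a second CSR, so the approval had to be repeated. The approve-and-recheck cycle could be sketched as a small loop. This is only an illustration: the function name approve_pending_csrs and the CSR_POLL_SECONDS variable are hypothetical, and the loop assumes oc is already logged in with permission to approve certificates.

```shell
# Approve pending CSRs in rounds until none remain.
# $1: maximum number of approval rounds (default 5).
# Returns 0 once nothing is pending, 1 if CSRs are still pending after
# that many rounds. Caveat: a failing `oc get csr` also looks like
# "nothing pending", so check the oc login first.
approve_pending_csrs() {
  max_rounds=${1:-5}
  round=0
  while :; do
    # The last column of `oc get csr` output is CONDITION.
    pending=$(oc get csr --no-headers | awk '$NF == "Pending" { print $1 }')
    if [ -z "$pending" ]; then
      return 0
    fi
    if [ "$round" -ge "$max_rounds" ]; then
      return 1
    fi
    for name in $pending; do
      oc adm certificate approve "$name"
    done
    round=$((round + 1))
    # Give the node a moment to submit its follow-up CSR.
    sleep "${CSR_POLL_SECONDS:-5}"
  done
}
```

Calling approve_pending_csrs 3 would approve and recheck up to three times. The round limit keeps the loop from running forever if a CSR keeps reappearing.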
Rechecking nodes
Make sure the origin-node service is running and all nodes are Ready.
# systemctl status origin-node
● origin-node.service - OpenShift Node
Loaded: loaded (/etc/systemd/system/origin-node.service; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-02-07 12:24:02 UTC; 40s ago
Docs: https://github.com/openshift/origin
Main PID: 572 (hyperkube)
Memory: 133.4M
CGroup: /system.slice/origin-node.service
└─572 /usr/bin/hyperkube kubelet --v=2 --address=0.0.0.0 --allow-privileged=true --anonymous-auth=true --authentication-token-webhook=true --authentication-token-webhook-cache-ttl=...
# oc get nodes
NAME STATUS ROLES AGE
mst01 Ready infra,master 2y
nod01 Ready compute 2y
nod02 Ready compute 2y
nod03 Ready compute 1y
If everything is running well, finish by checking the nodes in the OKD web console.