The operating system on one of the cluster's master nodes failed, so the OS had to be reinstalled and the node added back into the existing cluster as a master.
After installing Docker and Kubernetes, the `kubeadm join` command to rejoin the cluster kept failing with the following error:
```
The reset process does not clean your kubeconfig files and you must remove them manually.
Please, check the contents of the $HOME/.kube/config file.
root@master03:/etc/kubernetes/manifests# kubeadm join <ip:port> --token 7gmpm8.yxxahvalc8exemrg \
    --discovery-token-ca-cert-hash sha256:504ee4ac923aed4650dfff7b8e8fc74025274c59d03ec3bc19eb36f9fef096fd \
    --control-plane --certificate-key 667d9d4b82431911ff6f8505d68d5cde9e05182355b01e8ea25a0bd9d7201691
[preflight] Running pre-flight checks
    [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.5. Latest validated version: 18.09
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [master03 localhost] and IPs [<ip> 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [master03 localhost] and IPs [<ip> 127.0.0.1 ::1]
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [master03 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local kube.iiis.co] and IPs [10.96.0.1 <ip>]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[endpoint] WARNING: port specified in controlPlaneEndpoint overrides bindPort in the controlplane address
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
error execution phase check-etcd: etcd cluster is not healthy: failed to dial endpoint https://<ip>:2379 with maintenance client: context deadline exceeded
To see the stack trace of this error execute with --v=5 or higher
```
Investigation showed that the problem was stale etcd data for master03: the old master03 was still registered as a member of the etcd cluster, so the `check-etcd` phase tried to dial its now-unreachable endpoint and timed out. The stale member has to be removed from etcd before master03 can rejoin the cluster as a master.
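Before deleting anything, you can confirm the diagnosis by asking etcd itself which endpoints are unhealthy. A minimal check, assuming the same etcd Pod name and certificate paths used in the steps below (the `--cluster` flag needs etcd 3.3 or newer):

```
kubectl exec -n kube-system etcd-k8s-master01 -- sh -c \
  'ETCDCTL_API=3 etcdctl \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     endpoint health --cluster'
# A live member reports "... is healthy"; the stale master03 endpoint should
# fail with a timeout, matching the context-deadline error in the join log.
```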
Exec into any one of the cluster's etcd Pods:
```
kubectl exec -it etcd-k8s-master01 sh -n kube-system
export ETCDCTL_API=3

# List the members
etcdctl --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
        --cert="/etc/kubernetes/pki/etcd/server.crt" \
        --key="/etc/kubernetes/pki/etcd/server.key" member list

# Note the member ID in the leftmost column of the output, then remove that member
etcdctl --cacert="/etc/kubernetes/pki/etcd/ca.crt" \
        --cert="/etc/kubernetes/pki/etcd/server.crt" \
        --key="/etc/kubernetes/pki/etcd/server.key" member remove <member_id>

# After the removal succeeds, run `member list` again to confirm the member is gone
```
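For reference, with `ETCDCTL_API=3` the `member list` command prints one member per line with the hexadecimal member ID in the leftmost column. The IDs and addresses below are made up purely to show the shape of the output; the entry for the dead master03 is the one to remove:

```
1a2b3c4d5e6f7a8b, started, k8s-master01, https://10.0.0.1:2380, https://10.0.0.1:2379
9f8e7d6c5b4a3f2e, started, k8s-master02, https://10.0.0.2:2380, https://10.0.0.2:2379
c0ffee1234567890, started, master03, https://10.0.0.3:2380, https://10.0.0.3:2379
```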
After exiting the etcd container, reset the new node and rejoin the cluster:
```
kubeadm reset -f
kubeadm join <ip:port> --token 7gmpm8.yxxahvalc8exemrg \
    --discovery-token-ca-cert-hash sha256:504ee4ac923aed4650dfff7b8e8fc74025274c59d03ec3bc19eb36f9fef096fd \
    --control-plane --certificate-key 667d9d4b82431911ff6f8505d68d5cde9e05182355b01e8ea25a0bd9d7201691
```
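Note the warning at the top of the earlier log: `kubeadm reset` does not clean kubeconfig files. If this node previously ran kubectl, it may be worth clearing the stale config before copying in the freshly generated admin.conf:

```
rm -f $HOME/.kube/config
```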
This time the master node was added to the cluster successfully.
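As a quick sanity check from any working master (the `component=etcd` label assumes kubeadm's default static-Pod manifests):

```
kubectl get nodes                                    # master03 should appear and eventually go Ready
kubectl -n kube-system get pods -l component=etcd    # one etcd Pod per control-plane node, all Running
```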