Scale Up Rok etcd
This guide walks you through increasing the size of your Rok etcd cluster by one additional member. To add more members, simply run through the guide again.
Important
To withstand a node failure, use a cluster with at least three members. We highly recommend an odd number of members, and no more than seven.
See also
- Official guide on etcd clustering.
- etcd learner design.
What You’ll Need
- A configured management environment.
- Your clone of the Arrikto GitOps repository.
- An existing Kubernetes cluster.
- An existing Rok deployment.
Check Your Environment
Retrieve the endpoints of all etcd cluster members:
```
root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
>     exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
>     | jq -r '.members[].clientURLs[]' | paste -sd, -)
```

Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:
```
root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
+--------------------------------------+--------+------------+-------+
|               ENDPOINT               | HEALTH |    TOOK    | ERROR |
+--------------------------------------+--------+------------+-------+
| rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
| rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
+--------------------------------------+--------+------------+-------+
```
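The endpoint-collection pipeline above can be exercised locally. The sketch below runs the same jq filter and paste join on canned, trimmed `member list -w json` output (not live data), assuming jq is installed:

```shell
# Sketch: how ETCD_ENDPOINTS is assembled. jq prints one client URL
# per line, and paste joins the lines into the comma-separated list
# that etcdctl's --endpoints flag expects.
# JSON below is canned example data mirroring this guide's cluster.
JSON='{"members":[
  {"clientURLs":["http://rok-etcd-0.rok-etcd-cluster.rok:2379"]},
  {"clientURLs":["http://rok-etcd-1.rok-etcd-cluster.rok:2379"]}]}'
ETCD_ENDPOINTS=$(echo "$JSON" | jq -r '.members[].clientURLs[]' | paste -sd, -)
echo "$ETCD_ENDPOINTS"
```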
Procedure
Go to your GitOps repository, inside your rok-tools management environment:

```
root@rok-tools:~# cd ~/ops/deployments
```

Retrieve the current size of the etcd cluster:
```
root@rok-tools:~/ops/deployments# export ETCD_CLUSTER_SIZE=$(kubectl get sts \
>     -n rok rok-etcd -o jsonpath="{.spec.replicas}") \
>     && echo ${ETCD_CLUSTER_SIZE?}
2
```

Set the name of the new etcd member:
```
root@rok-tools:~/ops/deployments# export \
>     NAME=rok-etcd-${ETCD_CLUSTER_SIZE?}.rok-etcd-cluster.rok
```

Set the URL of the new etcd member:
```
root@rok-tools:~/ops/deployments# export \
>     PEER_URL=http://rok-etcd-${ETCD_CLUSTER_SIZE?}.rok-etcd-cluster.rok:2380
```

Add a new member to the etcd cluster:
```
root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl member add --learner ${NAME?} --peer-urls ${PEER_URL?}
Member 49a1544e41ae84e4 added to cluster 844c2991de84c0b
ETCD_NAME="rok-etcd-2.rok-etcd-cluster.rok"
ETCD_INITIAL_CLUSTER="rok-etcd-2.rok-etcd-cluster.rok=http://rok-etcd-2.rok-etcd-cluster.rok:2380,rok-etcd-1.rok-etcd-cluster.rok=http://rok-etcd-1.rok-etcd-cluster.rok:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://rok-etcd-2.rok-etcd-cluster.rok:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
```

Troubleshooting
Error: etcdserver: unhealthy cluster
There are cases, mostly due to a network hiccup, where an existing member rejoins the cluster, for example after a Pod restart, and other members end up considering it inactive. In such a case, member add fails with:
```
{"level":"warn","ts":"2022-09-23T09:52:00.805Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000458a80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
Error: etcdserver: unhealthy cluster
```

At the same time, the etcd cluster remains operational and clients are able to access it and make read/write requests.
To recover, follow the steps below:
Retrieve the endpoints of all etcd cluster members:
```
root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
>     exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
>     | jq -r '.members[].clientURLs[]' | paste -sd, -)
```

Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:
```
root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
+--------------------------------------+--------+------------+-------+
|               ENDPOINT               | HEALTH |    TOOK    | ERROR |
+--------------------------------------+--------+------------+-------+
| rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
| rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
+--------------------------------------+--------+------------+-------+
```

Restart the etcd Pods:
```
root@rok-tools:~/# kubectl delete pods -n rok -l app=etcd
```

Rerun the command to add a member to the etcd cluster.
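The naming scheme used when adding the member follows the StatefulSet's zero-based Pod indices: with a current size of N, the next Pod is rok-etcd-N. A minimal sketch, using the example values from this guide:

```shell
# Sketch: deriving the new member's name and peer URL from the
# current StatefulSet size. Pod indices are zero-based, so a cluster
# of size 2 has rok-etcd-0 and rok-etcd-1, and the next member is
# rok-etcd-2. Values mirror the example output above.
ETCD_CLUSTER_SIZE=2
NAME=rok-etcd-${ETCD_CLUSTER_SIZE}.rok-etcd-cluster.rok
PEER_URL=http://rok-etcd-${ETCD_CLUSTER_SIZE}.rok-etcd-cluster.rok:2380
echo "$NAME"
echo "$PEER_URL"
```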
Increase the etcd cluster size:
```
root@rok-tools:~/ops/deployments# let ETCD_CLUSTER_SIZE++
```

Render the patch for the cluster size:
```
root@rok-tools:~/ops/deployments# j2 \
>     rok/rok-external-services/etcd/overlays/deploy/patches/cluster-size.yaml.j2 \
>     -o rok/rok-external-services/etcd/overlays/deploy/patches/cluster-size.yaml
```

Set the cluster state:
```
root@rok-tools:~/ops/deployments# export ETCD_CLUSTER_STATE=existing
```

Render the patch for the cluster state:
```
root@rok-tools:~/ops/deployments# j2 \
>     rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml.j2 \
>     -o rok/rok-external-services/etcd/overlays/deploy/patches/cluster-state.yaml
```

Edit rok/rok-external-services/etcd/overlays/deploy/kustomization.yaml and ensure that both cluster-size and cluster-state patches are enabled:

```
patches:
- path: patches/cluster-size.yaml
  target:
    kind: StatefulSet
    name: etcd
- path: patches/cluster-state.yaml
```

Commit your changes:
```
root@rok-tools:~/ops/deployments# git commit -am "Scale Rok etcd to ${ETCD_CLUSTER_SIZE?} members"
```

Apply the kustomization:
```
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-external-services/etcd/overlays/deploy
```

Wait for a few minutes to give the new member a chance to join the cluster, then retrieve its member ID. Ensure the following command outputs SUCCESS:
```
root@rok-tools:~/ops/deployments# export ID=$(kubectl \
>     exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl member list -w json --hex \
>     | jq -r '.members[] | select(.name == "'${NAME?}'") | .ID') \
>     && [[ -z "${ID?}" ]] && echo ERROR || echo SUCCESS
SUCCESS
```

Troubleshooting
The command output is ERROR
If the new member has not yet managed to join the cluster, its name will be empty and the above command will output ERROR. In this case, wait a few minutes to allow the new member to start and join the cluster, and try again.
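The jq selection used in the member-ID check can be tried locally. The sketch below applies the same filter to canned, trimmed `member list -w json --hex` output (hypothetical data mirroring this guide's examples, not a live query):

```shell
# Sketch: extracting a member's hex ID by name. If the member has not
# joined yet, select() matches nothing, ID stays empty, and the check
# prints ERROR; otherwise it prints SUCCESS.
# Canned example document with only the fields the filter touches.
JSON='{"members":[
  {"ID":"b2ff88bb2eae13b7","name":"rok-etcd-0.rok-etcd-cluster.rok"},
  {"ID":"49a1544e41ae84e4","name":"rok-etcd-2.rok-etcd-cluster.rok"}]}'
NAME=rok-etcd-2.rok-etcd-cluster.rok
ID=$(echo "$JSON" | jq -r '.members[] | select(.name == "'${NAME}'") | .ID')
[ -z "$ID" ] && echo ERROR || echo SUCCESS
```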
Promote the new member to a voting member:
```
root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl member promote ${ID?}
Member 49a1544e41ae84e4 promoted in cluster 4c194b295a903d33
```

Troubleshooting
The member is not in sync with the leader
If the above command fails with the following error:
```
Error: etcdserver: can only promote a learner member which is in sync with leader
```

it means that you tried to promote the new member before it managed to catch up with the cluster. In this case, wait a few more minutes and try again.
Verify
Ensure that all Rok etcd Pods are ready. Verify that field READY is 2/2 and field STATUS is Running for all Pods:
```
root@rok-tools:~/ops/deployments# kubectl get pods -n rok -l app=etcd
NAME         READY   STATUS    RESTARTS   AGE
rok-etcd-0   2/2     Running   0          2d22h
rok-etcd-1   2/2     Running   0          2d22h
rok-etcd-2   2/2     Running   0          2d22h
```

Retrieve the endpoints of all etcd cluster members:
```
root@rok-tools:~/ops/deployments# export ETCD_ENDPOINTS=$(kubectl \
>     exec -ti -n rok sts/rok-etcd -- etcdctl member list -w json \
>     | jq -r '.members[].clientURLs[]' | paste -sd, -)
```

Ensure that the etcd cluster is currently healthy. Inspect the etcd endpoints and verify that the HEALTH field is true for all endpoints:
```
root@rok-tools:~/# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl --endpoints ${ETCD_ENDPOINTS?} endpoint health -w table
+--------------------------------------+--------+------------+-------+
|               ENDPOINT               | HEALTH |    TOOK    | ERROR |
+--------------------------------------+--------+------------+-------+
| rok-etcd-0.rok-etcd-cluster.rok:2379 |  true  | 9.302141ms |       |
| rok-etcd-1.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
| rok-etcd-2.rok-etcd-cluster.rok:2379 |  true  | 9.325642ms |       |
+--------------------------------------+--------+------------+-------+
```

Ensure that the Rok etcd cluster has the expected member count. Verify that the output of the following command equals the new cluster size, in this example 3:

```
root@rok-tools:~# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl member list | wc -l
3
```

List the members of the etcd cluster. Verify that field STATUS is started and field IS LEARNER is false for all members:
```
root@rok-tools:~/ops/deployments# kubectl exec -ti -n rok sts/rok-etcd -c etcd -- \
>     etcdctl member list -w table
+------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
|        ID        | STATUS  |              NAME               |                 PEER ADDRS                  |                CLIENT ADDRS                 | IS LEARNER |
+------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
| b2ff88bb2eae13b7 | started | rok-etcd-0.rok-etcd-cluster.rok | http://rok-etcd-0.rok-etcd-cluster.rok:2380 | http://rok-etcd-0.rok-etcd-cluster.rok:2379 |      false |
| f823900dacf44825 | started | rok-etcd-1.rok-etcd-cluster.rok | http://rok-etcd-1.rok-etcd-cluster.rok:2380 | http://rok-etcd-1.rok-etcd-cluster.rok:2379 |      false |
| 49a1544e41ae84e4 | started | rok-etcd-2.rok-etcd-cluster.rok | http://rok-etcd-2.rok-etcd-cluster.rok:2380 | http://rok-etcd-2.rok-etcd-cluster.rok:2379 |      false |
+------------------+---------+---------------------------------+---------------------------------------------+---------------------------------------------+------------+
```
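These final checks can also be scripted. The sketch below runs the same count and learner checks on canned `etcdctl member list` output in its default comma-separated format (the data mirrors the example table above; against a live cluster you would pipe the real command instead):

```shell
# Sketch: verifying member count and voting status on canned
# `etcdctl member list` output. The last field is IS LEARNER;
# every member should report false after promotion.
LIST='b2ff88bb2eae13b7, started, rok-etcd-0.rok-etcd-cluster.rok, http://rok-etcd-0.rok-etcd-cluster.rok:2380, http://rok-etcd-0.rok-etcd-cluster.rok:2379, false
f823900dacf44825, started, rok-etcd-1.rok-etcd-cluster.rok, http://rok-etcd-1.rok-etcd-cluster.rok:2380, http://rok-etcd-1.rok-etcd-cluster.rok:2379, false
49a1544e41ae84e4, started, rok-etcd-2.rok-etcd-cluster.rok, http://rok-etcd-2.rok-etcd-cluster.rok:2380, http://rok-etcd-2.rok-etcd-cluster.rok:2379, false'
COUNT=$(echo "$LIST" | wc -l)
LEARNERS=$(echo "$LIST" | grep -c ', true$' || true)
echo "members=${COUNT} learners=${LEARNERS}"
```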
What’s Next
Check out the rest of the maintenance operations you can perform on your Rok etcd cluster.