Scale In EKS Cluster¶
EKF supports automatic scaling operations on the Kubernetes cluster using a modified version of the Cluster Autoscaler that supports Rok volumes.
This guide will walk you through manually scaling in your EKS cluster by selecting and removing nodes one by one.
See also
- Scale In Kubernetes Cluster using rok-k8s-drain to forcefully scale your EKS cluster to a desired size.
What You’ll Need¶
A configured management environment.
An existing EKS cluster.
One or more managed or self-managed node groups.
(Optional) A working Cluster Autoscaler.
Procedure¶
List the Kubernetes nodes of your cluster:
root@rok-tools:~# kubectl get nodes
NAME                                               STATUS   ROLES    AGE   VERSION
ip-192-168-147-191.eu-central-1.compute.internal   Ready    <none>   18d   v1.23.13-eks-ba74326
ip-192-168-168-207.eu-central-1.compute.internal   Ready    <none>   18d   v1.23.13-eks-ba74326

Specify the node you want to remove:
root@rok-tools:~# export NODE=<NODE>

Replace <NODE> with the node name. For example:

root@rok-tools:~# export NODE=ip-192-168-168-207.eu-central-1.compute.internal

Note
Normally, the Cluster Autoscaler finds a scale-in candidate automatically. In order to find a good candidate manually, you have to:
- Pick an underutilized node.
- Ensure that you don’t try to scale in past the ASG’s minSize.
- Ensure that existing EBS volumes are reachable from other nodes in the cluster.
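The minSize check can be sketched as a small shell guard. This is a hypothetical helper, not part of EKF; in practice the desired size and minSize values would come from `aws autoscaling describe-auto-scaling-groups`:

```shell
# Hypothetical helper: verify that removing one node keeps the Auto
# Scaling group at or above its minSize. The numeric arguments stand
# in for values you would read from the ASG itself.
can_scale_in() {
    local desired=$1 min=$2
    # Removing one node must not drop the group below minSize.
    [ $((desired - 1)) -ge "${min}" ]
}

if can_scale_in 3 2; then echo "OK to remove one node"; else echo "already at minSize"; fi
if can_scale_in 2 2; then echo "OK to remove one node"; else echo "already at minSize"; fi
```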
Start a drain operation for the selected node:
root@rok-tools:~# kubectl drain --ignore-daemonsets --delete-local-data ${NODE?}
...
node/ip-192-168-168-207.eu-central-1.compute.internal evicted

Note
This may take a while, since Rok is unpinning all volumes on this node and, as such, rok-csi-guard pods are expected to be evicted last.

Warning
Do not delete rok-csi-guard pods manually, since this might cause data loss.

Troubleshooting
The command does not complete.
Most likely, the unpinning of a Rok PVC is failing. Inspect the logs of the Rok CSI controller to debug further.
Delete the master Rok Pod if it runs on the selected node:
Note
The Rok master Pod, which runs as part of the Rok DaemonSet, has the cluster-autoscaler.kubernetes.io/safe-to-evict: false annotation, which prevents the Cluster Autoscaler from removing the node. To allow the Cluster Autoscaler to remove the node, you need to delete this Pod so that another Rok Pod gets elected as master.

Delete the master Rok Pod:
root@rok-tools:~# kubectl get pods -n rok -l app=rok,role=master \
>    --field-selector spec.nodeName==${NODE?} -ojson \
>    | jq -r '.items[] | .metadata.name' \
>    | xargs -r kubectl delete pod -n rok
pod "rok-c2fj5" deleted

If the previous command does not produce any output, that is normal: it indicates that the Rok master Pod does not run on the selected node.
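The `-r` flag of xargs (`--no-run-if-empty`) is what makes an empty result harmless: when the selector matches no Pods, no delete command runs at all. A minimal illustration, with `echo` standing in for `kubectl delete`:

```shell
# With empty input, `xargs -r` runs nothing at all; without -r, GNU
# xargs would still invoke the command once with no arguments.
printf '' | xargs -r echo "deleting:"            # prints nothing

# With one Pod name on stdin, the command runs once with that argument.
printf 'rok-c2fj5\n' | xargs -r echo "deleting:" # deleting: rok-c2fj5
```

Note that `-r` is a GNU extension; BSD xargs skips empty input by default.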
Wait until a Rok Pod has been elected as master:
root@rok-tools:~# watch kubectl get pods -n rok -l app=rok,role=master
Every 2.0s: kubectl get pods -n rok -l app=rok,role=master    rok-tools: Tue Mar 7 14:19:35 2023

NAME        READY   STATUS    RESTARTS   AGE
rok-ghb9q   2/2     Running   0          1m

Ensure that the new Rok Pod that is created on the selected node has not been elected as master:
root@rok-tools:~# kubectl get pods -n rok -l app=rok,role=master \
>    --field-selector spec.nodeName==${NODE?} -ojson \
>    | jq -e '.items == []' >/dev/null && echo OK || echo FAIL
OK

Troubleshooting
The output of the command is FAIL.
The new Rok Pod of the selected node has been elected as master again. Repeat the previous step, i.e., delete the Rok Pod of the selected node again, to trigger a new election.
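The jq expression in the check above simply tests whether the API returned an empty Pod list; the `-e` flag makes jq's exit status reflect the result, which the `&&`/`||` chain turns into OK or FAIL. A standalone sketch of the same check against inline sample JSON:

```shell
# No master Pod on the node: the API returns an empty items list,
# `.items == []` evaluates to true, and `jq -e` exits 0.
echo '{"items": []}' | jq -e '.items == []' >/dev/null && echo OK || echo FAIL

# A non-empty list (one Pod) makes the same check exit non-zero.
echo '{"items": [{"metadata": {"name": "rok-ghb9q"}}]}' \
    | jq -e '.items == []' >/dev/null && echo OK || echo FAIL
```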
Once the drain operation completes, remove the node.
Fast Forward
Skip this step if you have a Cluster Autoscaler instance running in your cluster: it will see the drained node, consider it unneeded, and, after a period of time (based on the scale-down-unneeded-time option), automatically terminate the EC2 instance and decrement the desired size of the Auto Scaling group.

Find the EC2 instance of the drained node:
root@rok-tools:~# export INSTANCE=$(kubectl get nodes ${NODE?} \
>    -o jsonpath={.spec.providerID} \
>    | sed 's|aws:///.*/||')

Terminate the instance and decrement the desired capacity of its Auto Scaling group:
root@rok-tools:~# aws autoscaling terminate-instance-in-auto-scaling-group \
>    --instance-id ${INSTANCE?} \
>    --should-decrement-desired-capacity
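The sed expression used when setting INSTANCE extracts the EC2 instance ID from the node's providerID, which has the form aws:///&lt;availability-zone&gt;/&lt;instance-id&gt;. A standalone sketch with a sample value (the availability zone here is illustrative):

```shell
# Sample providerID, as reported in a node's .spec.providerID field.
PROVIDER_ID="aws:///eu-central-1b/i-0f992f0b02d777901"

# The greedy .* matches up to the last slash, so the substitution
# strips everything before the instance ID.
INSTANCE=$(echo "${PROVIDER_ID}" | sed 's|aws:///.*/||')
echo "${INSTANCE}"    # i-0f992f0b02d777901
```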
Verify¶
Ensure that the selected node has been removed from your Kubernetes cluster:
root@rok-tools:~# kubectl get nodes ${NODE?}
Error from server (NotFound): nodes "ip-192-168-168-207.eu-central-1.compute.internal" not found

Ensure that the underlying EC2 instance has been terminated:
root@rok-tools:~# aws ec2 describe-instances --instance-ids ${INSTANCE?}
An error occurred (InvalidInstanceID.NotFound) when calling the DescribeInstances operation: The instance ID 'i-0f992f0b02d777901' does not exist
What’s Next¶
Check out the rest of the EKS maintenance operations that you can perform on your cluster.