Disable Unsafe Operations for Your EKS Cluster¶
Since Rok uses local NVMe disks to store user data, terminating/replacing a node before properly draining it would result in data loss. For example, scaling down the node group directly via the Auto Scaling group will eventually remove the instance after about 15 minutes, regardless of whether a drain operation takes place.
This guide describes the actions that you, as the administrator, should take to protect your EKS cluster from losing any data. Specifically, it will walk you through
- enabling scale-in protection.
- suspending Auto Scaling processes that would result in a node termination.
Warning
If an EC2 instance (EKS worker node) terminates in an unexpected manner, you will lose data. As such, you should avoid the following actions:
- Decrement the desired size of the Auto Scaling group.
- Terminate an EC2 instance directly from the console.
- Delete a whole node group.
Fast Forward
If you have already disabled unsafe operations in your cluster, you can skip the procedure below.
- Proceed to the Verify section.
Overview
What You’ll Need¶
- A configured management environment.
- An existing EKS cluster.
Procedure¶
1. Go to your GitOps repository, inside your rok-tools management environment:

   root@rok-tools:~# cd ~/ops/deployments

2. Restore the required context from previous sections:

   root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
   root@rok-tools:~/ops/deployments# export EKS_CLUSTER

3. List the Auto Scaling groups of your EKS cluster:

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
   >    --query AutoScalingGroups[].[AutoScalingGroupName] \
   >    --output text
   eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b
   eks-aebc1fd1-3b78-8761-606e-ca8502549661

4. Repeat the steps below for each one of the Auto Scaling groups in the list shown above.

   a. Specify the ASG to operate on:

      root@rok-tools:~/ops/deployments# export ASG=<ASG>

      Replace <ASG> with the name of your Auto Scaling group. For example:

      root@rok-tools:~/ops/deployments# export ASG=eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b

   b. Enable scale-in protection at the ASG level for new instances:

      root@rok-tools:~/ops/deployments# aws autoscaling update-auto-scaling-group \
      >    --auto-scaling-group-name ${ASG?} \
      >    --new-instances-protected-from-scale-in

   c. Enable scale-in protection at the instance level for existing instances:

      root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
      >    --auto-scaling-group-name ${ASG?} \
      >    --query AutoScalingGroups[].Instances[].InstanceId \
      >    --output text \
      >    | xargs aws autoscaling set-instance-protection \
      >    --auto-scaling-group-name ${ASG?} \
      >    --protected-from-scale-in \
      >    --instance-ids

   d. Suspend any unsafe Auto Scaling processes:

      root@rok-tools:~/ops/deployments# aws autoscaling suspend-processes \
      >    --auto-scaling-group-name ${ASG?} \
      >    --scaling-processes ReplaceUnhealthy AZRebalance InstanceRefresh

      Note

      The Auto Scaling processes above are considered unsafe since they may cause ungraceful node termination. Specifically:

      - ReplaceUnhealthy automatically replaces EC2 instances whose status checks have failed. With this process suspended, unhealthy EC2 instances will remain in service and will require manual action. See the Manage Unhealthy Instances guide for more information.
      - AZRebalance automatically rebalances your instances across existing availability zones. Since EKF uses EBS volumes, we recommend using node groups that span a single AZ, so suspending this process should not make a difference.
      - InstanceRefresh performs a rolling replacement of all or some instances in your Auto Scaling group. This might be useful when you want to update the launch template configuration of the node group. However, we do not recommend using this feature for upgrades. Follow the Upgrade EKS Node Group guide instead.

   e. Go back to step a and repeat this process for the remaining Auto Scaling groups.
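If your cluster has many node groups, the per-ASG steps above can be applied in a single pass. The following is a minimal sketch, not part of the official rok-tools workflow: `protect_asg` is a hypothetical helper name of ours, and the driver loop assumes `EKS_CLUSTER` is exported as in step 2 and the AWS CLI is configured.

```shell
# Apply scale-in protection and suspend unsafe processes on one ASG.
protect_asg() {
    local asg="$1"

    # Protect new instances at the ASG level.
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name "${asg}" \
        --new-instances-protected-from-scale-in

    # Protect existing instances at the instance level.
    aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-name "${asg}" \
        --query 'AutoScalingGroups[].Instances[].InstanceId' \
        --output text \
        | xargs aws autoscaling set-instance-protection \
            --auto-scaling-group-name "${asg}" \
            --protected-from-scale-in \
            --instance-ids

    # Suspend processes that may terminate nodes ungracefully.
    aws autoscaling suspend-processes \
        --auto-scaling-group-name "${asg}" \
        --scaling-processes ReplaceUnhealthy AZRebalance InstanceRefresh
}

# Run the helper for every ASG of the cluster, if the context is set.
if [ -n "${EKS_CLUSTER:-}" ]; then
    aws autoscaling describe-auto-scaling-groups \
        --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER} \
        --query 'AutoScalingGroups[].AutoScalingGroupName' \
        --output text \
        | tr '\t' '\n' \
        | while read -r asg; do protect_asg "${asg}"; done
fi
```

Treat this as a convenience only; the manual per-ASG steps remain the reference procedure, and you should still run the Verify section afterwards.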
Verify¶
1. Go to your GitOps repository, inside your rok-tools management environment:

   root@rok-tools:~# cd ~/ops/deployments

2. Restore the required context from previous sections:

   root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
   root@rok-tools:~/ops/deployments# export EKS_CLUSTER

3. Ensure that all Auto Scaling groups associated with your cluster have scale-in protection enabled:

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
   >    --query AutoScalingGroups[].[AutoScalingGroupName,NewInstancesProtectedFromScaleIn] \
   >    --output text
   eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b  True
   eks-aebc1fd1-3b78-8761-606e-ca8502549661  True

4. Ensure that all running Auto Scaling instances of your cluster have scale-in protection enabled:

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
   >    --query AutoScalingGroups[].Instances[].[InstanceId,ProtectedFromScaleIn] \
   >    --output text
   i-03696c6a5abe28646  True
   i-07898559e258823c8  True

5. Ensure that you have suspended the ReplaceUnhealthy, AZRebalance, and InstanceRefresh Auto Scaling processes for all Auto Scaling groups of your EKS cluster:

   root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
   >    --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
   >    --query AutoScalingGroups[].[AutoScalingGroupName,SuspendedProcesses[].ProcessName] \
   >    --output text | paste - - | column -t
   eks-a6be5e9a-d296-09a7-6e7d-e5bc39d9f00b  ReplaceUnhealthy  AZRebalance  InstanceRefresh
   eks-aebc1fd1-3b78-8761-606e-ca8502549661  ReplaceUnhealthy  AZRebalance  InstanceRefresh
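If you want to turn the visual inspection above into an automated check, you can filter the same CLI output for entries that are not protected. This is a sketch under our own naming (`check_protection` is not part of rok-tools); it reads the name/flag pairs in the format produced by the scale-in protection queries above.

```shell
# Read "<name> <True|False>" pairs from stdin and fail if any entry
# does not report scale-in protection as enabled.
check_protection() {
    local unprotected
    unprotected=$(awk '$2 != "True" { print $1 }')
    if [ -n "${unprotected}" ]; then
        echo "Scale-in protection is disabled for: ${unprotected}"
        return 1
    fi
    echo "All entries are protected from scale-in."
}
```

For example, piping the output of the describe-auto-scaling-groups query from step 3 or step 4 into `check_protection` returns a non-zero exit status whenever any ASG or instance is left unprotected, which makes the check easy to wire into a periodic job.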