Manage Unhealthy Instances¶
If you have followed the Disable Unsafe Operations for Your EKS Cluster guide, you have suspended the ReplaceUnhealthy process. Therefore, even if the ASG marks an instance as unhealthy, it will remain in service and will require manual action. This guide walks you through the manual actions required, based on whether the failure was temporary or permanent.
See also: Overview
What You’ll Need¶
- A configured management environment.
- An existing EKS cluster.
Procedure¶
Go to your GitOps repository, inside your rok-tools management environment:

root@rok-tools:~# cd ~/ops/deployments

Restore the required context from previous sections:

root@rok-tools:~/ops/deployments# source deploy/env.eks-cluster
root@rok-tools:~/ops/deployments# export EKS_CLUSTER

Inspect the health status of your instances:

root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
>     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
>     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
>     --output text
i-03696c6a5abe28646    InService    Healthy
i-07898559e258823c8    InService    Unhealthy
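On a large cluster the full listing can be noisy. As a minimal sketch (the `awk` filter and the `filter_unhealthy` helper name are illustrative additions, not part of the official procedure), you can keep only the IDs of the instances the ASG considers unhealthy:

```shell
# Sketch: reduce the text output of describe-auto-scaling-groups to the
# instance IDs marked Unhealthy. Columns are:
# InstanceId  LifecycleState  HealthStatus
filter_unhealthy() {
    awk '$3 == "Unhealthy" {print $1}'
}

# Usage against live data (assumes the same environment as the steps above):
#   aws autoscaling describe-auto-scaling-groups \
#       --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
#       --query 'AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus]' \
#       --output text | filter_unhealthy

# Demonstration on the sample output shown in this guide:
printf 'i-03696c6a5abe28646 InService Healthy\ni-07898559e258823c8 InService Unhealthy\n' \
    | filter_unhealthy
```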
Specify the instance to operate on:

root@rok-tools:~/ops/deployments# export INSTANCE=<INSTANCE>

Replace <INSTANCE> with the instance ID. For example:

root@rok-tools:~/ops/deployments# export INSTANCE=i-07898559e258823c8

Choose one of the following options, based on whether the failure was temporary or permanent.
In case of a temporary failure, for example, a system crash that froze the node until it eventually rebooted, the EC2 instance can be considered healthy again, and EC2 will report it as such. Reset the health status of your instance manually:

root@rok-tools:~/ops/deployments# aws autoscaling set-instance-health \
>     --health-status Healthy \
>     --instance-id ${INSTANCE?}

In case of a permanent failure, for example, a corrupted file system, the node must be replaced. Terminate your instance:
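Before resetting the ASG health status, you may want to confirm that EC2 itself reports the instance as recovered. This is an optional sketch, not part of the official procedure; the `check_status_pair` helper is an illustrative addition that reads the system and instance status checks reported by `aws ec2 describe-instance-status`:

```shell
# Sketch: a "SystemStatus InstanceStatus" pair of "ok ok" means the instance
# passed both EC2 status checks after the reboot.
check_status_pair() {
    awk '$1 == "ok" && $2 == "ok" {print "recovered"; next} {print "not-ready"}'
}

# Usage against the live instance (assumes $INSTANCE from the steps above):
#   aws ec2 describe-instance-status \
#       --instance-ids ${INSTANCE?} \
#       --query 'InstanceStatuses[].[SystemStatus.Status,InstanceStatus.Status]' \
#       --output text | check_status_pair

# Demonstration on sample output:
printf 'ok ok\n' | check_status_pair
```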
root@rok-tools:~/ops/deployments# aws autoscaling terminate-instance-in-auto-scaling-group \
>     --no-should-decrement-desired-capacity \
>     --instance-id ${INSTANCE?}

The ASG will then see that the desired capacity is greater than the actual size and will create a new instance.
Important
In case of a permanent failure, you will lose the data on the instance. If you have set up Snapshot Policies for Backup, you will be able to go back in time and restore your volumes from the latest available snapshot.
Verify¶
Ensure that all instances associated with your cluster are InService and Healthy:
root@rok-tools:~/ops/deployments# aws autoscaling describe-auto-scaling-groups \
>     --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
>     --query AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus] \
>     --output text
i-03696c6a5abe28646    InService    Healthy
i-07898559e258823c8    InService    Healthy
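After replacing a node, the new instance may take a few minutes to reach InService/Healthy. As a hedged sketch (the `all_healthy` helper and the polling loop are illustrative additions, not part of the official procedure), you can poll the same listing until every instance is healthy:

```shell
# Sketch: succeed (exit 0) only when every instance in the listing is both
# InService and Healthy. Reads the text output of the
# describe-auto-scaling-groups call from the Verify step.
all_healthy() {
    awk '$2 != "InService" || $3 != "Healthy" {bad = 1} END {exit bad}'
}

# Usage against the live cluster (assumes $EKS_CLUSTER from earlier steps):
#   until aws autoscaling describe-auto-scaling-groups \
#       --filters Name=tag-key,Values=kubernetes.io/cluster/${EKS_CLUSTER?} \
#       --query 'AutoScalingGroups[].Instances[].[InstanceId,LifecycleState,HealthStatus]' \
#       --output text | all_healthy; do
#       sleep 10
#   done

# Demonstration on sample output:
if printf 'i-a InService Healthy\ni-b InService Healthy\n' | all_healthy; then
    echo "all instances healthy"
fi
```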
What’s Next¶
Check out the rest of the EKS maintenance operations that you can perform on your cluster.