Scaling¶
Running EKF on hundreds of nodes enables your team to efficiently run distributed ML workflows at scale. However, scaling an EKF cluster to hundreds of nodes comes with multiple challenges, such as optimization of resource usage, elimination of bottlenecks and parallelization of operations, where possible.
This document provides useful information, metrics, and conclusions about scaling an EKF cluster to hundreds of nodes on AWS.
Overview¶
Upper Bound¶
At the time of writing, we have tested and support scaling an EKF cluster up to 300 nodes using m5d.4xlarge instances on EKS. Scaling EKF to more than 300 nodes is a work in progress and is not officially supported.
Contact Arrikto
Scaling a cluster to more than 100 nodes can pose a number of challenges. Before you attempt to scale your cluster to more than 100 nodes, please contact the Arrikto Tech Team so we can provide all required information and support.
Below you can view a chart that shows the time needed to scale an EKF cluster to 100, 200, and 300 nodes. For our measurements, in addition to Rok, we deployed the following:
- A lightweight DaemonSet whose Pods become ready almost instantly after they start their execution. We use this as a baseline to perform a side-by-side comparison with the time Rok and Rok CSI need to become ready, since these also run as DaemonSets.
- A Jupyter Kale Deployment with hundreds of replicas that all mount and write to a single RWX Rok volume. We use this as the main workload that triggers a cluster scale up: each Jupyter Kale replica requests enough resources so that Cluster Autoscaler adds a new node for it (see the sketch below).
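For illustration, here is a minimal sketch of such a scale-up workload, assuming a pre-existing RWX PVC; the names, image, and resource requests below are placeholders, sized so that one replica roughly fills one m5d.4xlarge node (16 vCPUs, 64 GiB RAM):

```
# Hypothetical scale-up workload: each replica requests most of a node, so
# Cluster Autoscaler must add one node per replica. All replicas share a
# single RWX PVC, named "shared-rwx" here for illustration.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 300
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        resources:
          requests:
            cpu: "12"
            memory: 48Gi
        volumeMounts:
        - name: shared
          mountPath: /shared
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: shared-rwx
EOF
```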
Note
Depending on your cloud environment, networking setup, container registry, and type of workload, you might observe different times than the ones shown above. For example, pulling a large container image for your workload from a distant container registry will add extra overhead.
From the chart above we draw the following conclusions:
- Baseline becomes ready soon after all EKS cluster nodes become ready.
- Rok needs between 2 and 10 additional minutes to become ready, depending on cluster size.
- Rok CSI becomes ready less than 5 minutes after Rok is ready. This is expected, since Rok CSI depends on Rok.
- The workload becomes ready approximately 5 minutes after Rok CSI. Again, this is expected, since Rok CSI needs to mount the RWX Rok volume on each of the nodes.
- The workload becomes ready approximately 7 minutes after baseline for 100 nodes, 13 minutes after baseline for 200 nodes and 16 minutes after baseline for 300 nodes. This represents the initialization overhead due to all workload replicas using a single Rok RWX volume.
Challenges¶
Scaling to hundreds of nodes comes with multiple challenges, such as optimization of resource usage, elimination of bottlenecks and parallelization of operations. Below are some of the most important things that you should be aware of.
Rok etcd Volume Type¶
Inevitably, on a large-scale EKF cluster that runs hundreds of Rok instances, the traffic from and to Rok etcd becomes significantly higher. In this context, for clusters larger than 100 nodes, we highly recommend that you switch from a gp2 to an io1 volume for Rok etcd, since io1 volumes are designed for I/O-intensive workloads that are sensitive to storage consistency. This maintains a high quality of service and improves overall stability.
See also
- Follow Increase Rok etcd EBS Volume IOPS to learn how to convert your existing Rok etcd volume from gp2 to io1.
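The guide above is the authoritative procedure. As a rough sketch, the conversion boils down to an EBS volume modification; the volume ID and IOPS value below are placeholders:

```
# Convert the EBS volume that backs Rok etcd from gp2 to io1 and provision
# IOPS for it. Replace the volume ID and IOPS with values for your cluster.
aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --volume-type io1 \
    --iops 4000

# Track the modification until it reports "completed".
aws ec2 describe-volumes-modifications \
    --volume-ids vol-0123456789abcdef0
```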
Rok Readiness¶
When scaling your EKF cluster to hundreds of nodes, Rok Pods will be created automatically on all eligible nodes. For each eligible node that becomes ready, Rok Operator adds exactly one new member to the Rok cluster. Since multiple cluster nodes become ready simultaneously, many Rok Pods start their initialization process and join the cluster in parallel.
This is known to cause the readiness probe of existing Rok Pods to temporarily fail. Therefore, it is expected that you see Rok Pods transition from Ready to NotReady during scale up, until Rok and Rok CSI eventually become ready.
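To observe this behavior during a scale up, you can watch the Rok Pods; the namespace and label selector below are assumptions, so adjust them to match your installation:

```
# Watch Rok Pods across the cluster; expect some of them to flip between
# Ready (1/1) and NotReady (0/1) while new members join the Rok cluster.
kubectl get pods -n rok -l app=rok --watch
```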
Resource Quotas¶
Depending on your AWS billing account, different limits on cloud resources might apply for your cluster. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your account limits are above the sum of resources that you need to spawn new nodes.
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that you have exceeded the On-Demand Instance limits of your account.
In this case, contact AWS support and request an increase of the instance limits for your account. For more information on how to calculate the correct instance limits for your needs and request them from AWS, see the official user guide for On-Demand Instance limits.
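As an example, you can inspect and request such an increase with the Service Quotas API; the quota code below is the one for Running On-Demand Standard instances at the time of writing, and the desired value assumes 300 m5d.4xlarge nodes with 16 vCPUs each:

```
# Check the current vCPU quota for Running On-Demand Standard instances.
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A

# Request an increase to 4800 vCPUs (300 nodes x 16 vCPUs per node).
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 4800
```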
Cloud Provider Capacity¶
Depending on your region and availability zone, the capacity of your cloud provider might vary. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your cloud provider has enough available resources to spawn new Kubernetes nodes for your cluster.
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that AWS does not currently have enough capacity to fulfill your request.
In this case, you can either:
- Wait until AWS provisions additional capacity for your EKS cluster.
- Contact AWS support to serve your request.
- Add an additional node group to your cluster in a different Availability Zone (see the sketch below).
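For the last option, here is a sketch using eksctl; the cluster name, node group name, and zone are placeholders, and EKF node groups typically need extra configuration, so treat this only as a starting point:

```
# Create an additional node group in a different Availability Zone, so that
# Cluster Autoscaler can scale it up when the first zone runs out of capacity.
eksctl create nodegroup \
    --cluster my-ekf-cluster \
    --name workers-b \
    --node-type m5d.4xlarge \
    --nodes 0 \
    --nodes-min 0 \
    --nodes-max 100 \
    --node-zones us-east-1b
```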
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that the specified subnet does not have enough free IP addresses.
In this case, you need to free up enough addresses in the specified subnet, for example, by deleting EKS cluster node groups or EC2 instances that you no longer need.
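You can check how many free addresses a subnet has with the AWS CLI; the subnet ID below is a placeholder:

```
# Show the number of IP addresses still available in the node group's subnet.
aws ec2 describe-subnets \
    --subnet-ids subnet-0123456789abcdef0 \
    --query "Subnets[].AvailableIpAddressCount"
```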
Docker Hub Rate Limiting¶
As stated in the official rate limiting documentation, Docker Hub limits the number of Docker image pulls based on the account type of the user pulling the image. Pull rate limits are based on the individual IP address:
- For anonymous users, the rate limit is set to 100 pulls per 6 hours per IP address.
- For authenticated users, the rate limit is set to 200 pulls per 6 hours.
- For users with a paid Docker subscription no rate limit applies.
By default, Istio (the service mesh of EKF) specifies container images hosted on Docker Hub for its proxy sidecar. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that the pull limits of the Docker Hub account you are using cover your needs.
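To see where you stand against these limits, Docker documents a way to query the registry for its rate limit headers; a sketch using curl and jq:

```
# Obtain an anonymous pull token for the special rate-limit test repository.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
    | jq -r .token)

# Issue a HEAD request (which does not count against your limit) and inspect
# the ratelimit-limit and ratelimit-remaining response headers.
curl -s --head \
    -H "Authorization: Bearer $TOKEN" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
    | grep -i ratelimit
```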
When running workloads in an Istio-protected Kubernetes namespace on a multi-hundred-node EKF cluster, you might exceed the maximum number of allowed pulls, which results in Pods entering the ImagePullBackOff phase due to an error indicating that you have reached your pull rate limit.
In this case, you can either:
- Use a private container registry to host the Istio proxy container image and set up cluster-wide access to it (see the sketch below), or
- Upgrade to a paid Docker subscription so that you are not rate limited.
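As a sketch of the first option, you could mirror the Istio proxy image into a private registry; the registry URL is a placeholder, and the image tag depends on the Istio version your EKF release ships:

```
# Mirror the Istio proxy image to a private registry that all cluster nodes
# can pull from without hitting Docker Hub rate limits.
docker pull docker.io/istio/proxyv2:1.9.6
docker tag docker.io/istio/proxyv2:1.9.6 registry.example.com/istio/proxyv2:1.9.6
docker push registry.example.com/istio/proxyv2:1.9.6
```

After mirroring, configure your Istio installation to pull the proxy image from that registry instead of Docker Hub, and make sure every node in the cluster has pull access to it.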