Scaling¶
Running EKF on hundreds of nodes enables your team to efficiently run distributed ML workflows at scale. However, scaling an EKF cluster to hundreds of nodes comes with multiple challenges, such as optimization of resource usage, elimination of bottlenecks and parallelization of operations, where possible.
This document provides useful information, metrics, and conclusions about scaling an EKF cluster to hundreds of nodes on AWS.
Overview¶
Upper Bound¶
At the time of writing, we have tested and support scaling an EKF cluster up to 300 nodes using m5d.4xlarge instances on EKS. Scaling EKF to more than 300 nodes is a work in progress and is not officially supported.
Contact Arrikto
Scaling a cluster to more than 100 nodes can pose a number of challenges. Before you attempt to scale your cluster to more than 100 nodes, please contact the Arrikto Tech Team so we can provide all required information and support.
Below you can view a chart that shows the time needed to scale an EKF cluster to 100, 200, and 300 nodes. For our measurements, in addition to Rok, we deployed the following:
- A lightweight DaemonSet whose Pods become ready almost instantly after they start their execution. We use this as a baseline to perform a side-by-side comparison with the time Rok and Rok CSI need to become ready, since these also run as DaemonSets.
- A Jupyter Kale Deployment with hundreds of replicas that all mount and write to a single RWX Rok volume. We use this as the main workload that triggers a cluster scale up: each Jupyter Kale replica requests enough resources so that Cluster Autoscaler adds a new node for it (see the sketch below).
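For illustration, here is a minimal sketch of such a scale-up workload, assuming a pre-existing RWX PVC; the names, image, and resource requests below are placeholders, sized so that one replica roughly fills one m5d.4xlarge node (16 vCPUs, 64 GiB RAM):

```
# Hypothetical scale-up workload: each replica requests most of a node, so
# Cluster Autoscaler must add one node per replica. All replicas share a
# single RWX PVC, named "shared-rwx" here for illustration.
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scale-test
spec:
  replicas: 300
  selector:
    matchLabels:
      app: scale-test
  template:
    metadata:
      labels:
        app: scale-test
    spec:
      containers:
      - name: worker
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        resources:
          requests:
            cpu: "12"
            memory: 48Gi
        volumeMounts:
        - name: shared
          mountPath: /shared
      volumes:
      - name: shared
        persistentVolumeClaim:
          claimName: shared-rwx
EOF
```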
Note
Depending on your cloud environment, networking setup, container registry, and type of workload, you might observe different times than the ones shown above. For example, pulling a large container image for your workload from a distant container registry will add extra overhead.
From the chart above we draw the following conclusions:
- Baseline becomes ready soon after all EKS cluster nodes become ready.
- Rok needs between 2 and 10 additional minutes to become ready, depending on cluster size.
- Rok CSI becomes ready less than 5 minutes after Rok is ready. This is expected, since Rok CSI depends on Rok.
- The workload becomes ready approximately 5 minutes after Rok CSI. Again, this is expected, since Rok CSI needs to mount the RWX Rok volume on each of the nodes.
- The workload becomes ready approximately 7 minutes after baseline for 100 nodes, 13 minutes after baseline for 200 nodes and 16 minutes after baseline for 300 nodes. This represents the initialization overhead due to all workload replicas using a single Rok RWX volume.
Challenges¶
Scaling to hundreds of nodes comes with multiple challenges, such as optimization of resource usage, elimination of bottlenecks and parallelization of operations. Below are some of the most important things that you should be aware of.
Rok etcd Volume Type¶
Inevitably, on a large-scale EKF cluster that runs hundreds of Rok instances, the traffic from and to Rok etcd becomes significantly higher. In this context, for clusters larger than 100 nodes, we highly recommend that you switch from a gp2 to an io1 volume for Rok etcd, since io1 volumes are designed for I/O-intensive workloads that are sensitive to storage consistency. This maintains a high quality of service and improves overall stability.
See also
- Follow Increase Rok etcd EBS Volume IOPS to learn how to convert your existing Rok etcd volume from gp2 to io1.
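The guide above is the authoritative procedure. As a rough sketch, the conversion boils down to an EBS volume modification; the volume ID and IOPS value below are placeholders:

```
# Convert the EBS volume that backs Rok etcd from gp2 to io1 and provision
# IOPS for it. Replace the volume ID and IOPS with values for your cluster.
aws ec2 modify-volume \
    --volume-id vol-0123456789abcdef0 \
    --volume-type io1 \
    --iops 4000

# Track the modification until it reports "completed".
aws ec2 describe-volumes-modifications \
    --volume-ids vol-0123456789abcdef0
```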
Rok Readiness¶
When scaling your EKF cluster to hundreds of nodes, Rok Pods will be created automatically on all eligible nodes. For each eligible node that becomes ready, Rok Operator adds exactly one new member to the Rok cluster. Since multiple cluster nodes become ready simultaneously, many Rok Pods start their initialization process and join the cluster in parallel.
This is known to cause the readiness probe of existing Rok Pods to temporarily fail. Therefore, it is expected that you see Rok Pods transition from Ready to NotReady during scale up, until Rok and Rok CSI eventually become ready.
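To observe this behavior during a scale up, you can watch the Rok Pods; the namespace and label selector below are assumptions, so adjust them to match your installation:

```
# Watch Rok Pods across the cluster; expect some of them to flip between
# Ready (1/1) and NotReady (0/1) while new members join the Rok cluster.
kubectl get pods -n rok -l app=rok --watch
```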
Resource Quotas¶
Depending on your AWS billing account, different limits on cloud resources might apply for your cluster. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your account limits are above the sum of resources that you need to spawn new nodes.
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that you have exceeded the On-Demand Instance limits of your account.
In this case, contact AWS support and request an increase of the instance limits for your account. For more information on how to calculate the correct instance limits for your needs and request them from AWS, see the official user guide for On-Demand Instance limits.
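As an example, you can inspect and request such an increase with the Service Quotas API; the quota code below is the one for Running On-Demand Standard instances at the time of writing, and the desired value assumes 300 m5d.4xlarge nodes with 16 vCPUs each:

```
# Check the current vCPU quota for Running On-Demand Standard instances.
aws service-quotas get-service-quota \
    --service-code ec2 \
    --quota-code L-1216C47A

# Request an increase to 4800 vCPUs (300 nodes x 16 vCPUs per node).
aws service-quotas request-service-quota-increase \
    --service-code ec2 \
    --quota-code L-1216C47A \
    --desired-value 4800
```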
Cloud Provider Capacity¶
Depending on your region and availability zone, the capacity of your cloud provider might vary. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that your cloud provider has enough available resources to spawn new Kubernetes nodes for your cluster.
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that AWS does not currently have enough capacity to fulfill your request.
In this case, you can either:
- Wait until AWS provisions additional capacity for your EKS cluster.
- Contact AWS support to serve your request.
- Add an additional node group to your cluster in a different Availability Zone (see the sketch below).
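For the last option, here is a sketch using eksctl; the cluster name, node group name, and zone are placeholders, and EKF node groups typically need extra configuration, so treat this only as a starting point:

```
# Create an additional node group in a different Availability Zone, so that
# Cluster Autoscaler can scale it up when the first zone runs out of capacity.
eksctl create nodegroup \
    --cluster my-ekf-cluster \
    --name workers-b \
    --node-type m5d.4xlarge \
    --nodes 0 \
    --nodes-min 0 \
    --nodes-max 100 \
    --node-zones us-east-1b
```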
After requesting to scale your EKF cluster to hundreds of nodes on AWS, your node group might end up in a Degraded state due to an error indicating that the specified subnet does not have enough free IP addresses.
In this case, you need to free up enough addresses in the specified subnet, for example, by deleting EKS cluster node groups or EC2 instances that you no longer need.
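You can check how many free addresses a subnet has with the AWS CLI; the subnet ID below is a placeholder:

```
# Show the number of IP addresses still available in the node group's subnet.
aws ec2 describe-subnets \
    --subnet-ids subnet-0123456789abcdef0 \
    --query "Subnets[].AvailableIpAddressCount"
```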
Docker Hub Rate Limiting¶
As stated in the official rate limiting documentation, Docker Hub limits the number of Docker image pulls based on the account type of the user pulling the image. Pull rate limits are based on the individual IP address:
- For anonymous users, the rate limit is set to 100 pulls per 6 hours per IP address.
- For authenticated users, the rate limit is set to 200 pulls per 6 hours.
- For users with a paid Docker subscription no rate limit applies.
By default, Istio (the service mesh of EKF) specifies container images hosted on Docker Hub for its proxy sidecar. Therefore, before scaling up your EKF cluster to hundreds of nodes, you need to ensure that the pull limits of the Docker Hub account you are using cover your needs.
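To see where you stand against these limits, Docker documents a way to query the registry for its rate limit headers; a sketch using curl and jq:

```
# Obtain an anonymous pull token for the special rate-limit test repository.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" \
    | jq -r .token)

# Issue a HEAD request (which does not count against your limit) and inspect
# the ratelimit-limit and ratelimit-remaining response headers.
curl -s --head \
    -H "Authorization: Bearer $TOKEN" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
    | grep -i ratelimit
```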
When running workloads in an Istio-protected Kubernetes namespace on a multi-hundred-node EKF cluster, you might exceed the maximum number of allowed pulls, which results in Pods entering the ImagePullBackOff phase due to an error indicating that you have reached your pull rate limit.
In this case, you can either:
- Use a private container registry to host the Istio proxy container image and set up cluster-wide access to it (see the sketch below), or
- Upgrade to a paid Docker subscription so that you are not rate limited.
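As a sketch of the first option, you could mirror the Istio proxy image into a private registry; the registry URL is a placeholder, and the image tag depends on the Istio version your EKF release ships:

```
# Mirror the Istio proxy image to a private registry that all cluster nodes
# can pull from without hitting Docker Hub rate limits.
docker pull docker.io/istio/proxyv2:1.9.6
docker tag docker.io/istio/proxyv2:1.9.6 registry.example.com/istio/proxyv2:1.9.6
docker push registry.example.com/istio/proxyv2:1.9.6
```

After mirroring, configure your Istio installation to pull the proxy image from that registry instead of Docker Hub, and make sure every node in the cluster has pull access to it.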