Kiwi
Currently, Kubernetes assigns GPUs exclusively to Pods. This is especially inefficient in interactive scenarios, such as development on a Jupyter notebook server, in which a Pod, or application, has long idle periods.
Kiwi is a GPU sharing mechanism that enables multiple containers (belonging to the same or different Pods) to run on the same GPU concurrently, each one having the whole GPU memory available for use. It achieves this by transparently paging out the GPU memory of idle processes, using the system RAM as swap space.
Attention
This feature is a Tech Preview, so it is not fully supported by Arrikto and may not be functionally complete. While it is not intended for production use, we encourage you to try it out and provide us feedback.
Read our Tech Preview Policy for more information.
Also, check out the Kiwi User Guide to find out more on how to use vGPUs.
Schedule Arrikto vGPUs on Kubernetes
The Kiwi Device Plugin advertises multiple Arrikto virtual GPUs (vGPUs) per
Kiwi-enabled GPU it manages. As such, the Kubernetes scheduler can now assign
multiple Pods (that request an arrikto.com/gpu device) to the same physical
GPU. Here’s a simple visualization:
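To complement the visualization, the request side can be sketched in code. The following is a minimal sketch, assuming a Pod asks for a vGPU through the standard Kubernetes extended-resource syntax under resources.limits. Only the resource name arrikto.com/gpu comes from this page. The function name, Pod name, and image are hypothetical placeholders.

```python
import json


def vgpu_pod_manifest(name, image, vgpus=1):
    """Build a Pod manifest whose single container requests Arrikto vGPUs.

    Assumption: vGPUs are requested like any other Kubernetes extended
    resource, as an integer count under resources.limits.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "main",
                    "image": image,  # hypothetical image name
                    "resources": {"limits": {"arrikto.com/gpu": vgpus}},
                }
            ]
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    manifest = vgpu_pod_manifest("notebook-a", "example.com/notebook:latest")
    print(json.dumps(manifest, indent=2))
```

Two Pods built this way each request one arrikto.com/gpu device, so the Kubernetes scheduler is free to place both on the same physical GPU.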
Kiwi Scheduler
Important
Each instance of the Kiwi Scheduler manages a single Kiwi-enabled NVIDIA GPU. When we refer to it in the singular, we simply refer to an arbitrary instance of it.
When the combined GPU memory usage of the collocated applications fits in GPU memory, they can run in parallel without any intervention.
However, when the combined memory usage exceeds the total GPU memory, Kiwi must enforce serialization of GPU work among the applications in order to avoid thrashing. Thrashing is a situation in which time spent handling page faults overwhelms time spent doing useful computations.
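The two cases above come down to one comparison. The sketch below is an illustrative simplification, not Kiwi's actual logic: it treats each application's working set as a single number in GiB and ignores any memory the driver reserves. The function names and the sizes are hypothetical.

```python
def fits_in_gpu_memory(working_sets_gib, gpu_memory_gib):
    """Return True when the collocated working sets fit in GPU memory together."""
    return sum(working_sets_gib) <= gpu_memory_gib


def needs_serialization(working_sets_gib, gpu_memory_gib):
    """Return True when GPU work must be serialized to avoid thrashing."""
    return not fits_in_gpu_memory(working_sets_gib, gpu_memory_gib)


if __name__ == "__main__":
    # Two notebooks using 6 GiB and 8 GiB on a 16 GiB GPU can run in parallel.
    print(needs_serialization([6, 8], 16))
    # At 10 GiB and 8 GiB they exceed GPU memory and must take turns.
    print(needs_serialization([10, 8], 16))
```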
Kiwi offers an anti-thrashing mechanism via the Kiwi Scheduler. The Kiwi
Scheduler assigns exclusive usage of the whole GPU to a single application at a
time, rotating between competing applications in a round-robin fashion. Each
application can use the GPU for a time quantum (TQ seconds).
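The rotation can be illustrated with a small model. This is a sketch of generic round-robin scheduling under simplifying assumptions, not the Kiwi Scheduler's implementation: all applications request the GPU at time 0, switching between them costs nothing, and an application whose remaining GPU burst is shorter than TQ releases the GPU early. The function name and the burst lengths are hypothetical.

```python
from collections import deque


def round_robin(bursts, tq):
    """Return (app, start, end) slices of exclusive GPU access.

    bursts maps each application name to the length of its GPU burst in
    seconds; tq is the time quantum in seconds.
    """
    queue = deque(bursts.items())
    now = 0
    slices = []
    while queue:
        app, remaining = queue.popleft()
        run = min(remaining, tq)  # never hold the GPU longer than one quantum
        slices.append((app, now, now + run))
        now += run
        if remaining > run:
            # Unfinished work goes to the back of the queue.
            queue.append((app, remaining - run))
    return slices


if __name__ == "__main__":
    for app, start, end in round_robin({"A": 50, "B": 20}, tq=30):
        print(f"{app}: {start}s to {end}s")
```

With a 30-second quantum, A runs from 0 to 30, B from 30 to 50, and A again from 50 to 70.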
Note
The Kiwi Scheduler is not related to the Kubernetes scheduler in any way.
The Kiwi Scheduler manages one Kiwi-enabled physical NVIDIA GPU within a
single node. It “schedules” exclusive access to the GPU for each time
quantum (TQ).
Note
You can configure the Kiwi Scheduler’s TQ. See the related
Operations guide.
Important
By default, the Kiwi Scheduler is enabled, meaning that anti-thrashing is enabled. If you disable it without ensuring that the working sets (GPU memory) of collocated applications fit in GPU memory, you can cause thrashing and, hence, severe performance degradation.
See how you can enable or disable the Kiwi Scheduler.
Example Timeline of Kiwi Applications
Let’s examine a graph that shows the execution timeline of two different
applications using vGPUs and running on the same physical GPU. We start
examining their behavior at an arbitrary point in time, T0. Let’s assume
both applications are Jupyter notebooks on which an ML engineer
is experimenting.
Note
We assume that the Kiwi Scheduler is enabled. Therefore, when GPU bursts overlap, the scheduler serializes work on the physical GPU, giving exclusive access to one application at a time.
Let’s examine what happens at each point in time:
T0-T1:
- Application A is doing CPU work, for example data preprocessing.
- Application B is idle. The developer might be tweaking their code or taking a break.
T1:
At point T1, application A starts running a cell that does GPU computations. It requests the GPU from the Kiwi Scheduler, and since no other application is using it at the moment, the scheduler immediately grants it access for TQ seconds.
T1-T2:
- Application A runs GPU code.
- Application B runs CPU code.
T2:
Application B wants to run GPU code, so it requests access from the Kiwi Scheduler. However, the scheduler has already given access to A, so B has to wait for the TQ to elapse or for application A to release the GPU early if its GPU burst is shorter than TQ seconds.
T2-T3:
- Application A runs GPU code.
- Application B waits for the GPU.
T3 (T1 + TQ):
The time quantum (TQ) elapses.
- Application A relinquishes the GPU. Since it still wants to run GPU work, it requests the GPU again from the Kiwi Scheduler and enters a waiting state.
- The Kiwi Scheduler gives access to application B.
T3-T4:
- Application A waits for the GPU.
- Application B runs GPU code.
T4 (< T3 + TQ):
- Application B no longer needs to run GPU work. Since it did not need the whole TQ to finish its GPU burst, it relinquishes the GPU early.
- The Kiwi Scheduler gives exclusive access to application A once more.
T > T4:
There are no more overlapping GPU bursts, so the applications do not have to wait for access to the GPU. They run both their CPU and GPU parts unhindered.
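To make the timeline concrete, the sketch below replays it with illustrative numbers: TQ = 30 seconds, application A requests the GPU at T1 = 10 for a 50-second burst, and application B requests it at T2 = 20 for a 10-second burst. All of these values, and the simulate function itself, are assumptions chosen for illustration. The model ignores the cost of paging GPU memory in and out and is not the Kiwi Scheduler's implementation.

```python
from collections import deque


def simulate(requests, tq):
    """Return the (app, start, end) GPU grants for a list of GPU bursts.

    requests holds (app, request_time, burst_seconds) tuples.
    """
    arrivals = deque(sorted(requests, key=lambda r: r[1]))
    waiting = deque()  # FIFO queue of [app, remaining_burst]
    grants = []
    now = 0

    def admit(until):
        # Queue every request issued up to and including time `until`.
        while arrivals and arrivals[0][1] <= until:
            app, _, burst = arrivals.popleft()
            waiting.append([app, burst])

    while arrivals or waiting:
        if not waiting:
            now = max(now, arrivals[0][1])  # GPU is free: jump to next request
        admit(now)
        app, remaining = waiting.popleft()
        end = now + min(remaining, tq)  # release early if the burst is short
        admit(end)  # requests made during this slice queue up first
        grants.append((app, now, end))
        if end - now < remaining:
            # The holder still has GPU work, so it requests the GPU again.
            waiting.append([app, remaining - (end - now)])
        now = end
    return grants


if __name__ == "__main__":
    for app, start, end in simulate([("A", 10, 50), ("B", 20, 10)], tq=30):
        print(f"{app} holds the GPU from {start}s to {end}s")
```

In this run, A holds the GPU from 10 to 40 (T1 to T3 = T1 + TQ), B from 40 to 50 (T3 to T4, which is earlier than T3 + TQ = 70), and A again from 50 to 70.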