Rok 1.2¶
This guide describes the necessary steps to upgrade an existing Rok cluster on Kubernetes from version 1.1 to the latest version, 1.4.4.
Check Kubernetes Version¶
Rok 1.2 only supports Kubernetes version 1.17 or 1.18. Follow the instructions below to verify the Kubernetes version of your cluster before continuing with the upgrade.
Check your cluster version by inspecting the value of Server Version in the output of the following command:
root@rok-tools:/# kubectl version --short
Client Version: v1.17.17
Server Version: v1.17.17-eks-c5067d
If your Server Version is v1.17.* or v1.18.*, you may proceed with the upgrade. If not, please follow the upgrade your cluster guide to upgrade your cluster to Kubernetes 1.17.
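The version gate above can be sketched as a small shell helper. This is a hypothetical convenience script, not part of rok-tools; the SERVER_VERSION value is the example from the output above, so substitute the Server Version that `kubectl version --short` reports for your own cluster:

```shell
# Hypothetical helper (not a Rok tool): succeed only for the Kubernetes
# versions that Rok 1.2 supports, i.e., v1.17.* and v1.18.*.
supported_k8s_version() {
    case "$1" in
        v1.17.*|v1.18.*) return 0 ;;
        *)               return 1 ;;
    esac
}

# Example value; replace with the Server Version reported by
# `kubectl version --short` on your cluster.
SERVER_VERSION="v1.17.17-eks-c5067d"

if supported_k8s_version "$SERVER_VERSION"; then
    echo "Kubernetes $SERVER_VERSION is supported, proceed with the upgrade"
else
    echo "Kubernetes $SERVER_VERSION is unsupported, upgrade the cluster first"
fi
```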
Upgrade your management environment¶
We assume that you have followed the Deploy Rok Components guide, and have
successfully set up a full-fledged rok-tools
management environment either
in local Docker or in Kubernetes.
Before proceeding with the core upgrade steps, you first need to upgrade your
management environment, so that the CLI tools and utilities you use, such as
rok-deploy, are compatible with the Rok version you are upgrading to.
Important
When you upgrade your management environment, all previous data (GitOps
repository, files, user settings, etc.) is preserved in either a Docker
volume or a Kubernetes PVC, depending on your environment. This volume or PVC
is mounted in the new rok-tools container, so that old data is adopted.
For Kubernetes simply apply the latest rok-tools
manifests:
$ kubectl apply -f <download_root>/rok-tools-eks.yaml
Note
In case you see the following error:
The StatefulSet "rok-tools" is invalid: spec: Forbidden: updates to
statefulset spec for fields other than 'replicas', 'template', and
'updateStrategy' are forbidden
make sure you first delete the existing rok-tools
StatefulSet with:
$ kubectl delete sts rok-tools
and then re-apply.
For Docker first delete the old container:
$ docker stop <OLD_ROK_TOOLS_CONTAINER_ID>
$ docker rm <OLD_ROK_TOOLS_CONTAINER_ID>
and then create a new one with previous data and the new image:
$ docker run -ti \
> --name rok-tools \
> --hostname rok-tools \
> -p 8080:8080 \
> --entrypoint /bin/bash \
> -v $(pwd)/rok-tools-data:/root \
> -v /var/run/docker.sock:/var/run/docker.sock \
> -w /root \
> gcr.io/arrikto/rok-tools:release-1.4-l0-release-1.4.4
Upgrade manifests¶
We assume that you have followed the Deploy Rok Components guide, and have a local GitOps repo with Arrikto-provided manifests. Once Arrikto releases a new Rok version and pushes updated deployment manifests, you have to follow the standard GitOps workflow:
- Fetch latest upstream changes, pushed by Arrikto.
- Rebase local changes on top of the latest upstream ones and resolve conflicts, if any.
- Tweak manifests based on Arrikto-provided instructions, if necessary.
- Commit everything.
- Re-apply manifests.
When you initially deploy Rok on Kubernetes, either automatically using
rok-deploy or manually, you end up with a deploy overlay in each Rok
component or external service that is to be applied to Kubernetes. In the GitOps
deployment repository, Arrikto provides manifests that include the deploy
overlay in each Kustomize app/package as a scaffold, so that users can quickly
get started and set their preferences.
As a result, fetch/rebase might lead to conflicts since both Arrikto and the end-user might modify the same files that are tracked by Git. In this scenario, the most common and obvious solution is to keep the user's changes since they are the ones that reflect the existing deployment.
In case of breaking changes, e.g., parts of YAML documents that are absolutely necessary to perform the upgrade, or others that might be deprecated, Arrikto will inform users via version-specific upgrade notes of all actions that need to be taken.
Note
It is the user's responsibility to apply valid manifests and kustomizations after a rebase. In case of uncertainty do not hesitate to coordinate with Arrikto's Tech Team for support.
We will use git to update the local manifests. You are about to rebase your work
on top of the latest pre-release branch. To favor local changes upon conflicts, we
will use the corresponding merge strategy option.
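To see what that strategy option does, here is a self-contained sketch in a throwaway repository (all file names, branch names, and paths are hypothetical). During a rebase, "theirs" refers to the commits being replayed, i.e., your local work, so with `-Xtheirs` conflicting hunks resolve in favor of your local changes:

```shell
# Demonstrate `git rebase -Xtheirs` in a temporary repo: a file modified both
# locally and upstream keeps the local version after the rebase.
set -e
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.email "tester@example.com"
git config user.name "tester"

echo "upstream-v1" > config.yaml
git add config.yaml
git commit -qm "base"
default="$(git rev-parse --abbrev-ref HEAD)"

git checkout -qb local
echo "local-change" > config.yaml
git commit -qam "local tweak"

git checkout -q "$default"
echo "upstream-v2" > config.yaml
git commit -qam "upstream update"

git checkout -q local
git rebase -Xtheirs "$default" >/dev/null 2>&1

cat config.yaml   # prints "local-change": the local edit wins the conflict
```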
Important
Make sure you mirror the GitOps repo to a private remote to be able to recover it in case of failure.
To upgrade the manifests:
Go to your GitOps repository, inside your rok-tools management environment:
root@rok-tools:/# cd ~/ops/deployments
Save the current branch:
root@rok-tools:~/ops/deployments# export OLD_BRANCH="$(git rev-parse --abbrev-ref HEAD)"
Fetch latest upstream changes:
root@rok-tools:~/ops/deployments# git fetch --all -p
Fetching origin
Ensure the release channel you are currently following is release-1.1:
root@rok-tools:~/ops/deployments# git rev-parse --abbrev-ref --symbolic-full-name @{u}
origin/release-1.1
If you are already following the release-1.2 release channel, you can skip this step. Otherwise, follow the Switch release channel section to update to the release-1.2 release channel.
Rebase on top of the latest pre-release version:
root@rok-tools:~/ops/deployments# git rebase -Xtheirs
Update Jupyter Web App Config for Kubeflow 1.3¶
In Kubeflow 1.3, Jupyter Web App's ConfigMap in the deploy overlay has changed, and a rebase will result in an invalid configuration. To upgrade the Jupyter Web App configuration:
Reset the configuration to the default upstream one:
root@rok-tools:~/ops/deployments# cp kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/ekf/patches/config-map.yaml kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
Commit your changes:
root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Reset Jupyter Web App config to 1.3 upstream"
View your previous changes, so that you can easily apply them again:
root@rok-tools:~/ops/deployments# git diff origin/release-1.1...$OLD_BRANCH -- kubeflow/manifests/jupyter/jupyter-web-app/overlays/deploy/patches/config-map.yaml
Edit the Jupyter Web App configuration and re-apply your old changes, as you saw them above:
root@rok-tools:~/ops/deployments# vim kubeflow/manifests/apps/jupyter/jupyter-web-app/upstream/overlays/deploy/patches/config-map.yaml
Important
In Kubeflow 1.3, the spawnerFormDefaults.image.readOnly field was renamed to spawnerFormDefaults.allowCustomImage. If you have changed the spawnerFormDefaults.image.readOnly field, make sure to modify spawnerFormDefaults.allowCustomImage accordingly.
Commit the new configuration:
root@rok-tools:~/ops/deployments# git commit -am "kubeflow: Update Jupyter Web App config"
Drain rok-csi nodes¶
To ensure minimal disruption of Rok services, follow the instructions below to drain the Rok CSI nodes and wait for any pending Rok CSI operations to complete before performing the upgrade.
During the upgrade, any pending Rok tasks will be canceled, so it is advisable to run the following steps during a period of inactivity, e.g., when no pipelines or snapshot policies are running. Since pausing/queuing everything is currently not an option, you can monitor the Rok logs and wait until nothing has been logged for, say, 30 seconds:
root@rok-tools:~/ops/deployments# kubectl -n rok logs -l app=rok-csi-controller -c csi-controller -f --tail=100
Note
Finding a period of inactivity is the ideal scenario and, depending on the deployment, may not be feasible, e.g., when tens of recurring pipelines are running. In such a case, the end user will simply see some of them fail.
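The "nothing logged for 30 seconds" check can be scripted instead of eyeballed. Below is a minimal sketch; the quiet_after helper is hypothetical (not a Rok tool) and simply reads a log stream until no new line has arrived within the given window:

```shell
# Hypothetical helper: consume a log stream on stdin and print a message once
# no new line has arrived for the given number of seconds (or the stream ends).
quiet_after() {
    local quiet_secs="$1"
    local line
    while IFS= read -r -t "$quiet_secs" line; do
        :   # a line arrived within the window; keep waiting
    done
    echo "no new log lines for ${quiet_secs}s"
}

# Usage against a live cluster, e.g. for the CSI controller logs:
#   kubectl -n rok logs -l app=rok-csi-controller -c csi-controller -f --tail=100 \
#       | quiet_after 30
```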
Scale down the rok-operator StatefulSet:
root@rok-tools:~/ops/deployments# kubectl -n rok-system scale sts rok-operator --replicas=0
Ensure rok-operator has scaled down to zero:
root@rok-tools:~/ops/deployments# kubectl get sts rok-operator -n rok-system
Scale down the rok-csi-controller StatefulSet:
root@rok-tools:~/ops/deployments# kubectl -n rok scale sts rok-csi-controller --replicas=0
Ensure rok-csi-controller has scaled down to zero:
root@rok-tools:~/ops/deployments# kubectl get sts rok-csi-controller -n rok
Watch the rok-csi-node logs and ensure that all pending operations have finished, i.e., nothing has been logged for the last 30 seconds:
root@rok-tools:~/ops/deployments# kubectl -n rok logs -l app=rok-csi-node -c csi-node -f --tail=100
Continue with the Upgrade components section.
Upgrade components¶
We assume that you are already running a 1.1 Rok cluster on Kubernetes and that you also have access to the 1.4.4 kustomization tree you are upgrading to.
Since a Rok cluster on Kubernetes consists of multiple components, you need to upgrade each one of them. Throughout the guide, we will keep track of these components, as listed in the table below:
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | ✔ | |
Rok kmod | ✔ |
During the upgrade, Rok Operator will remove all members from the cluster and add a dedicated one to perform the upgrade. The cluster will be scaled down to zero and a Kubernetes Job will run to upgrade the cluster config on etcd and run any needed migrations. Finally, the cluster will be scaled back up to its initial size.
1. Increase observability (optional)¶
To gain insight into the status of the cluster upgrade, execute the following commands in a separate window:
For live cluster status:
root@rok-tools:~/ops/deployments# watch kubectl get rokcluster -n rok
For live cluster events:
root@rok-tools:~/ops/deployments# watch 'kubectl describe rokcluster -n rok rok | tail -n 20'
2. Inspect current version (optional)¶
Get current images and version from the RokCluster CR:
root@rok-tools:~/ops/deployments# kubectl describe rokcluster rok -n rok
...
Spec:
Images:
Rok: gcr.io/arrikto-deploy/roke:l0-release-v1.1
Rok CSI: gcr.io/arrikto-deploy/rok-csi:l0-release-v1.1
Status:
Version: release-1.1-l0-release-1.1
3. Upgrade Rok Disk Manager¶
Apply the latest Rok Disk Manager manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-disk-manager/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | ✔ | |
4. Upgrade Rok kmod¶
Apply the latest Rok kmod manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-kmod/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | ✔ | |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
5. Upgrade Rok cluster¶
Apply the latest Rok cluster manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-cluster/overlays/deploy
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | ✔ | |
Rok Operator | ✔ | |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
6. Upgrade Rok Operator¶
Apply the latest Operator manifests:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/rok-operator/overlays/deploy
Note
The above command also updates the RokCluster CRD.
After the manifests have been applied, ensure Rok Operator has become ready by running the following command:
root@rok-tools:~/ops/deployments# watch kubectl get pods -n rok-system -l app=rok-operator
Component | old | new |
---|---|---|
RokCluster CR | | ✔ |
RokCluster CRD | | ✔ |
Rok Operator | | ✔ |
Rok Disk Manager | | ✔ |
Rok kmod | | ✔ |
7. Verify successful upgrade for Rok¶
Check the status of the cluster upgrade Job:
root@rok-tools:~/ops/deployments# kubectl get job -n rok rok-upgrade-release-1.4-l0-release-1.4.4
Ensure that Rok is up and running after the upgrade Job finishes:
root@rok-tools:~/ops/deployments# kubectl get rokcluster -n rok rok
Ensure all pods in the rok-system namespace are up and running:
root@rok-tools:~/ops/deployments# kubectl get pods -n rok-system
Ensure all pods in the rok namespace are up and running:
root@rok-tools:~/ops/deployments# kubectl get pods -n rok
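A quick way to check both namespaces is to parse the kubectl output. The pods_healthy helper below is a hypothetical sketch, assuming the default `kubectl get pods` column layout (NAME READY STATUS RESTARTS AGE):

```shell
# Hypothetical helper: read `kubectl get pods` output on stdin and report
# whether every pod's STATUS column is Running or Completed.
pods_healthy() {
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { bad++ }
         END {
             if (bad) { print bad " unhealthy pod(s)"; exit 1 }
             print "all pods healthy"
         }'
}

# Usage against the cluster:
#   kubectl get pods -n rok-system | pods_healthy
#   kubectl get pods -n rok | pods_healthy
```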
Upgrade NGINX Ingress Controller¶
This section describes how to upgrade the NGINX Ingress Controller. Run the following command to upgrade it:
root@rok-tools:~/ops/deployments# rok-deploy --apply rok/nginx-ingress-controller/overlays/deploy/
Upgrade Istio¶
Rok 1.4.4 uses Istio 1.9.5. To upgrade from Istio 1.5.7, follow these steps:
Delete the previous Istio control plane installation:
root@rok-tools:~/ops/deployments# rok-deploy --delete \
> rok/rok-external-services/istio/istio-1-5-7/istio-install-1-5-7/overlays/deploy \
> rok/rok-external-services/istio/istio-1-5-7/cluster-local-gateway-1-5-7/overlays/deploy
Apply the new Istio control plane:
root@rok-tools:~/ops/deployments# rok-deploy --apply \
> rok/rok-external-services/istio/istio-1-9/istio-crds/overlays/deploy \
> rok/rok-external-services/istio/istio-1-9/istio-namespace/overlays/deploy \
> rok/rok-external-services/istio/istio-1-9/istio-install/overlays/deploy \
> rok/rok-external-services/istio/istio-1-9/cluster-local-gateway/overlays/deploy
Delete deprecated resources:
root@rok-tools:~/ops/deployments# rok-kf-prune --app istio
Confirm that the knative-serving and kubeflow namespaces, as well as all of the Kubeflow user namespaces (namespaces that start with kubeflow-), have Istio sidecar injection enabled. To do this, run the following command and confirm that these namespaces show up in its output:
root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled
NAME              STATUS   AGE
knative-serving   Active   5d16h
kubeflow          Active   5d16h
kubeflow-user     Active   5d16h
...
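The check above can also be scripted. The injection_ns_ok helper below is a hypothetical sketch that reads the `kubectl get ns` output and verifies that knative-serving, kubeflow, and at least one kubeflow-* user namespace are labeled for injection:

```shell
# Hypothetical check: read `kubectl get ns -l istio-injection=enabled` output
# on stdin and verify that knative-serving, kubeflow, and at least one
# kubeflow-* user namespace are present.
injection_ns_ok() {
    awk 'NR > 1 {
             seen[$1] = 1
             if ($1 ~ /^kubeflow-/) users++
         }
         END {
             if (!seen["knative-serving"] || !seen["kubeflow"] || !users) {
                 print "sidecar injection is not enabled everywhere it should be"
                 exit 1
             }
             print "required namespaces have injection enabled"
         }'
}

# Usage:
#   kubectl get ns -l istio-injection=enabled | injection_ns_ok
```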
Upgrade the Istio sidecars by deleting all Pods in the namespaces you found above. Istio will inject the new sidecar version once the owning controllers recreate the deleted Pods:
root@rok-tools:~/ops/deployments# kubectl get ns -l istio-injection=enabled --no-headers | \
> awk '{print $1}' | \
> xargs -n1 -I {} kubectl delete pod --all -n {}
Follow the Expose Istio guide from scratch to reconfigure and re-apply the necessary resources. Choose based on your cloud provider and the load balancer type you use.
Restart the AuthService:
root@rok-tools:~/ops/deployments# kubectl rollout restart statefulset -n istio-system authservice
Important
At this point, AuthService will not be able to talk to Dex due to a missing AuthorizationPolicy. The Pod will not become ready until you upgrade Kubeflow.
Upgrade Kubeflow manifests¶
Important
Kubeflow 1.3 includes a new version of Katib that is not backwards-compatible with previous Kubeflow versions. This means that you will lose all Experiment, Suggestion and Trial CRs. If there are hyperparameter tuning jobs in-progress, they will be deleted.
This section describes how to upgrade Kubeflow. If you have not deployed Kubeflow in your cluster, you can safely skip this section.
Run the following command to update your Kubeflow installation:
root@rok-tools:~/ops/deployments# rok-deploy --apply install/kubeflow --force --force-kinds CustomResourceDefinition Deployment StatefulSet
Restart Kubeflow Conversion Webhooks¶
During the upgrade, we update the KFServing and KNative CRs using conversion webhooks. Restart the corresponding Pods to allow Kubernetes to re-establish the connection with these webhooks.
Delete the KFserving webhook Pod:
root@rok-tools:~/ops/deployments# kubectl delete pods \
> -n kubeflow kfserving-controller-manager-0
Delete the KNative webhook Pod:
root@rok-tools:~/ops/deployments# kubectl delete pods \
> -n knative-serving -l role=webhook
Restart Kubeflow Admission Webhook¶
During the upgrade, we regenerate the Certificate for the admission webhook in order to change the Issuer. Restart the admission-webhook deployment so that it uses the new Certificate:
root@rok-tools:~/ops/deployments# kubectl rollout restart \
> -n kubeflow deploy admission-webhook-deployment
Delete stale Kubeflow resources¶
Run the following command to remove the deprecated resources left by the previous version of Kubeflow:
root@rok-tools:~/ops/deployments# rok-kf-prune --app kubeflow
Upgrade Notebooks for Kubeflow 1.3¶
Restart all Notebooks with access to Kubeflow Pipelines, in order to inject the new authentication token needed by the latest version of Kubeflow Pipelines:
root@rok-tools:~/ops/deployments# kubectl delete pods -l 'access-ml-pipeline=true, notebook-name' --all-namespaces
Verify successful upgrade¶
Follow the Test Kubeflow section to validate the updated Rok + EKF deployment.