This procedure describes how to use RKE to restore a snapshot of the Rancher Kubernetes cluster. The cluster snapshot will include Kubernetes configuration and the Rancher database and state.
Additionally, as of RKE v0.2.0 the pki.bundle.tar.gz file is no longer required, because the way the Kubernetes cluster state is stored has changed.
You will need RKE and kubectl CLI utilities installed.
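If you want to confirm both utilities are available on the workstation you will run the restore from, a quick check (assuming both binaries are already on your PATH) looks like this:

```
# Verify the RKE and kubectl CLI utilities are installed
rke --version
kubectl version --client
```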
Prepare by creating 3 new nodes to be the target for the restored Rancher instance. See HA Install for node requirements.
We recommend that you start with fresh nodes and a clean state. Alternatively, you can clear Kubernetes and Rancher configurations from the existing nodes; this will destroy the data on those nodes. See Node Cleanup for the procedure.
IMPORTANT: Before starting the restore make sure all the Kubernetes services on the old cluster nodes are stopped. We recommend powering off the nodes to be sure.
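For example, if the old nodes are still reachable over SSH, one way to power them off (this sketch assumes the ubuntu user from the examples in this procedure has sudo access, and uses the old node addresses shown later in the kubectl output) is:

```
# Power off each old cluster node so its Kubernetes services are guaranteed to be stopped
ssh ubuntu@18.217.82.189 "sudo poweroff"
ssh ubuntu@18.222.22.56  "sudo poweroff"
ssh ubuntu@18.191.222.99 "sudo poweroff"
```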
The snapshot used to restore your etcd cluster is handled differently based on your version of RKE.
As of RKE v0.2.0, snapshots can be saved in an S3 compatible backend. To restore your cluster from a snapshot stored in an S3 compatible backend, you can skip this step and retrieve the snapshot in Step 4: Restore Database. Otherwise, you will need to place the snapshot directly on the target node.
RKE v0.2.0+: Pick one of the clean nodes. That node will be the “target node” for the initial restore. Place your snapshot file (<snapshot>.db) in the /opt/rke/etcd-snapshots directory on the target node.

RKE v0.1.x: When you take a snapshot, RKE saves a backup of the certificates, i.e. a file named pki.bundle.tar.gz, in the same location. The snapshot and PKI bundle file are both required for the restore process and are expected to be in the same location. Pick one of the clean nodes as the “target node” and place the snapshot and PKI certificate bundle files in the /opt/rke/etcd-snapshots directory on that node.
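As an example, the snapshot (and, for RKE v0.1.x, the PKI bundle) can be copied to the target node with scp. This is only a sketch; the <snapshot>.db placeholder, the ubuntu user, and the target node address are taken from the examples in this procedure:

```
# Copy the snapshot to the target node, then move it into the snapshot directory
scp <snapshot>.db ubuntu@52.15.238.179:/tmp/
ssh ubuntu@52.15.238.179 "sudo mkdir -p /opt/rke/etcd-snapshots && sudo mv /tmp/<snapshot>.db /opt/rke/etcd-snapshots/"
```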
Make a copy of your original rancher-cluster.yml file.
```
cp rancher-cluster.yml rancher-cluster-restore.yml
```
Modify the copy and make the following changes: remove or comment out the entire addons: section (the Rancher deployment and its supporting configuration are already in the etcd database), and in the nodes: section comment out all nodes except the single target node.

Example rancher-cluster-restore.yml:
```
nodes:
- address: 52.15.238.179  # New Target Node
  user: ubuntu
  role: [ etcd, controlplane, worker ]
# - address: 52.15.23.24
#   user: ubuntu
#   role: [ etcd, controlplane, worker ]
# - address: 52.15.238.133
#   user: ubuntu
#   role: [ etcd, controlplane, worker ]

# addons: |-
#   ---
#   kind: Namespace
#   apiVersion: v1
#   metadata:
#     name: cattle-system
#   ---
...
```
Use RKE with the new rancher-cluster-restore.yml configuration and restore the database to the single “target node”.
RKE will create an etcd container with the restored database on the target node. This container will not complete the etcd initialization and will stay in a running state until the cluster is brought up in the next step.
When restoring etcd from a local snapshot, the snapshot is assumed to be located on the target node in the directory /opt/rke/etcd-snapshots.
Note: For RKE v0.1.x, the pki.bundle.tar.gz file is also expected to be in the same location.
```
rke etcd snapshot-restore --name <snapshot>.db --config ./rancher-cluster-restore.yml
```
Restoring from an S3 backend is available as of RKE v0.2.0. When restoring etcd from a snapshot located in an S3 compatible backend, the command needs the S3 information in order to connect to the S3 backend and retrieve the snapshot.
Note: Ensure your cluster.rkestate is present before starting the restore, as this contains your certificate data for the cluster.
```
$ rke etcd snapshot-restore --config cluster.yml --name snapshot-name \
  --s3 --access-key S3_ACCESS_KEY --secret-key S3_SECRET_KEY \
  --bucket-name s3-bucket-name --s3-endpoint s3.amazonaws.com \
  --folder folder-name # Available as of v2.3.0
```
The following options are available for the rke etcd snapshot-restore command. S3 specific options are only available for RKE v0.2.0+.

| Option | Description | S3 Specific |
|--------|-------------|-------------|
| --name | Name of the snapshot to restore | |
| --config | Path to the cluster configuration file | |
| --s3 | Retrieve the snapshot from an S3 compatible backend | * |
| --s3-endpoint | S3 endpoint URL | * |
| --access-key | S3 access key | * |
| --secret-key | S3 secret key | * |
| --bucket-name | S3 bucket name | * |
| --folder | Folder inside the S3 bucket where the snapshot is stored (available as of v2.3.0) | * |
| --region | S3 region | * |
| --ssh-agent-auth | Use the SSH agent for authentication | |
| --ignore-docker-version | Disable the Docker version check | |
Use RKE and bring up the cluster on the single “target node.”
Note: For users running RKE v0.2.0+, ensure your cluster.rkestate is present before starting the restore, as this contains your certificate data for the cluster.
```
rke up --config ./rancher-cluster-restore.yml
```
Once RKE completes, it will have created a credentials file in the local directory. Configure kubectl to use the kube_config_rancher-cluster-restore.yml credentials file and check on the state of the cluster. See Installing and Configuring kubectl for details.
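One way to point kubectl at the new cluster, assuming you run it from the directory containing the generated credentials file:

```
# Use the credentials file generated by RKE for this shell session
export KUBECONFIG=$(pwd)/kube_config_rancher-cluster-restore.yml
kubectl get nodes
```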
Your new cluster will take a few minutes to stabilize. Once you see the new “target node” transition to Ready and the three old nodes show NotReady, you are ready to continue.
```
kubectl get nodes

NAME            STATUS     ROLES                      AGE   VERSION
52.15.238.179   Ready      controlplane,etcd,worker   1m    v1.10.5
18.217.82.189   NotReady   controlplane,etcd,worker   16d   v1.10.5
18.222.22.56    NotReady   controlplane,etcd,worker   16d   v1.10.5
18.191.222.99   NotReady   controlplane,etcd,worker   16d   v1.10.5
```
Use kubectl to delete the old nodes from the cluster.
```
kubectl delete node 18.217.82.189 18.222.22.56 18.191.222.99
```
Reboot the target node to ensure the cluster networking and services are in a clean state before continuing.
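For example, assuming SSH access with the ubuntu user used elsewhere in this procedure:

```
# Reboot the target node to reset cluster networking and services
ssh ubuntu@52.15.238.179 "sudo reboot"
```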
Wait for the pods running in the kube-system and ingress-nginx namespaces, and the rancher pod in cattle-system, to return to the Running state.
Note: cattle-cluster-agent and cattle-node-agent pods will be in an Error or CrashLoopBackOff state until Rancher server is up and the DNS/Load Balancer have been pointed at the new cluster.
```
kubectl get pods --all-namespaces

NAMESPACE       NAME                                    READY   STATUS    RESTARTS   AGE
cattle-system   cattle-cluster-agent-766585f6b-kj88m    0/1     Error     6          4m
cattle-system   cattle-node-agent-wvhqm                 0/1     Error     8          8m
cattle-system   rancher-78947c8548-jzlsr                0/1     Running   1          4m
ingress-nginx   default-http-backend-797c5bc547-f5ztd   1/1     Running   1          4m
ingress-nginx   nginx-ingress-controller-ljvkf          1/1     Running   1          8m
kube-system     canal-4pf9v                             3/3     Running   3          8m
kube-system     cert-manager-6b47fc5fc-jnrl5            1/1     Running   1          4m
kube-system     kube-dns-7588d5b5f5-kgskt               3/3     Running   3          4m
kube-system     kube-dns-autoscaler-5db9bbb766-s698d    1/1     Running   1          4m
kube-system     metrics-server-97bc649d5-6w7zc          1/1     Running   1          4m
kube-system     tiller-deploy-56c4cf647b-j4whh          1/1     Running   1          4m
```
Edit the rancher-cluster-restore.yml RKE config file and uncomment the additional nodes.
```
nodes:
- address: 52.15.238.179  # New Target Node
  user: ubuntu
  role: [ etcd, controlplane, worker ]
- address: 52.15.23.24
  user: ubuntu
  role: [ etcd, controlplane, worker ]
- address: 52.15.238.133
  user: ubuntu
  role: [ etcd, controlplane, worker ]

# addons: |-
#   ---
#   kind: Namespace
...
```
Run RKE and add the nodes to the new cluster.
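This is the same rke up invocation as before, pointed at the updated configuration:

```
rke up --config ./rancher-cluster-restore.yml
```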
Rancher should now be running and available to manage your Kubernetes clusters. Review the recommended architecture for HA installations and update the endpoints for the Rancher DNS entry or Load Balancer that you built during Step 1 of the HA install (1. Create Nodes and Load Balancer) to target the new cluster. Once the endpoints are updated, the agents on your managed clusters should automatically reconnect. This may take 10 to 15 minutes due to reconnect backoff timeouts.
IMPORTANT: Remember to save your new RKE config (rancher-cluster-restore.yml) and kubectl credentials (kube_config_rancher-cluster-restore.yml) files in a safe place for future maintenance.