GKE to AWS - msg="unable to successfully complete restic restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" #1993

Closed
prassoma opened this issue Oct 24, 2019 · 7 comments
Labels: Restic (Relates to the restic integration)

Comments


prassoma commented Oct 24, 2019

What steps did you take and what happened:
I was trying to restore a Velero restic backup taken from a GKE cluster to an AWS (kops) cluster.
On the target AWS cluster, the backup location points to the GCP bucket where the backup was taken.
Once I initiate the restore from the AWS K8s cluster, it starts the restoration, and it even restores the namespace, Deployments, Pods, Persistent Volumes, etc. without any issues.
It then gets stuck in a pending status for a very long time and finally comes out as PartiallyFailed. From the logs we can see the message below:

time="2019-10-24T15:15:12Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1126" restore=velero/restore-from-gcp
time="2019-10-24T15:15:12Z" level=error msg="unable to successfully complete restic restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete" logSource="pkg/restore/restore.go:1126" restore=velero/restore-from-gcp
time="2019-10-24T15:15:12Z" level=info msg="restore completed" logSource="pkg/controller/restore_controller.go:465" restore=velero/restore-from-gcp.

But the strange thing is that it is able to restore all the other components, including creating the Persistent Volumes, but unable to restore the data.
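For reference, the individual PodVolumeRestore objects can be inspected while the restore is running; Velero labels them with the restore name (per its restic troubleshooting docs), so something like this shows their status:

kubectl -n velero get podvolumerestores -l velero.io/restore-name=restore-from-gcp -o yaml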

Source

GKE Cluster

Source backup location: GCP bucket

[prassomanp_gmail_com@bastion wp-mysql]$ velero create backup wp-mysql --include-namespaces webapp
Backup request "wp-mysql" submitted successfully.
Run velero backup describe wp-mysql or velero backup logs wp-mysql for more details.
[prassomanp_gmail_com@bastion wp-mysql]$
[prassomanp_gmail_com@bastion velero-v1.1.0-linux-amd64]$ velero backup describe wp-mysql --details
Name: wp-mysql
Namespace: velero
Labels: velero.io/storage-location=default
Annotations:

Phase: Completed

Namespaces:
Included: webapp
Excluded:

Resources:
Included: *
Excluded:
Cluster-scoped: auto

Label selector:

Storage Location: default

Snapshot PVs: auto

TTL: 720h0m0s

Hooks:

Backup Format Version: 1

Started: 2019-10-24 19:18:09 +0530 IST
Completed: 2019-10-24 19:18:28 +0530 IST

Expiration: 2019-11-23 19:18:09 +0530 IST

Resource List:
apps/v1/Deployment:
- webapp/wordpress
- webapp/wordpress-mysql
apps/v1/ReplicaSet:
- webapp/wordpress-dccb8668f
- webapp/wordpress-mysql-7d4fc77fdc
v1/Endpoints:
- webapp/wordpress
- webapp/wordpress-mysql
v1/Event:
- webapp/mysql-pv-claim.15d0981661b47672
- webapp/wordpress-dccb8668f-zzx65.15d0981db4ec6cfe
- webapp/wordpress-dccb8668f-zzx65.15d0981e2c0c49cb
- webapp/wordpress-dccb8668f-zzx65.15d0981f32b733d9
- webapp/wordpress-dccb8668f-zzx65.15d09820af5e6504
- webapp/wordpress-dccb8668f-zzx65.15d0982466dfc134
- webapp/wordpress-dccb8668f-zzx65.15d09824f8788385
- webapp/wordpress-dccb8668f-zzx65.15d09825030857dc
- webapp/wordpress-dccb8668f.15d0981db4da467e
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d09815e87ffe56
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d0981663f0b116
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d09817691f3518
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d0981ac656a7f5
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d0981aca686649
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz.15d0981ad5657418
- webapp/wordpress-mysql-7d4fc77fdc.15d09815e8236957
- webapp/wordpress-mysql.15d09815e6831b9b
- webapp/wordpress.15d0981d9fe6ef20
- webapp/wordpress.15d0981db2a2e909
- webapp/wordpress.15d0982a8e3167ec
- webapp/wp-pv-claim.15d0981e29d3b0ed
v1/Namespace:
- webapp
v1/PersistentVolume:
- pvc-a79b80c7-f661-11e9-a78e-42010a80013b
- pvc-bb9786ef-f661-11e9-a78e-42010a80013b
v1/PersistentVolumeClaim:
- webapp/mysql-pv-claim
- webapp/wp-pv-claim
v1/Pod:
- webapp/wordpress-dccb8668f-zzx65
- webapp/wordpress-mysql-7d4fc77fdc-bx6rz
v1/ResourceQuota:
- webapp/gke-resource-quotas
v1/Secret:
- webapp/default-token-rc6mx
- webapp/mysql-pass
v1/Service:
- webapp/wordpress
- webapp/wordpress-mysql
v1/ServiceAccount:
- webapp/default

Persistent Volumes:

Restic Backups:
Completed:
webapp/wordpress-dccb8668f-zzx65: wordpress-persistent-storage
webapp/wordpress-mysql-7d4fc77fdc-bx6rz: mysql-persistent-storage
[prassomanp_gmail_com@bastion velero-v1.1.0-linux-amd64]$
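For context: with Velero v1.1, restic backs up only the volumes listed in a pod's backup.velero.io/backup-volumes annotation, which is why exactly these two volumes appear under Restic Backups. As a sketch, the annotation on the source MySQL pod would look something like:

kubectl -n webapp annotate pod/wordpress-mysql-7d4fc77fdc-bx6rz backup.velero.io/backup-volumes=mysql-persistent-storage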

Target

AWS cluster (KOPS)
backup-location: Source GCP bucket

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero backup-location get
NAME      PROVIDER   BUCKET/PREFIX   ACCESS MODE
default   gcp        gcpvelerotest   ReadWrite
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero backup get
NAME                 STATUS      CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
backup-w-annotate    Completed   2019-10-23 17:30:55 +0530 IST   28d       default
backup-wo-annotate   Completed   2019-10-23 17:25:00 +0530 IST   28d       default
wp-mysql             Completed   2019-10-24 19:18:09 +0530 IST   29d       default
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero restore create restore-from-gcp --from-backup wp-mysql
Restore request "restore-from-gcp" submitted successfully.
Run velero restore describe restore-from-gcp or velero restore logs restore-from-gcp for more details.
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero restore get
NAME               BACKUP     STATUS       WARNINGS   ERRORS   CREATED                         SELECTOR
restore-from-gcp   wp-mysql   InProgress   0          0        2019-10-24 19:45:12 +0530 IST
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$


While the restore was still in progress, the namespace got restored in AWS along with the Pods, Services, and Persistent Volumes.

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ kubectl get all -n webapp
NAME                                   READY   STATUS    RESTARTS   AGE
pod/wordpress-76b5d9f5c8-hfnjr         1/1     Running   0          37m
pod/wordpress-mysql-66594fb556-fpmsp   1/1     Running   0          37m

NAME                      TYPE           CLUSTER-IP      EXTERNAL-IP                                                               PORT(S)        AGE
service/wordpress         LoadBalancer   100.66.57.204   ab25be46af66811e9a4310a2a60e9fd1-495652919.us-east-1.elb.amazonaws.com   80:32467/TCP   37m
service/wordpress-mysql   ClusterIP      None            <none>                                                                    3306/TCP       37m

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/wordpress         1/1     1            1           37m
deployment.apps/wordpress-mysql   1/1     1            1           37m

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/wordpress-76b5d9f5c8         1         1         1       37m
replicaset.apps/wordpress-dccb8668f          0         0         0       37m
replicaset.apps/wordpress-mysql-66594fb556   1         1         1       37m
replicaset.apps/wordpress-mysql-7d4fc77fdc   0         0         0       37m
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ kubectl get pvc -n webapp
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
mysql-pv-claim   Bound    pvc-b1c5cf63-f668-11e9-a431-0a2a60e9fd17   3Gi        RWO            gp2            37m
wp-pv-claim      Bound    pvc-b1cb2c62-f668-11e9-a431-0a2a60e9fd17   3Gi        RWO            gp2            37m
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$

And finally, it shows as partially failed.

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero restore describe restore-from-gcp --details
Name: restore-from-gcp
Namespace: velero
Labels:
Annotations:

Phase: PartiallyFailed (run 'velero restore logs restore-from-gcp' for more information)

Errors:
Velero: timed out waiting for all PodVolumeRestores to complete
Cluster:
Namespaces:

Backup: wp-mysql

Namespaces:
Included: *
Excluded:

Resources:
Included: *
Excluded: nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
Cluster-scoped: auto

Namespace mappings:

Label selector:

Restore PVs: auto

Restic Restores:
New:
webapp/wordpress-dccb8668f-zzx65: wordpress-persistent-storage
webapp/wordpress-mysql-7d4fc77fdc-bx6rz: mysql-persistent-storage
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$


What did you expect to happen:
A complete restoration including data.

In this case, Velero was able to restore the Pods, Services, and Persistent Volumes, but failed to restore the data.
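On the target cluster, the restic side can also be checked directly; assuming the default install, where Velero runs a DaemonSet named restic in the velero namespace:

kubectl -n velero get daemonset restic
kubectl -n velero logs daemonset/restic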

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml

  • velero backup logs <backupname>

  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml

  • velero restore logs <restorename>
    target-aws-cluster-logs.txt
    gke-source-cluster-logs.txt

Anything else you would like to add:

Attached more detailed logs.

Environment:
Both source and target:

  • Velero version (use velero version):
    Client:
    Version: v1.1.0
    Git commit: a357f21
    Server:
    Version: v1.1.0
  • Velero features (use velero client config get features):
Source

velero client config get features

Target

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ velero client config get features
features:
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$

  • Kubernetes version (use kubectl version):
    AWS

[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:16Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
[ec2-user@ip-172-31-87-112 velero-v1.1.0-linux-amd64]$

GKE

[prassomanp_gmail_com@bastion ~]$ kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.1", GitCommit:"d647ddbd755faf07169599a625faf302ffc34458", GitTreeState:"clean", BuildDate:"2019-10-02T17:01:15Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.10-gke.0", GitCommit:"569511c9540f78a94cc6a41d895c382d0946c11a", GitTreeState:"clean", BuildDate:"2019-08-21T23:28:44Z", GoVersion:"go1.11.13b4", Compiler:"gc", Platform:"linux/amd64"}
[prassomanp_gmail_com@bastion ~]$

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration:
    Source: GKE
    Target: AWS (kops)

  • OS (e.g. from /etc/os-release):

prassoma changed the title from "GKE to AWS (kops) - Failed data restoration. PO,SVC & PV restored." to "GKE to AWS - msg="unable to successfully complete restic restores of pod's volumes" error="timed out waiting for all PodVolumeRestores to complete"" on Oct 25, 2019

skriss commented Oct 28, 2019

Do you have a storage class on the AWS cluster that has the same name as the one the backed-up PVs were on the GKE cluster? Or, did you set up a storage class mapping?
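For reference, you can compare them on the two clusters with:

kubectl get storageclass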

skriss added the Question and Restic (Relates to the restic integration) labels on Oct 28, 2019
@prassoma
Author

I did a storage class mapping using the YAML below, and it is able to create the PVs and PVCs but is still unable to restore the data from the source.

apiVersion: v1
kind: ConfigMap
metadata:
  # any name can be used; Velero uses the labels (below)
  # to identify it rather than the name
  name: change-storage-class-config
  # must be in the velero namespace
  namespace: velero
  # the below labels should be used verbatim in your
  # ConfigMap.
  labels:
    # this value-less label identifies the ConfigMap as
    # config for a plugin (i.e. the built-in change storage
    # class restore item action plugin)
    velero.io/plugin-config: ""
    # this label identifies the name and kind of plugin
    # that this ConfigMap is for.
    velero.io/change-storage-class: RestoreItemAction
data:
  # add 1+ key-value pairs here, where the key is the old
  # storage class name and the value is the new storage
  # class name.
  standard: gp2
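Assuming the YAML above is saved as change-storage-class-config.yaml (the filename is arbitrary), I applied it with:

kubectl apply -f change-storage-class-config.yaml
kubectl -n velero get configmap change-storage-class-config --show-labels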


skriss commented Oct 30, 2019

Hmm, OK. I am also noticing that the pods you're ending up with in the AWS cluster have different names than the ones that were backed up from the GKE cluster.

GKE:

wordpress-dccb8668f-zzx65
wordpress-mysql-7d4fc77fdc-bx6rz

AWS:

wordpress-76b5d9f5c8-hfnjr
wordpress-mysql-66594fb556-fpmsp

This implies that the pods that were restored by Velero were subsequently deleted and replaced with new ones, likely by the deployment/replicaset controllers on the target cluster. This would pose a problem for the restic restore process. Unfortunately, I'm not sure why this would be happening. Let's look at some more info: can you provide the YAML for the deployments, replicasets, and pods on both the GKE and AWS clusters? It'd be easiest if you could put this into a gist using YAML formatting.
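For example, something like this on each cluster should capture everything in one file (adjust the namespace if yours differs):

kubectl -n webapp get deployments,replicasets,pods -o yaml > webapp-resources.yaml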


prassoma commented Nov 4, 2019

Hi Kriss,
Sorry about the delay. I have put the YAMLs used for the deployments in the source cluster (GKE) in a gist, for your reference.
https://gist.github.com/prassoma/a49289fbb86471460440a0100a05e1e4

Please let me know if you need any additional information.


skriss commented Nov 14, 2019

Sorry for the delay on this. I'm seeing that in the ReplicaSets on the target cluster, you have:

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/wordpress-76b5d9f5c8         1         1         1       112m
replicaset.apps/wordpress-dccb8668f          0         0         0       112m
replicaset.apps/wordpress-mysql-66594fb556   1         1         1       112m
replicaset.apps/wordpress-mysql-7d4fc77fdc   0         0         0       112m

So it looks like the ReplicaSets that Velero actually restored were very quickly replaced by new ones, implying a Deployment rollout happened (https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment).

If you still have the environment around, could you provide the YAML for all of these ReplicaSets in the target cluster? According to the documentation, we should only get a Deployment rollout if the pod template spec changes, so I'm wondering if we can identify what changed.
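For example, diffing the ReplicaSet Velero restored against its replacement should surface whatever changed in the pod template (names taken from your output above):

kubectl -n webapp get rs wordpress-dccb8668f -o yaml > restored-rs.yaml
kubectl -n webapp get rs wordpress-76b5d9f5c8 -o yaml > new-rs.yaml
diff restored-rs.yaml new-rs.yaml

kubectl rollout history deployment/wordpress -n webapp may also show what triggered the new revision.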


skriss commented Nov 14, 2019

Potentially similar cause to #1981


skriss commented Dec 3, 2019

Closing this out as inactive; feel free to reach out again as needed.

skriss closed this as completed on Dec 3, 2019