
Restic volume not restored when using OpenShift (DeploymentConfig + ReplicationController) #1981

Closed
SebastienTolron opened this issue Oct 18, 2019 · 21 comments
Labels: Bug, Restic (Relates to the restic integration), Reviewed Q2 2021

@SebastienTolron

SebastienTolron commented Oct 18, 2019

What steps did you take and what happened:

I'm trying to restore a restic volume.

My backup contains 2 volumes across 2 deployments.

Backup


tools-bitbucket-backup-prv2   Completed                    2019-10-18 08:45:44 +0200 CEST   29d       cluster-tools      <none>


Persistent Volumes: <none included>
Restic Backups:
  Completed:
    ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm: bitbucket-postgresql-data
    ok101-bitbucket-pr/bitbucket-server-14-ddrpp: bitbucket-server-data

When I restore from this backup, the postgres volume is restored properly but the bitbucket-server one is not.

tools-bitbucket-backup-prv2-20191018090027   tools-bitbucket-backup-prv2   InProgress   0          0        2019-10-18 09:00:27 +0200 CEST   <none>
 velero restore describe tools-bitbucket-backup-prv2-20191018090027

Restic Restores:
  Completed:
    ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm: bitbucket-postgresql-data
  New:
    ok101-bitbucket-pr/bitbucket-server-14-ddrpp: bitbucket-server-data

The init container is not created on the bitbucket-server pod, so the restic restore stays stuck in the "New" phase, yet the pod is created and running. It shouldn't be.

 kubectl get po
NAME                                        READY     STATUS             RESTARTS   AGE
bitbucket-server-15-glk7c                   1/1       Running            0          5m

*** Restic log ***

time="2019-10-18T07:00:48Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:00:48Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:00:59Z" level=debug msg="Restore's pod ok101-bitbucket-pr/bitbucket-postgresql-5-vvwsm not found, not enqueueing." controller=pod-volume-restore error="pod \"bitbucket-postgresql-5-vvwsm\" not found" logSource="pkg/controller/pod_volume_restore_controller.go:137" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:00:59Z" level=debug msg="Restore's pod ok101-bitbucket-pr/bitbucket-server-14-ddrpp not found, not enqueueing." controller=pod-volume-restore error="pod \"bitbucket-server-14-ddrpp\" not found" logSource="pkg/controller/pod_volume_restore_controller.go:137" name=tools-bitbucket-backup-prv2-20191018090027-gghc8 namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:01Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:01Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:03Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:09Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:09Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:11Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:12Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:13Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-glk7c logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:14Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:14Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok101-bitbucket-pr/bitbucket-server-15-deploy logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:19Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:20Z" level=debug msg="Pod is not running restic-wait init container, not enqueuing restores for pod" controller=pod-volume-restore key=ok24-velero/velero-7f5d784896-l66m7 logSource="pkg/controller/pod_volume_restore_controller.go:170"
time="2019-10-18T07:01:22Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
time="2019-10-18T07:01:31Z" level=debug msg="Restore is not new, not enqueuing" controller=pod-volume-restore logSource="pkg/controller/pod_volume_restore_controller.go:131" name=tools-bitbucket-backup-prv2-20191018090027-6dkgf namespace=ok24-velero restore=ok24-velero/tools-bitbucket-backup-prv2-20191018090027
W1018 07:05:36.261218       1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:197: watch of *v1.Secret ended with: The resourceVersion for the provided watch is too old.
W1018 07:05:52.312131       1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeBackup ended with: The resourceVersion for the provided watch is too old.

*** Velero Log ***

https://gist.github.com/Stolr/02ee7e4ee7d662b94df52de93f953ab3

*** PodVolumeRestore ***

kubectl -n ok24-velero get podvolumerestores -l velero.io/restore-name=tools-bitbucket-backup-prv2-20191018090027  -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-18T07:00:59Z
    generateName: tools-bitbucket-backup-prv2-20191018090027-
    generation: 1
    labels:
      velero.io/pod-uid: 0a33afb5-f175-11e9-967b-005056b9b6b7
      velero.io/restore-name: tools-bitbucket-backup-prv2-20191018090027
      velero.io/restore-uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-prv2-20191018090027-6dkgf
    namespace: ok24-velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-prv2-20191018090027
      uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    resourceVersion: "853596"
    selfLink: /apis/velero.io/v1/namespaces/ok24-velero/podvolumerestores/tools-bitbucket-backup-prv2-20191018090027-6dkgf
    uid: 0a35ffec-f175-11e9-967b-005056b9b6b7
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-postgresql-5-vvwsm
      namespace: ok101-bitbucket-pr
      uid: 0a33afb5-f175-11e9-967b-005056b9b6b7
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: 4bd49d6e
    volume: bitbucket-postgresql-data
  status:
    completionTimestamp: 2019-10-18T07:01:31Z
    message: ""
    phase: Completed
    progress:
      bytesDone: 83536468
      totalBytes: 83536468
    startTimestamp: 2019-10-18T07:01:10Z
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-18T07:00:59Z
    generateName: tools-bitbucket-backup-prv2-20191018090027-
    generation: 1
    labels:
      velero.io/pod-uid: 0a3b5638-f175-11e9-967b-005056b9b6b7
      velero.io/restore-name: tools-bitbucket-backup-prv2-20191018090027
      velero.io/restore-uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-prv2-20191018090027-gghc8
    namespace: ok24-velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-prv2-20191018090027
      uid: f7012bc8-f174-11e9-bf99-005056b9c7f4
    resourceVersion: "853188"
    selfLink: /apis/velero.io/v1/namespaces/ok24-velero/podvolumerestores/tools-bitbucket-backup-prv2-20191018090027-gghc8
    uid: 0a3d1615-f175-11e9-967b-005056b9b6b7
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-server-14-ddrpp
      namespace: ok101-bitbucket-pr
      uid: 0a3b5638-f175-11e9-967b-005056b9b6b7
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: e5a5986e
    volume: bitbucket-server-data
  status:
    completionTimestamp: null
    message: ""
    phase: ""
    startTimestamp: null
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

** Environment **

velero version
Client:
Version: v1.1.0
Git commit: a357f21
Server:
Version: v1.1.0

oc v3.11.0+0cbc58b
kubernetes v1.11.0+d4cacc0
openshift v3.11.0+bdd37ad-314
kubernetes v1.11.0+d4cacc0

The namespace does not exist before the restore, so every resource is new on the cluster.

Any idea?

Thanks a lot

@skriss
Contributor

skriss commented Oct 18, 2019

hmm, based on the following lines:

W1018 07:05:36.261218       1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/cmd/cli/restic/server.go:197: watch of *v1.Secret ended with: The resourceVersion for the provided watch is too old.
W1018 07:05:52.312131       1 reflector.go:302] github.com/vmware-tanzu/velero/pkg/generated/informers/externalversions/factory.go:117: watch of *v1.PodVolumeBackup ended with: The resourceVersion for the provided watch is too old.

it looks like there might be an issue with the informer caches.

Could you try deleting all of the restic daemonset pods, letting them get re-created, and then trying another restore? (you'll want to delete the target namespace as well before kicking off the new restore)
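For reference, a minimal sketch of that cleanup, assuming the restic DaemonSet pods carry the stock name=restic label and live in the ok24-velero namespace seen in the logs above:

# Delete the restic DaemonSet pods so they get re-created with fresh informer caches
kubectl -n ok24-velero delete pod -l name=restic

# Remove the previously restored target namespace before retrying
kubectl delete namespace ok101-bitbucket-pr

# Kick off a new restore once the restic pods are Running again
velero restore create --from-backup tools-bitbucket-backup-prv2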

@SebastienTolron
Author

Hi @skriss

Thanks for the answer.

I already tried this.

I'm restoring onto another cluster. It is a fresh one, so there should not be any cache, right?

Could that be the issue (restoring to another cluster)?

I'm not able to try it until Monday, but I'm not sure it will fix the issue since I already tried on a fresh instance.

Any other idea? :)

@SebastienTolron
Author

SebastienTolron commented Oct 21, 2019

Hey,

So I made a fresh new install to test this and to make sure this is not a cache issue.

Here is the whole procedure to help you debug (I install the restic DaemonSet before Velero because I adapted it for OKD; maybe that could be the issue).

Installation on Cluster Tools && Cluster Tools-B:

kubectl create ns velero
namespace/velero created

oc annotate namespace velero openshift.io/node-selector=""
namespace/velero annotated

oc adm policy add-scc-to-user privileged system:serviceaccount:velero:velero
scc "privileged" added to: ["system:serviceaccount:velero:velero"]

oc apply -f serviceAccount.yaml
serviceaccount/velero created

kubectl apply -f daemonSetrestic.yaml
daemonset.extensions/restic created

velero install \
    --provider aws \
    --bucket velero \
    --use-restic \
    --secret-file ./credentials-velero  \
    --use-volume-snapshots=false \
    --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://oca-miniolb.oca.local/ 
	
CustomResourceDefinition/schedules.velero.io: attempting to create resource
CustomResourceDefinition/schedules.velero.io: created
CustomResourceDefinition/deletebackuprequests.velero.io: attempting to create resource
CustomResourceDefinition/deletebackuprequests.velero.io: created
CustomResourceDefinition/podvolumerestores.velero.io: attempting to create resource
CustomResourceDefinition/podvolumerestores.velero.io: created
CustomResourceDefinition/volumesnapshotlocations.velero.io: attempting to create resource
CustomResourceDefinition/volumesnapshotlocations.velero.io: created
CustomResourceDefinition/backups.velero.io: attempting to create resource
CustomResourceDefinition/backups.velero.io: created
CustomResourceDefinition/downloadrequests.velero.io: attempting to create resource
CustomResourceDefinition/downloadrequests.velero.io: created
CustomResourceDefinition/podvolumebackups.velero.io: attempting to create resource
CustomResourceDefinition/podvolumebackups.velero.io: created
CustomResourceDefinition/resticrepositories.velero.io: attempting to create resource
CustomResourceDefinition/resticrepositories.velero.io: created
CustomResourceDefinition/backupstoragelocations.velero.io: attempting to create resource
CustomResourceDefinition/backupstoragelocations.velero.io: created
CustomResourceDefinition/serverstatusrequests.velero.io: attempting to create resource
CustomResourceDefinition/serverstatusrequests.velero.io: created
CustomResourceDefinition/restores.velero.io: attempting to create resource
CustomResourceDefinition/restores.velero.io: created
Waiting for resources to be ready in cluster...
Namespace/velero: attempting to create resource
Namespace/velero: already exists, proceeding
Namespace/velero: created
ClusterRoleBinding/velero: attempting to create resource
ClusterRoleBinding/velero: created
ServiceAccount/velero: attempting to create resource
ServiceAccount/velero: already exists, proceeding
ServiceAccount/velero: created
Secret/cloud-credentials: attempting to create resource
Secret/cloud-credentials: created
BackupStorageLocation/default: attempting to create resource
BackupStorageLocation/default: created
Deployment/velero: attempting to create resource
Deployment/velero: created
DaemonSet/restic: attempting to create resource
DaemonSet/restic: already exists, proceeding
DaemonSet/restic: created
Velero is installed! ⛵ Use 'kubectl logs deployment/velero -n velero' to view the status.	

Cluster Tools Backup location

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2019-10-21T06:14:03Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "48443311"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: fb1ae7ac-f3c9-11e9-843a-005056b9cf2b
spec:
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://oca-miniolb.oca.local/
  objectStorage:
    bucket: velero
    prefix: "tools"
  provider: aws

Cluster Tools-B Backup location

apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  creationTimestamp: 2019-10-21T06:16:31Z
  generation: 1
  labels:
    component: velero
  name: default
  namespace: velero
  resourceVersion: "1743001"
  selfLink: /apis/velero.io/v1/namespaces/velero/backupstoragelocations/default
  uid: 53050648-f3ca-11e9-8991-005056b92845
spec:
  config:
    region: minio
    s3ForcePathStyle: "true"
    s3Url: http://oca-miniolb.oca.local/
  objectStorage:
    bucket: velero
    prefix: "tools-b"
  provider: aws
  

 velero backup-location create cluster-tools \
    --provider aws \
    --bucket velero \
    --access-mode ReadOnly  \
    --config region=minio,s3ForcePathStyle="true",s3Url=http://oca-miniolb.oca.local/

And then I edited the cluster-tools BackupStorageLocation to add the "tools" prefix.
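For completeness, a minimal sketch of that edit, assuming only the objectStorage.prefix field is added to the existing cluster-tools location:

kubectl -n velero edit backupstoragelocation cluster-tools

# resulting spec (abridged):
spec:
  objectStorage:
    bucket: velero
    prefix: "tools"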

Now everything is running fine:

 kubectl get po
NAME                     READY     STATUS    RESTARTS   AGE
restic-2s7n8             1/1       Running   0          6m
restic-2zcr7             1/1       Running   0          6m
restic-7wbrt             1/1       Running   0          6m
restic-8zn8n             1/1       Running   0          6m
restic-cfq77             1/1       Running   0          6m
restic-djvrj             1/1       Running   0          6m
restic-kpnvt             1/1       Running   0          6m
restic-n58w2             1/1       Running   0          6m
restic-n5c6x             1/1       Running   0          6m
restic-ssvpp             1/1       Running   0          6m
restic-wjxj7             1/1       Running   0          6m
restic-wsj94             1/1       Running   0          6m
restic-xgxjt             1/1       Running   0          6m
restic-zvfvj             1/1       Running   0          6m
velero-df87fbb89-m2tbh   1/1       Running   2          6m

Cluster Tools:

Creating the backup

kubectl -n ok101-bitbucket-pr annotate pod/bitbucket-postgresql-5-vvwsm backup.velero.io/backup-volumes=bitbucket-postgresql-data
kubectl -n ok101-bitbucket-pr annotate pod/bitbucket-server-14-ddrpp backup.velero.io/backup-volumes=bitbucket-server-data

velero backup create tools-bitbucket-backup --include-namespaces=ok101-bitbucket-pr

Backup get

velero backup get
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
tools-bitbucket-backup PartiallyFailed (2 errors) 2019-10-21 08:23:26 +0200 CEST 29d default

Velero Logs:

https://gist.github.com/Stolr/23d0dd11b301150ccb336a12b77107a1

Backup description

velero backup describe tools-bitbucket-backup --details

https://gist.github.com/Stolr/9b862178df8f951cbd9b50357bd502c8

Backup logs

velero backup logs tools-bitbucket-backup

https://gist.github.com/Stolr/293051c52536541fec55f924f76386be

I can see there are 2 errors, but it says my restic backups are completed, so it should not be relevant. This is probably due to some pods in that namespace not being correct. The first time I didn't have that error, but the restic issue was already there.

Now, on Cluster Tools-B:

velero backup get
NAME                     STATUS                       CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
tools-bitbucket-backup   PartiallyFailed (2 errors)   2019-10-21 08:23:26 +0200 CEST   29d       cluster-tools      <none>

velero restore create --include-namespaces=ok101-bitbucket-pr --from-backup tools-bitbucket-backup

Restore request "tools-bitbucket-backup-20191021083745" submitted successfully.
Run `velero restore describe tools-bitbucket-backup-20191021083745` or `velero restore logs tools-bitbucket-backup-20191021083745` for more details.

Same issue: the restore stays InProgress because the restic volume is not restored.

 velero restore get
NAME                                    BACKUP                   STATUS       WARNINGS   ERRORS   CREATED                          SELECTOR
tools-bitbucket-backup-20191021083745   tools-bitbucket-backup   InProgress   0          0        2019-10-21 08:37:45 +0200 CEST   <none>

 kubectl get po -n ok101-bitbucket-pr
NAME                                              READY     STATUS            RESTARTS   AGE
bitbucket-postgresql-5-vvwsm                      0/1       PodInitializing   0          39s
bitbucket-pr-data-backup-1571351700-gfhf4         0/1       Pending           0          39s
bitbucket-pr-data-backup-1571438100-jb2dv         0/1       Pending           0          39s
bitbucket-pr-postgresql-backup-1571436000-rtsf7   0/1       Pending           0          39s
bitbucket-server-15-6528q                         1/1       Running           0          34s

Restic Logs

https://gist.github.com/Stolr/dabac536a5235b87ecd184045ab2e7b5

Velero Logs

https://gist.github.com/Stolr/5dfc2f7fea9c63f0ddbd61d9276ac984

Restore Logs

Not available:
Logs for restore "tools-bitbucket-backup-20191021083745" are not available until it's finished processing. Please wait until the restore has a phase of Completed or Failed and try again.

PodVolumeRestore

kubectl -n velero get podvolumerestores -l velero.io/restore-name=tools-bitbucket-backup-20191021083745 -o yaml
apiVersion: v1
items:
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-21T06:37:46Z
    generateName: tools-bitbucket-backup-20191021083745-
    generation: 1
    labels:
      velero.io/pod-uid: 4ae0fee4-f3cd-11e9-bf99-005056b9c7f4
      velero.io/restore-name: tools-bitbucket-backup-20191021083745
      velero.io/restore-uid: 4a83ec78-f3cd-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-20191021083745-9pm46
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-20191021083745
      uid: 4a83ec78-f3cd-11e9-bf99-005056b9c7f4
    resourceVersion: "1747867"
    selfLink: /apis/velero.io/v1/namespaces/velero/podvolumerestores/tools-bitbucket-backup-20191021083745-9pm46
    uid: 4b5e4bfa-f3cd-11e9-bf99-005056b9c7f4
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-server-14-ddrpp
      namespace: ok101-bitbucket-pr
      uid: 4ae0fee4-f3cd-11e9-bf99-005056b9c7f4
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: d17de56d
    volume: bitbucket-server-data
  status:
    completionTimestamp: null
    message: ""
    phase: ""
    startTimestamp: null
- apiVersion: velero.io/v1
  kind: PodVolumeRestore
  metadata:
    creationTimestamp: 2019-10-21T06:37:46Z
    generateName: tools-bitbucket-backup-20191021083745-
    generation: 1
    labels:
      velero.io/pod-uid: 4adbf402-f3cd-11e9-bf99-005056b9c7f4
      velero.io/restore-name: tools-bitbucket-backup-20191021083745
      velero.io/restore-uid: 4a83ec78-f3cd-11e9-bf99-005056b9c7f4
    name: tools-bitbucket-backup-20191021083745-mlrmd
    namespace: velero
    ownerReferences:
    - apiVersion: velero.io/v1
      controller: true
      kind: Restore
      name: tools-bitbucket-backup-20191021083745
      uid: 4a83ec78-f3cd-11e9-bf99-005056b9c7f4
    resourceVersion: "1748261"
    selfLink: /apis/velero.io/v1/namespaces/velero/podvolumerestores/tools-bitbucket-backup-20191021083745-mlrmd
    uid: 4b5ecc1e-f3cd-11e9-bf99-005056b9c7f4
  spec:
    backupStorageLocation: cluster-tools
    pod:
      kind: Pod
      name: bitbucket-postgresql-5-vvwsm
      namespace: ok101-bitbucket-pr
      uid: 4adbf402-f3cd-11e9-bf99-005056b9c7f4
    repoIdentifier: s3:http://oca-miniolb.oca.local/velero/tools/restic/ok101-bitbucket-pr
    snapshotID: 01fc7cc8
    volume: bitbucket-postgresql-data
  status:
    completionTimestamp: 2019-10-21T06:38:20Z
    message: ""
    phase: Completed
    startTimestamp: 2019-10-21T06:37:59Z
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

My bitbucket data is not restored and no init container is created, but the postgres one is working as expected.

Do you see anything in all these logs that could explain this?

Thanks for your help!

@skriss
Contributor

skriss commented Oct 21, 2019

@Stolr I'm not exactly sure what's going on, but I do see that during the backup, the pod being backed up is bitbucket-server-14-ddrpp, and then during/after the restore, you end up with pod bitbucket-server-15-6528q. I do see in the Velero server log that pod bitbucket-server-14-ddrpp is restored during the restore, but it seems like it's probably being deleted and replaced with bitbucket-server-15-6528q.

I'm not super-familiar with OpenShift's DeploymentConfigs and (apparently) their use of ReplicationControllers, but in plain vanilla Kubernetes, the way this would work is we'd restore pod "14", then restore the ReplicaSet controlling it, and that ReplicaSet would see pod "14" and "adopt" it. It seems like possibly something about the DeploymentConfig/ReplicationController is preventing this "adoption" from happening, and triggering the creation of a new pod "15".

Does this ring any bells for you? Maybe we can figure it out together :)

@yashbhutwala

@Stolr @skriss sorry to jump into the conversation, just a thought: instead of annotating the pod itself, can you try annotating the pod template spec of the parent controller, i.e. the Deployment or ReplicationController?
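A minimal sketch of that suggestion, assuming a parent controller named bitbucket-server for the volume in this thread (adjust the name and namespace to your own resources):

# Annotate the pod template spec so re-created pods keep the restic annotation
kubectl -n ok101-bitbucket-pr patch deployment bitbucket-server \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"bitbucket-server-data"}}}}}'

# Roughly the same patch for an OpenShift DeploymentConfig
oc -n ok101-bitbucket-pr patch dc/bitbucket-server \
  -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes":"bitbucket-server-data"}}}}}'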

@SebastienTolron
Author

@skriss Wow, thanks!!
You are right, the deployment number is not the same.

For some reason, OpenShift triggers a new deploy, probably because of all the resources being restored. There is no way to return to revision 14, even with a rollback.

I'm also not super familiar with OpenShift.
I got rid of that DeploymentConfig (no point using it) and adapted everything to use a normal Deployment.

Everything is working as expected using Deployments.

@yashbhutwala: I might try this when I'm able to. Thanks for your answer.

Thanks to you both for your help. Since this issue is related to OpenShift, you can close or rename the issue if you want.

Best regards

Thanks again for helping me get through this.

@skriss
Contributor

skriss commented Oct 22, 2019

@sseago @dymurray do you guys have any thoughts on what's going on here? (#1981 (comment))

@skriss skriss changed the title Restic volume not restored Restic volume not restored when using OpenShift (DeploymentConfig + ReplicationController) Oct 22, 2019
@skriss skriss added Restic Relates to the restic integration Bug Waiting for info labels Oct 22, 2019
@sseago
Collaborator

sseago commented Oct 22, 2019

Off the top of my head, I'm not sure what's going on, although I haven't looked at the logs in detail yet. The redeployment of a new pod may well be affecting things here, since the new pod probably won't have the restic annotation.

For the work my group has been doing, we actually do a two-phase backup/restore, in part to eliminate as much complexity as possible from the environment restic is working in. We create a full backup without any restic annotations, and then a limited backup with just the PVs/PVCs and the pods which mount them, with the restic annotations. Then, on restore, we first restore the restic backup (pods only, no Deployments, DeploymentConfigs, etc.) -- this is when the restic copies happen. Then those restored pods are deleted and we do the full restore (without restic annotations).

I don't know that all of this is necessary for a basic backup/restore -- in our case we're using it for app migration from one cluster to another, with the possibility of running the restic/PV migration more than once before the final migration. In any case, if you're restoring DeploymentConfigs which then roll out new pods post-restore, that could definitely interfere with restic. I don't know what the appropriate general-purpose answer is here -- our approach has been for a very specific migration use case.

I wonder whether the same issue comes up with non-OpenShift resources: DaemonSets, Deployments, etc. Annotating the pod template spec, as suggested above (in addition to annotating the pod), may be the way to go here. I'm not sure whether it will resolve this issue completely or not, though.
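A rough sketch of that two-phase flow with the velero CLI, assuming hypothetical backup names and the ok101-bitbucket-pr namespace from this thread (the resource filters are illustrative, not the exact plugin-driven flow Scott's group uses):

# Phase 1: limited backup of just the volumes and the pods that mount them
# (only these pods carry the backup.velero.io/backup-volumes annotation)
velero backup create bitbucket-volumes \
    --include-namespaces ok101-bitbucket-pr \
    --include-resources pods,persistentvolumeclaims,persistentvolumes

# Phase 2: full backup of the application (in the flow described above, this one has no restic annotations)
velero backup create bitbucket-full --include-namespaces ok101-bitbucket-pr

# On restore: restic data first (pods only, so no controller redeploys them yet)...
velero restore create --from-backup bitbucket-volumes

# ...then delete the temporary pods and restore the full application
kubectl -n ok101-bitbucket-pr delete pods --all
velero restore create --from-backup bitbucket-full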

@dymurray
Contributor

To add on to what Scott said: yes, we hit this same problem very early on. This is a problem that extends beyond OCP-specific restores; my understanding is that any pod which is managed by another resource faces this risk.

If a pod is managed by another resource, the restic restore will generally fail, since both the pod and the managing resource are restored, which causes the initial pod (with the restic annotation) to be overwritten. I could have sworn there was an open issue on this but I can't seem to find it right now.

@yashbhutwala

@dymurray not sure if this covers all of what you're saying, but I opened an issue a month ago about a similar problem. See: #1919

@skriss
Contributor

skriss commented Feb 25, 2020

If a pod is managed by another resource, the restic restore will generally fail, since both the pod and the managing resource are restored, which causes the initial pod (with the restic annotation) to be overwritten. I could have sworn there was an open issue on this but I can't seem to find it right now.

We haven't seen this, at least not with pods managed by replicasets/deployments. Per my comment (#1981 (comment)), during a restic restore, we first restore the pod & trigger a restic restore, then restore the owning replicaset and deployment. The pod is successfully "adopted" by the replicaset, since the pod's spec matches the pod template spec from the replicaset.

If that behavior were different, then I agree it would likely cause problems with restic restores, which seems to be what we're seeing here. Can you shed any more light onto why the DeploymentConfig restore is triggering the creation of a new pod, rather than adopting the existing one?
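One way to check whether that adoption happened is to compare the restored pod's ownerReferences and labels against the controller's selector; a quick sketch using the pod name from this thread:

# Which controller (if any) owns the restored pod?
kubectl -n ok101-bitbucket-pr get pod bitbucket-server-14-ddrpp \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'

# Do the pod's labels still match the ReplicationController's selector?
kubectl -n ok101-bitbucket-pr get pod bitbucket-server-14-ddrpp --show-labels
kubectl -n ok101-bitbucket-pr get rc -o wide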

@sseago
Collaborator

sseago commented Feb 28, 2020

From what I've seen with DeploymentConfigs, they don't always trigger new pods, but sometimes they do. I believe they actually do (initially) adopt the restored pod, as expected, but if there's a ConfigChange trigger registered, then the restore event on the DeploymentConfig will sometimes fire it if the restore process looks like a configuration change.

Most of my experience here is in restoring resources to a different cluster than the backup came from, with some spec params modified by a plugin on restore ("image" references, for example, if the image is located in an in-cluster registry). The pod as restored will run for a short amount of time, but will terminate as soon as the ConfigChange-triggered replacement is ready. Most recently, this week I've restored a couple of DeploymentConfigs to the same cluster the backup was run in, and in that case I did not see a replacement being created post-restore.
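For context, a ConfigChange trigger on a DeploymentConfig looks roughly like this (abridged apps.openshift.io/v1 spec; the DC and container names are placeholders):

apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: bitbucket-server
spec:
  replicas: 1
  triggers:
  # Rolls out a new deployment whenever the pod template changes --
  # which a restore (especially one that modifies spec params via a plugin) can look like
  - type: ConfigChange
  - type: ImageChange
    imageChangeParams:
      automatic: true
      containerNames:
      - bitbucket-server
      from:
        kind: ImageStreamTag
        name: bitbucket-server:latest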

@dymurray
Contributor

So I spent some time digging into this, and based on what I've learned I can say that yes, the method Velero is currently taking with restic restores has its shortcomings. Currently, we are lucky that a deployment doesn't trigger a new generation of the pod in 99% of the restore use cases. If you specifically trigger a redeploy during the restic restore, then things will break, as shown in #1919.

With DeploymentConfigs, there are a number of triggers you can set which will trigger the redeploy of a pod, but the bigger issue is that currently with DCs the pod is restored first with the restic annotation and then later adopted by the DC controller and redeployed, wiping the annotation out. If a plugin is used to skip restoring a pod that is managed by a DC, in conjunction with placing the annotation on the DC pod template spec, then the restic restore has a good chance of succeeding; but the broader concern that Kubernetes could trigger a new deployment for Deployments and DeploymentConfigs during restore is a larger problem that needs to be solved.

@skriss
Contributor

skriss commented Mar 6, 2020

Open to ideas on how to improve this. The data populator KEP that's making the rounds upstream may be relevant/useful, though AFAIK it's only for PVs, not arbitrary pod volumes.

@dejwsz

dejwsz commented May 29, 2020

Well, I had just the same problem! Restore completed, no errors in logs but the PV is completely empty! Sucks.

@dejwsz

dejwsz commented May 29, 2020

I wanted to restore only the PVC together with the PV itself, and ran:

velero restore create --from-backup daily-20200528020046 --include-namespaces test-project --include-resources persistentvolumeclaims,persistentvolumes --restore-volumes=true

Completed, no errors. But there is no data at all.
I did not expect that. Is there any way to make it work with restic?
I have a DeploymentConfig, but "replicas" is set to 0 and I removed ConfigChange from the triggers.
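For anyone following along, a sketch of those two preparatory steps, assuming a DeploymentConfig named my-app in the test-project namespace used above:

# Scale the DeploymentConfig down so its controller doesn't recreate pods mid-restore
oc -n test-project scale dc/my-app --replicas=0

# Remove the ConfigChange trigger so the restore itself can't roll out a new deployment
oc -n test-project set triggers dc/my-app --from-config --remove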

@dejwsz

dejwsz commented May 29, 2020

What is interesting is that I tested it before, but only after removing the whole project, and then it was OK and even the data was there. So it only works when restoring whole projects? Is it not possible to restore just a volume?

@dejwsz

dejwsz commented May 29, 2020

I can confirm: I can restore volumes only when restoring a whole project, i.e. a whole namespace, and it must be empty.
You cannot restore volumes if there are already objects like deployments or other things in the namespace, and you cannot restore a PVC and PV by themselves separately using restic.

So in my case, I needed to restore into a mapped temporary namespace, go there and scale everything down, then spin up a new pod just to attach the PV and rsync the data out of the volume to my host. Then I deleted the temporary namespace, ran the helper pod again in my original project, connected it to the PV, and rsynced all the data back in. Later I did a chown with the user ID of the container, removed the helper pod, and finally scaled the deployment back up. It worked and the data from the backup snapshot was there, but the process is very inconvenient in such cases, very clumsy.
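A rough sketch of that helper-pod workflow (the pod name, PVC name, mount path, and image are placeholders; oc rsync needs rsync or tar available inside the container):

# helper-pod.yaml -- a throwaway pod that only mounts the PVC
apiVersion: v1
kind: Pod
metadata:
  name: pv-helper
  namespace: test-project
spec:
  containers:
  - name: helper
    image: registry.access.redhat.com/ubi8/ubi
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data

oc apply -f helper-pod.yaml

# Copy the volume contents out of the temporary namespace to the local host...
oc -n test-project rsync pv-helper:/data ./data-dump

# ...and later, with the same helper pod running in the original project, copy them back in
oc -n <original-project> rsync ./data-dump/ pv-helper:/data

oc -n <original-project> delete pod pv-helper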

@galindro

galindro commented Aug 19, 2020

I'm facing this issue when restoring a backup of prometheus-operator. My restore tests were done in the same cluster where the backup lives, but in another namespace. The production application was still live in its own namespace.

My cluster is running in EKS. Its version is 1.16.

There are three PVs that should be backed up: grafana, prometheus, and alertmanager. The prometheus and grafana PVs could be restored without problems, but the alertmanager PV could not, because the alertmanager StatefulSet is dynamically created by an Alertmanager object (from the monitoring.coreos.com/v1 API). I can see in the velero logs that it successfully restored the alertmanager pod and injected the restic-wait init container into it. But when the Alertmanager object is restored, it creates the StatefulSet, which replaces the pod.

These are the velero logs showing the restic-wait init container being added to the alertmanager pod:

time="2020-08-18T11:23:39Z" level=info msg="Restoring resource 'pods' into namespace 'monitoring-restored'" logSource="pkg/restore/restore.go:702" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Getting client for /v1, Kind=Pod" logSource="pkg/restore/restore.go:746" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing AddPVCFromPodAction" cmd=/velero logSource="pkg/restore/add_pvc_from_pod_action.go:44" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Adding PVC monitoring/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0 as an additional item to restore" cmd=/velero logSource="pkg/restore/add_pvc_from_pod_action.go:58" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Skipping persistentvolumeclaims/monitoring-restored/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0 because it's already been restored." logSource="pkg/restore/restore.go:844" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing item action for pods" logSource="pkg/restore/restore.go:964" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:69" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Restic backups for pod found" cmd=/velero logSource="pkg/restore/restic_restore_action.go:95" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Getting plugin config" cmd=/velero logSource="pkg/restore/restic_restore_action.go:99" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:160" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Using image \"velero/velero-restic-restore-helper:v1.4.2\"" cmd=/velero logSource="pkg/restore/restic_restore_action.go:106" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:195" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="No config found for plugin" cmd=/velero logSource="pkg/restore/restic_restore_action.go:206" pluginName=velero pod=monitoring/alertmanager-prometheus-operator-alertmanager-0 restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Done executing ResticRestoreAction" cmd=/velero logSource="pkg/restore/restic_restore_action.go:155" pluginName=velero restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=info msg="Attempting to restore Pod: alertmanager-prometheus-operator-alertmanager-0" logSource="pkg/restore/restore.go:1070" restore=velero/monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Acquiring lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:122" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Acquired lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:131" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Ready repository found" backupLocation=default logSource="pkg/restic/repository_ensurer.go:147" volumeNamespace=monitoring
time="2020-08-18T11:23:39Z" level=debug msg="Released lock" backupLocation=default logSource="pkg/restic/repository_ensurer.go:128" volumeNamespace=monitoring

1 second later, the Alertmanager object is restored:

time="2020-08-18T11:23:40Z" level=info msg="Restoring resource 'alertmanagers.monitoring.coreos.com' into namespace 'monitoring-restored'" logSource="pkg/restore/restore.go:702" restore=velero/monitoring
time="2020-08-18T11:23:40Z" level=info msg="Getting client for monitoring.coreos.com/v1, Kind=Alertmanager" logSource="pkg/restore/restore.go:746" restore=velero/monitoring
time="2020-08-18T11:23:40Z" level=info msg="Attempting to restore Alertmanager: prometheus-operator-alertmanager" logSource="pkg/restore/restore.go:1070" restore=velero/monitoring

This is the backup's content:

velero backup describe monitoring --details
Name:         monitoring
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.16.8-eks-e16311
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=16+

Phase:  Completed

Errors:    0
Warnings:  0

Namespaces:
  Included:  monitoring
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        certificates.cert-manager.io, certificaterequests.cert-manager.io, orders.acme.cert-manager.io
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-08-18 10:17:16 +0200 CEST
Completed:  2020-08-18 10:17:57 +0200 CEST

Expiration:  2020-09-17 10:17:16 +0200 CEST

Total items to be backed up:  234
Items backed up:              234

Resource List:
  apiextensions.k8s.io/v1/CustomResourceDefinition:
    - alertmanagers.monitoring.coreos.com
    - prometheuses.monitoring.coreos.com
    - prometheusrules.monitoring.coreos.com
    - servicemonitors.monitoring.coreos.com
  apps/v1/ControllerRevision:
    - monitoring/alertmanager-prometheus-operator-alertmanager-54df75fb5b
    - monitoring/prometheus-operator-prometheus-node-exporter-599f4fbbfd
    - monitoring/prometheus-prometheus-operator-prometheus-6cbd9d8d8b
  apps/v1/DaemonSet:
    - monitoring/prometheus-operator-prometheus-node-exporter
  apps/v1/Deployment:
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-operator
  apps/v1/ReplicaSet:
    - monitoring/prometheus-operator-grafana-5986dbf74f
    - monitoring/prometheus-operator-grafana-7ff4f8b97b
    - monitoring/prometheus-operator-kube-state-metrics-6f8cc5ffd5
    - monitoring/prometheus-operator-operator-fd978d8d7
  apps/v1/StatefulSet:
    - monitoring/alertmanager-prometheus-operator-alertmanager
    - monitoring/prometheus-prometheus-operator-prometheus
  extensions/v1beta1/Ingress:
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-prometheus
  monitoring.coreos.com/v1/Alertmanager:
    - monitoring/prometheus-operator-alertmanager
  monitoring.coreos.com/v1/Prometheus:
    - monitoring/prometheus-operator-prometheus
  monitoring.coreos.com/v1/PrometheusRule:
    - monitoring/prometheus-operator-alertmanager.rules
    - monitoring/prometheus-operator-etcd
    - monitoring/prometheus-operator-general.rules
    - monitoring/prometheus-operator-k8s.rules
    - monitoring/prometheus-operator-kube-apiserver-slos
    - monitoring/prometheus-operator-kube-apiserver.rules
    - monitoring/prometheus-operator-kube-prometheus-general.rules
    - monitoring/prometheus-operator-kube-prometheus-node-recording.rules
    - monitoring/prometheus-operator-kube-scheduler.rules
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-kubelet.rules
    - monitoring/prometheus-operator-kubernetes-apps
    - monitoring/prometheus-operator-kubernetes-resources
    - monitoring/prometheus-operator-kubernetes-storage
    - monitoring/prometheus-operator-kubernetes-system
    - monitoring/prometheus-operator-kubernetes-system-apiserver
    - monitoring/prometheus-operator-kubernetes-system-controller-manager
    - monitoring/prometheus-operator-kubernetes-system-kubelet
    - monitoring/prometheus-operator-kubernetes-system-scheduler
    - monitoring/prometheus-operator-node-exporter
    - monitoring/prometheus-operator-node-exporter.rules
    - monitoring/prometheus-operator-node-network
    - monitoring/prometheus-operator-node.rules
    - monitoring/prometheus-operator-prometheus
    - monitoring/prometheus-operator-prometheus-operator
  monitoring.coreos.com/v1/ServiceMonitor:
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-apiserver
    - monitoring/prometheus-operator-coredns
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-kube-controller-manager
    - monitoring/prometheus-operator-kube-etcd
    - monitoring/prometheus-operator-kube-proxy
    - monitoring/prometheus-operator-kube-scheduler
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-kubelet
    - monitoring/prometheus-operator-node-exporter
    - monitoring/prometheus-operator-operator
    - monitoring/prometheus-operator-prometheus
  networking.k8s.io/v1beta1/Ingress:
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-prometheus
  rbac.authorization.k8s.io/v1/ClusterRole:
    - prometheus-operator-grafana-clusterrole
    - prometheus-operator-kube-state-metrics
    - prometheus-operator-operator
    - prometheus-operator-operator-psp
    - prometheus-operator-prometheus
    - prometheus-operator-prometheus-psp
    - psp-prometheus-operator-kube-state-metrics
    - psp-prometheus-operator-prometheus-node-exporter
  rbac.authorization.k8s.io/v1/ClusterRoleBinding:
    - prometheus-operator-grafana-clusterrolebinding
    - prometheus-operator-kube-state-metrics
    - prometheus-operator-operator
    - prometheus-operator-operator-psp
    - prometheus-operator-prometheus
    - prometheus-operator-prometheus-psp
    - psp-prometheus-operator-kube-state-metrics
    - psp-prometheus-operator-prometheus-node-exporter
  rbac.authorization.k8s.io/v1/Role:
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-grafana-test
  rbac.authorization.k8s.io/v1/RoleBinding:
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-grafana-test
  v1/ConfigMap:
    - monitoring/prometheus-operator-apiserver
    - monitoring/prometheus-operator-cluster-total
    - monitoring/prometheus-operator-controller-manager
    - monitoring/prometheus-operator-etcd
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-grafana-config-dashboards
    - monitoring/prometheus-operator-grafana-datasource
    - monitoring/prometheus-operator-grafana-test
    - monitoring/prometheus-operator-k8s-coredns
    - monitoring/prometheus-operator-k8s-resources-cluster
    - monitoring/prometheus-operator-k8s-resources-namespace
    - monitoring/prometheus-operator-k8s-resources-node
    - monitoring/prometheus-operator-k8s-resources-pod
    - monitoring/prometheus-operator-k8s-resources-workload
    - monitoring/prometheus-operator-k8s-resources-workloads-namespace
    - monitoring/prometheus-operator-kubelet
    - monitoring/prometheus-operator-namespace-by-pod
    - monitoring/prometheus-operator-namespace-by-workload
    - monitoring/prometheus-operator-node-cluster-rsrc-use
    - monitoring/prometheus-operator-node-rsrc-use
    - monitoring/prometheus-operator-nodes
    - monitoring/prometheus-operator-persistentvolumesusage
    - monitoring/prometheus-operator-pod-total
    - monitoring/prometheus-operator-prometheus
    - monitoring/prometheus-operator-proxy
    - monitoring/prometheus-operator-scheduler
    - monitoring/prometheus-operator-statefulset
    - monitoring/prometheus-operator-workload-total
    - monitoring/prometheus-prometheus-operator-prometheus-rulefiles-0
  v1/Endpoints:
    - monitoring/alertmanager-operated
    - monitoring/prometheus-operated
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-operator
    - monitoring/prometheus-operator-prometheus
    - monitoring/prometheus-operator-prometheus-node-exporter
  v1/Event:
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4eac037b378f
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4eac3d7a4c20
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4eacfd856868
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4ead0a39ac70
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4ead13445eeb
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4ead713ac0dc
    - monitoring/prometheus-operator-admission-create-ngxh5.162c4ead8cff268e
    - monitoring/prometheus-operator-admission-create.162c4eac0309e0cb
    - monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb4cb068bca
    - monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb517441275
    - monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb51d3ac352
    - monitoring/prometheus-operator-admission-patch-4pt6r.162c4eb52b6739be
    - monitoring/prometheus-operator-admission-patch.162c4eb4ca533c92
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e619ce31070
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e637284176b
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e71b870b6b4
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7b2a4186a1
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ba8d7beae
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c23594737
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c2b84195d
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c36882b14
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7c67022081
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e1d1be052
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e2fa25cfc
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7e40871cd9
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ec738b8ab
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7eca575457
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ed9a8b480
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e7ed9d7309c
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e826db5397b
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e82883b0412
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4e82a018a2d3
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4eb25495f5c9
    - monitoring/prometheus-operator-grafana-5986dbf74f-nv429.162c4eb25496faf5
    - monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e6199f96154
    - monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e619a02fad3
    - monitoring/prometheus-operator-grafana-5986dbf74f-q7q88.162c4e619a049a56
    - monitoring/prometheus-operator-grafana-5986dbf74f.162c4e619c9d5316
    - monitoring/prometheus-operator-grafana-5986dbf74f.162c4eb254704b48
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf32ce5cd2
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf7c0b856d
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf7f2b718e
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf874cd7a1
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9133924c
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9468b3d9
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaf9decda56
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafce8b1887
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafd2252390
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafdbc8ec47
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafdc3af0c0
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eafe8f543b5
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk.162c4eaff4fd17b2
    - monitoring/prometheus-operator-grafana-7ff4f8b97b.162c4eaf31f3b2ba
    - monitoring/prometheus-operator-grafana.162c4eaf3087e1e1
    - monitoring/prometheus-operator-grafana.162c4eb253680d39
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e71bff6bc71
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7210a514b8
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e735b1ee1b9
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7405ed1b22
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e74199ddb10
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7b19464cfd
    - monitoring/prometheus-operator-prometheus-node-exporter-slszj.162c4e7b1a45f166
    - monitoring/prometheus-operator-prometheus-node-exporter.162c4e71bdefdbaf
    - monitoring/prometheus-operator-prometheus-node-exporter.162c4e7b1a499523
  v1/Namespace:
    - monitoring
  v1/PersistentVolume:
    - pvc-502cf99f-99fb-4a83-abd9-2a15bcf2a30d
    - pvc-7107894a-2ede-473e-9c24-2cb5a3f9d7f1
    - pvc-e6d638c0-b4a8-4bcf-a9d1-1f66c387c7e9
  v1/PersistentVolumeClaim:
    - monitoring/alertmanager-prometheus-operator-alertmanager-db-alertmanager-prometheus-operator-alertmanager-0
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-prometheus-operator-prometheus-db-prometheus-prometheus-operator-prometheus-0
  v1/Pod:
    - monitoring/alertmanager-prometheus-operator-alertmanager-0
    - monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk
    - monitoring/prometheus-operator-kube-state-metrics-6f8cc5ffd5-47jbw
    - monitoring/prometheus-operator-operator-fd978d8d7-cf956
    - monitoring/prometheus-operator-prometheus-node-exporter-fxl7s
    - monitoring/prometheus-prometheus-operator-prometheus-0
  v1/Secret:
    - monitoring/alertmanager-prometheus-operator-alertmanager
    - monitoring/alertmanager.ict.navinfo.cloud-tls
    - monitoring/default-token-vf8dm
    - monitoring/grafana.ict.navinfo.cloud-tls
    - monitoring/ict-admission
    - monitoring/prometheus-operator-admission
    - monitoring/prometheus-operator-alertmanager-token-jxljb
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-grafana-test-token-q5lsl
    - monitoring/prometheus-operator-grafana-token-949ch
    - monitoring/prometheus-operator-kube-state-metrics-token-9gsz5
    - monitoring/prometheus-operator-operator-token-556vs
    - monitoring/prometheus-operator-prometheus-node-exporter-token-9f545
    - monitoring/prometheus-operator-prometheus-token-bxb9w
    - monitoring/prometheus-prometheus-operator-prometheus
    - monitoring/prometheus-prometheus-operator-prometheus-tls-assets
    - monitoring/prometheus.ict.navinfo.cloud-tls
    - monitoring/sh.helm.release.v1.prometheus-operator.v1
    - monitoring/sh.helm.release.v1.prometheus-operator.v2
  v1/Service:
    - monitoring/alertmanager-operated
    - monitoring/prometheus-operated
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-operator
    - monitoring/prometheus-operator-prometheus
    - monitoring/prometheus-operator-prometheus-node-exporter
  v1/ServiceAccount:
    - monitoring/default
    - monitoring/prometheus-operator-alertmanager
    - monitoring/prometheus-operator-grafana
    - monitoring/prometheus-operator-grafana-test
    - monitoring/prometheus-operator-kube-state-metrics
    - monitoring/prometheus-operator-operator
    - monitoring/prometheus-operator-prometheus
    - monitoring/prometheus-operator-prometheus-node-exporter

Velero-Native Snapshots: <none included>

Restic Backups:
  Completed:
    monitoring/alertmanager-prometheus-operator-alertmanager-0: alertmanager-prometheus-operator-alertmanager-db
    monitoring/prometheus-operator-grafana-7ff4f8b97b-jxwzk: storage
    monitoring/prometheus-prometheus-operator-prometheus-0: prometheus-prometheus-operator-prometheus-db

These are the restore details. Note that velero couldn't restore the alertmanager-prometheus-operator-alertmanager StatefulSet because it was already created by the Alertmanager object. It also couldn't restore the prometheus-prometheus-operator-prometheus StatefulSet, because that one is created by the Prometheus object (another prometheus-operator CRD). But its PV could be restored, because the created StatefulSet was able to "adopt" the restored pod. I have no clue why the alertmanager StatefulSet couldn't "adopt" the restored alertmanager pod. Perhaps a race condition or something else...

Name:         monitoring
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs monitoring' for more information)

Warnings:
  Velero:     <none>
  Cluster:  could not restore, customresourcedefinitions.apiextensions.k8s.io "alertmanagers.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, customresourcedefinitions.apiextensions.k8s.io "prometheuses.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, customresourcedefinitions.apiextensions.k8s.io "prometheusrules.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, customresourcedefinitions.apiextensions.k8s.io "servicemonitors.monitoring.coreos.com" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-grafana-clusterrolebinding" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-kube-state-metrics" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-operator-psp" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-operator" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-prometheus-psp" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-prometheus" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "psp-prometheus-operator-kube-state-metrics" already exists. Warning: the in-cluster version is different than the backed-up version.
            could not restore, clusterrolebindings.rbac.authorization.k8s.io "psp-prometheus-operator-prometheus-node-exporter" already exists. Warning: the in-cluster version is different than the backed-up version.
  Namespaces:
    monitoring-restored:  could not restore, endpoints "alertmanager-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
                          could not restore, services "alertmanager-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
                          could not restore, services "prometheus-operated" already exists. Warning: the in-cluster version is different than the backed-up version.
                          could not restore, statefulsets.apps "alertmanager-prometheus-operator-alertmanager" already exists. Warning: the in-cluster version is different than the backed-up version.
                          could not restore, statefulsets.apps "prometheus-prometheus-operator-prometheus" already exists. Warning: the in-cluster version is different than the backed-up version.

Errors:
  Velero:   timed out waiting for all PodVolumeRestores to complete
  Cluster:    <none>
  Namespaces: <none>

Backup:  monitoring

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  monitoring=monitoring-restored

Label selector:  <none>

Restore PVs:  auto

Restic Restores:
  Completed:
    monitoring-restored/prometheus-operator-grafana-7ff4f8b97b-jxwzk: storage
    monitoring-restored/prometheus-prometheus-operator-prometheus-0: prometheus-prometheus-operator-prometheus-db
  New:
    monitoring-restored/alertmanager-prometheus-operator-alertmanager-0: alertmanager-prometheus-operator-alertmanager-db

I'll try to first restore the Pods and PVs and then the rest.

@galindro

The PV restore using the command below was executed successfully:

velero restore create monitoring-1 --from-backup monitoring --namespace-mappings monitoring:monitoring-restored \
  --exclude-resources=alertmanager.monitoring.coreos.com,prometheuses.monitoring.coreos.com

After that, I could restore the Alertmanager and Prometheus objects without issues:

velero restore create monitoring-cdrs --from-backup monitoring --namespace-mappings monitoring:monitoring-restored \
  --include-resources=alertmanager.monitoring.coreos.com,prometheuses.monitoring.coreos.com

@eleanor-millman
Contributor

Closing this because this issue was (mostly) resolved for the reporter, but @sseago or @dymurray feel free to reopen if you want to work on this.
