Image Automation controller silently stops working #286

Closed
1 task done
sjdweb opened this issue Jan 4, 2022 · 4 comments · Fixed by fluxcd/source-controller#606
sjdweb commented Jan 4, 2022

Describe the bug

On our clusters, we had connectivity issues on Dec 26th. I noticed that since then, image automation failed to update (but ImagePolicies were up to date).

Here are the last logs for one of the controllers:

{"level":"error","ts":"2021-12-26T21:37:24.895Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://[email protected]/myco/fleet-infra', error: SSH could not read data: Error waiting on socket"}
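The log entry above is structured JSON, so its fields can be pulled out programmatically when checking when a controller last reported an error. A minimal sketch, assuming the line is available as a string; `summarize` is a hypothetical helper, not part of Flux, and the clone URL is copied exactly as it appears in the report:

```python
import json

# Log line copied verbatim from the report above.
LOG_LINE = '''{"level":"error","ts":"2021-12-26T21:37:24.895Z","logger":"controller-runtime.manager.controller.imageupdateautomation","msg":"Reconciler error","reconciler group":"image.toolkit.fluxcd.io","reconciler kind":"ImageUpdateAutomation","name":"flux-system","namespace":"flux-system","error":"unable to clone 'ssh://[email protected]/myco/fleet-infra', error: SSH could not read data: Error waiting on socket"}'''


def summarize(line):
    """Hypothetical helper: extract the fields most useful for triage."""
    entry = json.loads(line)
    return {
        "timestamp": entry["ts"],
        "level": entry["level"],
        "kind": entry["reconciler kind"],
        "object": entry["namespace"] + "/" + entry["name"],
        "error": entry["error"],
    }


if __name__ == "__main__":
    for key, value in summarize(LOG_LINE).items():
        print(key + ": " + value)
```

Comparing the extracted timestamp with the current time is one way to notice that a controller has gone quiet after its last error.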

Since that point in time, the controller stopped working.
Killing the pod fixed this issue, but it'd be great if it could self-heal in this scenario?

Steps to reproduce

  1. Let the image automation controller lose connectivity to the Git repo
  2. Observe that the controller neither retries the connection nor crashes

Expected behavior

The controller should try again to reconcile as the connectivity would have been resolved.
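The expected behavior can be illustrated with a generic retry loop. This is a minimal sketch, not the controller's actual code (the controller is written in Go). `CloneError`, `reconcile_with_retry`, and the `clone` callable are hypothetical names introduced for illustration. It also assumes that `clone` enforces its own timeout and raises on failure, since a retry loop cannot help if the call never returns.

```python
import time


class CloneError(Exception):
    """Hypothetical error standing in for a failed or timed-out Git clone."""


def reconcile_with_retry(clone, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call clone() until it succeeds, backing off exponentially between failures.

    Assumes clone() raises CloneError instead of blocking forever.
    Re-raises the last CloneError once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return clone()
        except CloneError:
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    attempts = []

    def flaky_clone():
        # Fails twice, then succeeds, to mimic a transient connectivity issue.
        attempts.append(1)
        if len(attempts) < 3:
            raise CloneError("SSH could not read data: Error waiting on socket")
        return "cloned"

    print(reconcile_with_retry(flaky_clone, base_delay=0.1))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without real waiting.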

Screenshots and recordings

No response

OS / Distro

Ubuntu 20.04

Flux version

flux version 0.16.1

Flux check

N/A

Git provider

GitHub

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
stefanprodan transferred this issue from fluxcd/flux2 Jan 4, 2022
stefanprodan added the bug label Jan 4, 2022
squaremo self-assigned this Jan 5, 2022
@aholbreich

Same for us.

But it all seems to come down to the Git level, if I'm not mistaken: fluxcd/source-controller#439 (comment)

@squaremo
Member

@aholbreich There are two related problems:

  1. getting "error: SSH could not read data: Error waiting on socket" in the logs
  2. the image-automation-controller stops doing anything after that message

I know you are seeing the log message, from the comment you linked -- are you also experiencing the second problem, that the controller stops doing anything?

@aholbreich

Hi, no. I haven't really seen controller problems.

pjbgf self-assigned this Mar 7, 2022
Member

pjbgf commented Mar 7, 2022

Same as #282. I will be updating that thread instead; here's the latest comment:

#282 (comment)
