Possible race condition in the testcloud plugin #2687

Closed
psss opened this issue Feb 15, 2024 · 2 comments · Fixed by #2695

psss commented Feb 15, 2024

Seems that /tests/prepare/multihost sometimes fails to connect to the guest.

[guest-3]         multihost name: guest-3
[guest-3]         arch: x86_64
[guest-3]         distro: Fedora Linux 39 (Cloud Edition)
[guest-3]         kernel: 6.5.6-300.fc39.x86_64
[guest-3]         package manager: dnf
[guest-3]         selinux: yes
[guest-3]         is superuser: yes
[guest-1]         finished
[guest-1]         fail: Failed to connect in 300s.
    finish
    
[guest-2]         guest: stopped
[guest-2]         guest: removed
[guest-3]         guest: stopped
[guest-3]         guest: removed

Here's an example job and one more. As @happz mentioned in #2677, this smells of race conditions. @frantisekz, could you please have a look?

@psss psss added the plugin | testcloud The testcloud virtual provision plugin label Feb 15, 2024
@psss psss added this to the 1.32 milestone Feb 15, 2024
@frantisekz frantisekz self-assigned this Feb 20, 2024
frantisekz commented

So, I dug into this a bit:

I was able to reproduce the issue (on some attempts) with the following tmt plan:

/test:
    test: echo

/plan:
    execute:
        how: tmt
    discover:
        how: fmf

    provision:
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual
        - how: virtual

The thing is, in the cases where it fails, like in the mentioned jobs, the ssh connection succeeds on a later attempt. The cloud-init data is generated and appended properly in the affected VMs, so my best guess is that the ssh connection is attempted before cloud-init finishes its job in the VM.

I've checked whether something like disabling ssh during early boot, and letting cloud-init start it only after it finishes what it needs to do, would help (it would). The problem is that it seems to be impossible to pass grub arguments via libvirt (we would have to restructure it to use a direct kernel boot, which is a can of worms on its own).

Another possible way to handle it would be to append (tmt-side) "-o PasswordAuthentication=no" to the ssh connections that should be using an ssh key. This way, the connection would fail instead of showing a password prompt, and that failure should be handled just fine by the retry mechanism already present in tmt.
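To illustrate the idea, here is a minimal Python sketch, not tmt's actual code. The helper names `build_ssh_command` and `run_with_retry`, their parameters, and the retry policy are all hypothetical; the only part taken from the proposal above is the "-o PasswordAuthentication=no" option. With that option set, an ssh attempt made before the key is installed exits with a non-zero status instead of waiting at a password prompt, so a simple retry loop can pick it up.

```python
import subprocess
import time
from typing import Callable, List, Optional


def build_ssh_command(host: str, user: str, key_path: str, remote_command: str) -> List[str]:
    """Build a key-only ssh command line (hypothetical helper, not tmt API)."""
    return [
        "ssh",
        "-i", key_path,
        # Never fall back to a password prompt; fail fast instead.
        "-o", "PasswordAuthentication=no",
        f"{user}@{host}",
        remote_command,
    ]


def _default_runner(command: List[str]) -> int:
    """Run the command without a usable stdin and return its exit code."""
    result = subprocess.run(command, stdin=subprocess.DEVNULL, capture_output=True)
    return result.returncode


def run_with_retry(
    command: List[str],
    attempts: int = 5,
    delay: float = 5.0,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> int:
    """Retry the command until it exits with 0; return the attempt number used.

    The attempt count and delay are illustrative assumptions only.
    """
    run = runner or _default_runner
    for attempt in range(1, attempts + 1):
        if run(command) == 0:
            return attempt
        if attempt < attempts:
            # Give cloud-init in the guest more time before trying again.
            time.sleep(delay)
    raise TimeoutError(f"Failed to connect after {attempts} attempts.")
```

A caller would build the command once, for example `build_ssh_command("192.0.2.10", "root", "/path/to/id_key", "true")`, and pass it to `run_with_retry`. The `runner` parameter exists only so the loop can be exercised without a real guest.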

I'll try to come up with a PR for this.

frantisekz added a commit to frantisekz/tmt that referenced this issue Feb 21, 2024
frantisekz added a commit to frantisekz/tmt that referenced this issue Feb 21, 2024
psss pushed a commit to frantisekz/tmt that referenced this issue Feb 21, 2024
psss pushed a commit to frantisekz/tmt that referenced this issue Feb 22, 2024
psss pushed a commit to frantisekz/tmt that referenced this issue Feb 28, 2024
lukaszachy pushed a commit to frantisekz/tmt that referenced this issue Mar 4, 2024
@psss psss closed this as completed in db39ff9 Mar 4, 2024

psss commented Mar 19, 2024

@frantisekz, hmmm, seems the issue is still there. Here's a recent job where the multihost test failed. Now we have a detailed log as well.
