Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test executed twice instead of executing the next test #2997

Closed
pholica opened this issue Jun 7, 2024 · 18 comments · Fixed by #2998
Closed

Test executed twice instead of executing the next test #2997

pholica opened this issue Jun 7, 2024 · 18 comments · Fixed by #2998
Assignees
Labels
priority | must high priority, must be included in the next release step | execute Stuff related to the execute step
Milestone

Comments

@pholica
Copy link
Contributor

pholica commented Jun 7, 2024

I've executed following test plan using TMT: https:/pholica/tmt_testrepo/blob/6087865da91737f8bf9c9ff2d2f19da1311f6cac/plans/example.fmf
from repo: https:/pholica/tmt_testrepo/tree/6087865da91737f8bf9c9ff2d2f19da1311f6cac

During the execution, I hit a situation when the 01_log_plan_data was actually executed twice, second time as if it was 02_log_test_data. See the result of 03_list_root and output from 02_log_test_data.

I'm not hitting this behavior consistently, but rather randomly but quite frequently.

$ tmt run -a provision --how connect --guest 192.168.122.111
/var/tmp/tmt/run-033

/plans/example/broken_reboot
    discover
        how: fmf
        directory: /home/pholica/git/pholica/tmt_testrepo
        summary: 3 tests selected
    provision
        queued provision.provision task #1: default-0
        
        provision.provision task #1: default-0
        how: connect
        primary address: 192.168.122.111
        topology address: 192.168.122.111
        multihost name: default-0
        arch: x86_64
        distro: Red Hat Enterprise Linux 9.4 (Plow)
    
        summary: 1 guest provisioned
    prepare
        queued push task #1: push to default-0
        
        push task #1: push to default-0
    
        queued prepare task #1: requires on default-0
        
        prepare task #1: requires on default-0
        how: install
        summary: Install required packages
        name: requires
        where: default-0
        package: /usr/bin/flock
    
        queued pull task #1: pull from default-0
        
        pull task #1: pull from default-0
    
        summary: 1 preparation applied
    execute
        queued execute task #1: default-0 on default-0
        
        execute task #1: default-0 on default-0
        how: tmt
        progress:                                            
    
        summary: 3 tests executed
    report
        how: display
        summary: 2 tests passed and 1 test failed
    finish
    
        summary: 0 tasks completed

total: 2 tests passed and 1 test failed
$ cat /var/tmp/tmt/run-033/plans/example/broken_reboot/execute/data/guest/default-0/plans/example/tests/02_log_test_data-2/output.txt 
TMT_PLAN_DATA='/var/tmp/tmt/run-033/plans/example/broken_reboot/data'
plan_data_foobar
$ tmt --version
tmt version: 1.33.0
$ rpm -q tmt
tmt-1.33.0-1.fc39.noarch
@pholica
Copy link
Contributor Author

pholica commented Jun 7, 2024

Attaching tmt debug output and archive of the run:
duplicate_test_execution.zip
tmt_debug_output.txt

@happz happz added the step | execute Stuff related to the execute step label Jun 7, 2024
@lukaszachy
Copy link
Collaborator

Note: "provision -h connect" is much more likely to hit the issue.

@psss
Copy link
Collaborator

psss commented Jun 7, 2024

Sounds like an important one, let's try to squeeze this into 1.34.

@psss psss added this to the 1.34 milestone Jun 7, 2024
@psss psss changed the title TMT under certain circumstances executes test twice instead executing the next test Test executed twice instead of executing the next test Jun 7, 2024
@psss psss added the priority | must high priority, must be included in the next release label Jun 7, 2024
@lukaszachy
Copy link
Collaborator

Just a hunch, but could it be that the content of tmt-test-wrapper.sh-default-0-default-0 was not flushed on host yet and tmt pushed old code? All three tests are defined in the plan so tmt creates a script for them on the fly.

@happz
Copy link
Collaborator

happz commented Jun 7, 2024

Hmm, I'm unable to reproduce this behavior :(

BTW run -e TMT_DEBUG=1 should provide more info about what's happening in those wrappers (adds set -x).

Just a hunch, but could it be that the content of tmt-test-wrapper.sh-default-0-default-0 was not flushed on host yet and tmt pushed old code? All three tests are defined in the plan so tmt creates a script for them on the fly.

Could be, also rsync might fail silently and not upload the modified version to guest. I see no other option. From the log it's clear the code running was not the one from the second test, so it's clearly somewhere there.

@lukaszachy
Copy link
Collaborator

I've added cat tmt-test-wrapper.sh-default-0-default-0 to the test code and it is indeed same content.

[0 lzachar@ r_094054_1Oq ]$ cat /var/tmp/tmt/run-244/example/broken_reboot/execute/data/guest/default-0/example/tests/01_log_plan_data-1/output.txt 
TMT_PLAN_DATA='/var/tmp/tmt/run-244/example/broken_reboot/data'
plan_data_foobar
example.fmf
tmt-test-wrapper.sh-default-0-default-0
set -eo pipefail; echo "TMT_PLAN_DATA='${TMT_PLAN_DATA}'"
echo plan_data_foobar | tee ${TMT_PLAN_DATA}/plan_log.txt
touch /root/01_log_plan_data
ls
cat tmt-test-wrapper.sh-default-0-default-0
[0 lzachar@ r_094054_1Oq ]$ cat /var/tmp/tmt/run-244/example/broken_reboot/execute/data/guest/default-0/example/tests/02_log_test_data-2/output.txt 
TMT_PLAN_DATA='/var/tmp/tmt/run-244/example/broken_reboot/data'
plan_data_foobar
example.fmf
tmt-test-wrapper.sh-default-0-default-0
set -eo pipefail; echo "TMT_PLAN_DATA='${TMT_PLAN_DATA}'"
echo plan_data_foobar | tee ${TMT_PLAN_DATA}/plan_log.txt
touch /root/01_log_plan_data
ls
cat tmt-test-wrapper.sh-default-0-default-0

@lukaszachy
Copy link
Collaborator

So to be on safer side we should have unique names for tmt-test-wrapper.sh-default-0-default-0, right?

@happz
Copy link
Collaborator

happz commented Jun 7, 2024

Yeah, I can do that, I'm just curious why it's happening. I can't reproduce it, you obviously can, easily, so where's the problem :) Some mount option, some fsync setting?

happz added a commit that referenced this issue Jun 7, 2024
To prevent accidental reuse of the wrapper, reported in #2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See #2997
@happz
Copy link
Collaborator

happz commented Jun 7, 2024

So here's my proposal, #2998

Still, I can't reproduce the issue, so I have no idea how to trigger the error and test the fix. The filename is now unique, let me know whether it would help - or trigger rsync error as it cannot find the local file...

@lukaszachy
Copy link
Collaborator

This looks like the file was not modified on the guest, right?

11:22:00         test wrapper: /var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0
11:22:00         Test script: set -eo pipefail; touch /root/02_log_test_data

11:22:00         Write file '/var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0'.
11:22:00         Copy '/var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0' to '/var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0' on the guest.
11:22:00         Run command: rsync -vvv -s -p --chmod=755 -e 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -i /usr/share/qa-tools/1minutetip/1minutetip -oPasswordAuthentication=no -S/run/user/17062/tmt/tmpszkwx4yq' /var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0 [email protected]:/var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0
11:22:00         environment
11:22:00         out: opening connection using: ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oConnectionAttempts=5 -oConnectTimeout=60 -oServerAliveInterval=5 -oServerAliveCountMax=60 -oIdentitiesOnly=yes -i /usr/share/qa-tools/1minutetip/1minutetip -oPasswordAuthentication=no -S/run/user/17062/tmt/tmpszkwx4yq -l root 192.168.122.192 rsync --server -svvvpe.LsfxCIvu  (19 args)
11:22:00         out: protected args: . /var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0  (2 args)
11:22:00         out: [sender] make_file(tmt-test-wrapper.sh-default-0-default-0,*,0)
11:22:00         out: send_file_list done
11:22:00         out: send_files starting
11:22:00         out: server_recv(2) starting pid=20996
11:22:00         out: recv_file_name(tmt-test-wrapper.sh-default-0-default-0)
11:22:00         out: received 1 names
11:22:00         out: recv_file_list done
11:22:00         out: get_local_name count=1 /var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0
11:22:00         out: generator starting pid=20996
11:22:00         out: delta-transmission enabled
11:22:00         out: recv_generator(tmt-test-wrapper.sh-default-0-default-0,0)
11:22:00         out: tmt-test-wrapper.sh-default-0-default-0 is uptodate
11:22:00         out: send_files(0, /var/tmp/tmt/BUG/plan/broken_reboot/discover/default-0/tests/tmt-test-wrapper.sh-default-0-default-0)
11:22:00         out: generate_files phase=1
11:22:00         out: send_files phase=1
11:22:00         out: recv_files(1) starting
11:22:00         out: recv_files(tmt-test-wrapper.sh-default-0-default-0)
11:22:00         out: recv_files phase=1
11:22:00         out: generate_files phase=2
11:22:00         out: send_files phase=2
11:22:00         out: send files finished
11:22:00         out: total: matches=0  hash_hits=0  false_alarms=0 data=0
11:22:00         out: recv_files phase=2
11:22:00         out: recv_files finished
11:22:00         out: generate_files phase=3
11:22:00         out: generate_files finished
11:22:01         out: 
11:22:01         out: sent 78 bytes  received 632 bytes  1,420.00 bytes/sec
11:22:01         out: total size is 47  speedup is 0.07
11:22:01         out: [sender] _exit_cleanup(code=0, file=main.c, line=1338): about to call exit(0)

@lukaszachy
Copy link
Collaborator

Ah, rsync has 'checksum' as "optional" decision factor to update files...

@lukaszachy
Copy link
Collaborator

WIth -c added to rsync options I'm no longer able to reproduce...

@happz
Copy link
Collaborator

happz commented Jun 7, 2024

:))

Anyway, we want both, right? More unique wrapper filenames and rsync considering the content of files it's syncing? Any risk in added -c? Things may take longer, because of computing checksums?

@lukaszachy
Copy link
Collaborator

lukaszachy commented Jun 7, 2024

Anyway, we want both, right? More unique wrapper filenames and rsync considering the content of files it's syncing? Any risk in added -c? Things may take longer, because of computing checksums?

I'm not sure how often -c will be the deciding factor. It should hit only for similar cases as above and even then only if pushing same file names. SO it shouldn't hurt us do both (hopefully rsync will first try time/attrs..) MOST pushing should have create files on host, right?

@lukaszachy
Copy link
Collaborator

Hm, manpage hints it is XOR --checksum, -c skip based on checksum, not mod-time & size..

@lukaszachy lukaszachy linked a pull request Jun 7, 2024 that will close this issue
2 tasks
@lukaszachy
Copy link
Collaborator

So, should we include both? The #2998 fixes the problem.

Based on manpage we can have rsync do either 'quick check' or 'checksum', I originally thought it checksum will be just 'make sure quick check was correct' but it replaces the check completely.

@happz
Copy link
Collaborator

happz commented Jun 7, 2024

With the patch, we shouldn't hit the problem when a filename changes its content, therefore the need for checksum would diminish.

Maybe we could add an envvar for additional rsync options, expose it in TF and see whether it increases testing times by hours or not, and decide WRT checksum after that.

happz added a commit that referenced this issue Jun 7, 2024
To prevent accidental reuse of the wrapper, reported in #2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See #2997
@psss
Copy link
Collaborator

psss commented Jun 10, 2024

Maybe we could add an envvar for additional rsync options, expose it in TF and see whether it increases testing times by hours or not, and decide WRT checksum after that.

I think this would be probably the safest way. I'm afraid that we can expect some significant slow down when the --checksum option is enabled, especially for larger files.

psss pushed a commit that referenced this issue Jun 10, 2024
To prevent accidental reuse of the wrapper, reported in #2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See #2997
happz added a commit that referenced this issue Jun 10, 2024
To prevent accidental reuse of the wrapper, reported in #2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See #2997
falconizmi pushed a commit that referenced this issue Jun 17, 2024
To prevent accidental reuse of the wrapper, reported in #2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See #2997
The-Mule pushed a commit to The-Mule/tmt that referenced this issue Oct 14, 2024
To prevent accidental reuse of the wrapper, reported in teemtee#2997, and
improve investigation by saving all wrappers, make the test shell
wrapper filename even more unique that it already was. Besides the phase
and guest names, include also test name and its serial number. That
should make the filename unlike any other filename created for other
tests.

See teemtee#2997
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority | must high priority, must be included in the next release step | execute Stuff related to the execute step
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants