Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check randomly returning true unexpectedly #2912

Closed
pcsegal opened this issue Feb 12, 2023 · 3 comments · Fixed by #2915
Closed

check randomly returning true unexpectedly #2912

pcsegal opened this issue Feb 12, 2023 · 3 comments · Fixed by #2915
Labels
Milestone

Comments

@pcsegal
Copy link

pcsegal commented Feb 12, 2023

Brief summary

During local tests with a large number of preallocated VUs, I have encountered situations
where check should have returned false, but it returned true.

This is difficult to reproduce, since it happens randomly. It seems to be caused by a race condition.

k6 version

0.42.0.

OS

WSL2 (Ubuntu 20.04) in Windows 10.

Docker version and image (if applicable)

Docker version: 23.0.0; image: loadimpact/k6:0.42.0.

Steps to reproduce the problem

Unfortunately, this problem is difficult to reproduce. It seems to be caused by a race condition,
and I haven't been able to find a scenario under which it is guaranteed to happen.

You may need to run the below steps multiple times before you see the issue come up.

I was able to see this problem happen several times when running the following minimal example in Docker, using the following command:

cat script.js | docker run --cpus=2 --memory=4G -i loadimpact/k6:0.42.0 run -

The script.js file has the following content:

import { check, sleep } from 'k6';

export const options = {
    scenarios: {
        test: {
            executor: "constant-arrival-rate",
            preAllocatedVUs: 20000,
            rate: 10000,
            timeUnit: "1s",
            duration: "60s",
            exec: "default",
        },
    }
};

function randomNumber() {
    return (Math.random() * 5) + 2
}

export default function() {
    sleep(randomNumber())
    const checkOutput1 = check(null, {
        "my check 1": (res) => false,
    })
    if (checkOutput1) {
        console.log("checkOutput1:", checkOutput1)
    }
    sleep(randomNumber())
    const checkOutput2 = check(null, {
        "my check 2": (res) => false,
    })
    if (checkOutput2) {
        console.log("checkOutput1:", checkOutput1, "checkOutput2:", checkOutput2)
    }
}

Expected behaviour

The console.log lines should never be printed, since both checks always return false.

Actual behaviour

Sometimes, randomly, one (or both) of the console.log lines is printed.

@pcsegal pcsegal added the bug label Feb 12, 2023
@mstoykov
Copy link
Contributor

Thanks for reporting this 🙇

I have been running a (slightly modified in VUs , rate and sleep) test that does 500iter/s for around 10h at this point.

There hasn't been a single occurance of this :(

Can you please check that your memory is stable with memtest86 (for example). Do you have other stability issues or applications that act strangely sometimes?

I am going to continue running the test and will even try some more experiments. Looking at the code the only explanation will be some kind of very strange bug in the js VM ... which is unlikely, but still 🤷.

@mstoykov mstoykov added evaluation needed proposal needs to be validated or tested before fully implementing it in k6 awaiting user waiting for user to respond labels Feb 13, 2023
@pcsegal
Copy link
Author

pcsegal commented Feb 13, 2023

Thank you for the reply.

I hadn't thought of memory errors. That script was being run in a managed cloud environment, and I can't run something like memtest86 in this environment, but I'm trying to run memtester. It hasn't reported errors so far, though, but I will continue trying to run it.

In my environment, this issue seems to become more likely when there is a lot of concurrent activity (thousands of iterations per second, reaching the maximum of preallocated VUs).

mstoykov added a commit that referenced this issue Feb 14, 2023
Before this check was only returning `false` if it also could emit it,
and true otherwise.

This should rarely be a problem as the same context that is being
checked here is the one that interrupts the VM - so no code will be able
to run after this.

Unfortunately the code checking if it should be emitted is racing with
the one that interrupts the VM. So it is possible for the VM to still
not be interrupted when a `check` returns a wrong value. It is even
possible for more code to be run before the interrupt is actually
called.

The code still checks the context as this also updates the internal
check structure and we don't want to that if the context is done.
The above should be changed with #2869

Fixes #2912
@mstoykov
Copy link
Contributor

Ok I managed to get it to happen and I was definitely not going to get it with what I was trying 🤦

tl;dr: this happens only when the test is stopping and the check gets "interrupted" mid checking, but the VM is not interrupted yet.

More in depth:

Is the place where the check will set the return to false:

k6/js/modules/k6/k6.go

Lines 230 to 231 in bbfe357

// A single failure makes the return value false.
succ = false

But you will never get to it if the context is done

k6/js/modules/k6/k6.go

Lines 211 to 213 in bbfe357

// Emit! (But only if we have a valid context.)
select {
case <-ctx.Done():

The context being "done" means that test is ending. But if the stars align the check will return (a true as it was never set to false) and continue executing the code before the code that interrupts the js VM runs

k6/js/runner.go

Lines 530 to 533 in bbfe357

go func() {
<-ctx.Done()
vu.Runtime.Interrupt(context.Canceled)
}()

I managed to reproduce this with even more VUs then the provided script and shorter duration and setting gracefulStop to "0s".

I do not get it every time even this way, but GOMAXPROCS=256 does make it more likely, and I managed to get this:

INFO[0010] context done                                  VU=2448
INFO[0010] checkOutput1: false checkOutput2: true 2448   source=console
INFO[0010] context done                                  VU=18682
INFO[0010] checkOutput1: false checkOutput2: true 18682  source=console
INFO[0010] context done                                  VU=18555
INFO[0010] checkOutput1: true 18555                      source=console
INFO[0010] context done                                  VU=18555
INFO[0010] checkOutput1: true checkOutput2: true 18555   source=console

Where I also log the VU in the script and added a line to log when the check code goes through the context done branch in the select above with the VU as well.

As it can be seen there is clear correlation of "context done" and then the check returning "wrongly".

This has been the case for 6 years, so I expect this is really hard to hit. Do you have a particular case where this happens a lot in a production script?

@mstoykov mstoykov removed evaluation needed proposal needs to be validated or tested before fully implementing it in k6 awaiting user waiting for user to respond labels Feb 14, 2023
@na-- na-- added this to the v0.44.0 milestone Feb 20, 2023
mstoykov added a commit that referenced this issue Mar 6, 2023
Before this check was only returning `false` if it also could emit it,
and true otherwise.

This should rarely be a problem as the same context that is being
checked here is the one that interrupts the VM - so no code will be able
to run after this.

Unfortunately the code checking if it should be emitted is racing with
the one that interrupts the VM. So it is possible for the VM to still
not be interrupted when a `check` returns a wrong value. It is even
possible for more code to be run before the interrupt is actually
called.

The code still checks the context as this also updates the internal
check structure and we don't want to that if the context is done.
The above should be changed with #2869

Fixes #2912
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants