
Wait on the watcher at startup instead of just releasing resources associated with it #4834

Merged · 3 commits into elastic:main from wait-on-watcher-at-startup · Jun 7, 2024

Conversation

@cmacknz (Member) commented May 31, 2024

Today, every time the agent starts the watcher it results in a zombie process on Linux, because we never wait on it. There are two places where this happens:

  1. As part of the upgrade process:

     var watcherCmd *exec.Cmd
     if watcherCmd, err = InvokeWatcher(u.log, watcherExecutable); err != nil {
         u.log.Errorw("Rolling back: starting watcher failed", "error.message", err)
         rollbackErr := rollbackInstall(ctx, u.log, paths.Top(), hashedDir, currentVersionedHome)
         return nil, goerrors.Join(err, rollbackErr)
     }

  2. At startup, in case we are starting as a result of an upgrade:

     // initiate agent watcher
     if _, err := upgrade.InvokeWatcher(l, paths.TopBinaryPath()); err != nil {
         // we should not fail because watcher is not working
         l.Error(errors.New(err, "failed to invoke rollback watcher"))
     }

I don't believe the first case is addressed by this change. This is because the agent process re-execs itself before the Wait can complete. I still have to do some experimenting to see exactly what happens in this case.

In the second case, 99% of the time the watcher exits quickly because the restart is not after an upgrade, so simply waiting will prevent the zombie. If we are rolled back, we are in a similar situation to case 1, as we may re-exec before the Wait completes.
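
For context, the fix amounts to reaping the child. Below is a minimal sketch, not the PR's actual diff, of the pattern: start the watcher, then Wait on it in a background goroutine so the kernel can discard its process table entry. The package and helper names here are purely illustrative.

package watcher

import (
	"log"
	"os/exec"
)

// startAndReap is a hypothetical helper showing the pattern: Start the
// watcher, then Wait on it in a goroutine. Without the Wait, a child that
// exits stays a zombie on Linux until the parent process itself exits.
func startAndReap(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	go func() {
		// Wait reaps the child's exit status, so no zombie entry is left behind.
		if err := cmd.Wait(); err != nil {
			log.Printf("watcher exited with error: %v", err)
		}
	}()
	return cmd, nil
}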

This PR made me read the implementation of process.Release(), which on Unix makes no system calls at all; it just cleans up resources:

https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_unix.go;l=85-91

func (p *Process) release() error {
	// NOOP for unix.
	p.Pid = -1
	// no need for a finalizer anymore
	runtime.SetFinalizer(p, nil)
	return nil
}

On Windows it makes a system call to close the handle:

https://cs.opensource.google/go/go/+/refs/tags/go1.22.3:src/os/exec_windows.go;l=65-77

func (p *Process) release() error {
	handle := atomic.SwapUintptr(&p.handle, uintptr(syscall.InvalidHandle))
	if handle == uintptr(syscall.InvalidHandle) {
		return syscall.EINVAL
	}
	e := syscall.CloseHandle(syscall.Handle(handle))
	if e != nil {
		return NewSyscallError("CloseHandle", e)
	}
	// no need for a finalizer anymore
	runtime.SetFinalizer(p, nil)
	return nil
}

The reason it is interesting that release() makes no system call is that the watcher is still a child of the agent process, meaning it dies when the agent dies, and this cannot be changed without a system call happening on Unix. This means there is a window where the agent is dead and the watcher is also dead, unable to roll back. I think we may need to use something like https://github.com/sevlyar/go-daemon?tab=readme-ov-file#how-it-works to fix this (a rough sketch of that approach follows the test snippet below).

I suspect our existing test for this case isn't precise enough: the watcher from the agent starting the upgrade would need to be killed, and the agent we upgraded to would have to exit before it launched the watcher, which is not how it works today. I don't think this PR makes anything worse, but it doesn't fix this problem either.

// A few seconds after the upgrade, deliberately restart upgraded Agent a
// couple of times to simulate Agent crashing.
for restartIdx := 0; restartIdx < 3; restartIdx++ {
	time.Sleep(10 * time.Second)
	topPath := paths.Top()

	t.Logf("Stopping agent via service to simulate crashing")
	err = install.StopService(topPath)
	if err != nil && runtime.GOOS == define.Windows && strings.Contains(err.Error(), "The service has not been started.") {
		// Due to the quick restarts every 10 seconds its possible that this is faster than Windows
		// can handle. Decrementing restartIdx means that the loop will occur again.
		t.Logf("Got an allowed error on Windows: %s", err)
		err = nil
	}
	require.NoError(t, err)

	// ensure that it's stopped before starting it again
	var status service.Status
	var statusErr error
	require.Eventuallyf(t, func() bool {
		status, statusErr = install.StatusService(topPath)
		if statusErr != nil {
			return false
		}
		return status != service.StatusRunning
	}, 2*time.Minute, 1*time.Second, "service never fully stopped (status: %v): %s", status, statusErr)
	t.Logf("Stopped agent via service to simulate crashing")

	// start it again
	t.Logf("Starting agent via service to simulate crashing")
	err = install.StartService(topPath)
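
The go-daemon approach mentioned above works roughly by having the process re-execute itself with a marker environment variable and a new session, so the copy that keeps running no longer shares the parent's lifetime. A rough Linux-only sketch of that idea follows; the marker name and helper are illustrative and not part of the elastic-agent code base.

package watcher

import (
	"os"
	"os/exec"
	"syscall"
)

// markerEnv is a hypothetical environment variable telling the re-executed
// copy of the binary that it is the detached child.
const markerEnv = "WATCHER_DETACHED"

// runDetached re-launches the current executable in a new session. It returns
// true in the detached child and false in the parent that started it.
func runDetached() (bool, error) {
	if os.Getenv(markerEnv) == "1" {
		// We are the detached copy: continue running as the watcher.
		return true, nil
	}
	cmd := exec.Command(os.Args[0], os.Args[1:]...)
	cmd.Env = append(os.Environ(), markerEnv+"=1")
	// Setsid puts the child in its own session, detaching it from the
	// parent's process group and controlling terminal.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setsid: true}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	return false, nil
}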

On Windows, child processes do not share the lifetime of their parent process unless they are explicitly assigned to the same Job object, which we do when launching component sub-processes but do not do here. This means this should work as intended under all circumstances on Windows.

// Hook to JobObject on windows, noop on other platforms.
// This ties the application processes lifespan to the agent's.
// Fixes the orphaned beats processes left behind situation
// after the agent process gets killed.
if err := JobObject.Assign(cmd.Process); err != nil {
	_ = killCmd(cmd.Process)
	return nil, fmt.Errorf("failed job assignment %q: %w", path, err)
}

I am going to do some manual testing and open an issue if I can confirm there is a window where the watcher isn't running and the agent process has already exited. Fixing that requires more work.

@cmacknz cmacknz added the Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team label May 31, 2024
@cmacknz cmacknz self-assigned this May 31, 2024
@cmacknz cmacknz requested a review from a team as a code owner May 31, 2024 20:14
@elasticmachine (Contributor) commented:

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@cmacknz cmacknz requested review from blakerouse and pchila May 31, 2024 20:15
mergify bot (Contributor) commented May 31, 2024

This pull request does not have a backport label. Could you fix it @cmacknz? 🙏
To fixup this pull request, you need to add the backport labels for the needed branches, such as:

  • backport-v\d.\d.\d is the label to automatically backport to the 8.\d branch (\d is a digit).

NOTE: backport-skip has been added to this pull request.

log.Infow("Starting upgrade watcher", "path", cmd.Path, "args", cmd.Args, "env", cmd.Env, "dir", cmd.Dir)
if err := cmd.Start(); err != nil {
	return nil, fmt.Errorf("failed to start Upgrade Watcher: %w", err)
}

upgradeWatcherPID := cmd.Process.Pid
agentPID := os.Getpid()

go func() {
@cmacknz (Member, Author) commented:
I haven't added a test for this, because there is a test in #4822 that catches this 100% of the time.

Specifically it is this block of fixture_install.go:

	f.t.Cleanup(func() {
		// check for running agents after uninstall had a chance to run
		processes := getElasticAgentProcesses(f.t)

		// there can be a single agent left when using --develop mode
		if f.installOpts != nil && f.installOpts.Develop {
			assert.LessOrEqual(f.t, len(processes), 1, "More than one agent left running at the end of the test when --develop was used: %v", processes)
			// The agent left running has to be the non-development agent. The development agent should be uninstalled first as a convention.
			if len(processes) > 0 {
				assert.NotContains(f.t, processes[0].Cmdline, paths.DevelopmentInstallDirName,
					"The agent installed with --develop was left running at the end of the test or was not uninstalled first: %v", processes)
			}
			return
		}

		assert.Empty(f.t, processes, "there should be no running agent at the end of the test")
	})

The assert.LessOrEqual(f.t, len(processes), 1, ...) assertion will fail without this change, because if there is an agent process running we always pick up 2 processes: one is the main process, and the other is the zombie created from the watcher.

That test can't be extracted from that PR since it depends on the --develop option, and I didn't think it made sense to add another separate test, but let me know and I'll come up with something. Probably an integration test that calls elastic-agent restart and checks to make sure there's only one process running afterwards.
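
For illustration only, such a test might look like the sketch below. The binary name, the 10-second settle time, and the use of gopsutil to list processes are assumptions, not existing framework code.

package integration

import (
	"os/exec"
	"strings"
	"testing"
	"time"

	"github.com/shirou/gopsutil/v3/process"
	"github.com/stretchr/testify/require"
)

// TestRestartLeavesSingleAgentProcess is a hypothetical integration test:
// restart the agent via its CLI and assert that exactly one elastic-agent
// process remains afterwards, i.e. no zombie left behind by the watcher.
func TestRestartLeavesSingleAgentProcess(t *testing.T) {
	out, err := exec.Command("elastic-agent", "restart").CombinedOutput()
	require.NoError(t, err, "restart failed: %s", out)

	// Give the agent a moment to come back up before counting processes.
	time.Sleep(10 * time.Second)

	procs, err := process.Processes()
	require.NoError(t, err)

	count := 0
	for _, p := range procs {
		name, nameErr := p.Name()
		if nameErr != nil {
			continue
		}
		if strings.Contains(name, "elastic-agent") {
			count++
		}
	}
	require.Equal(t, 1, count, "expected exactly one elastic-agent process after restart")
}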

@pchila (Member) left a comment:
This change would not make the current situation worse for the watcher launched during the upgrade, and it would avoid a zombie at agent startup if the watcher exits before the main elastic-agent process.

I still believe that, to completely solve the zombie process problem, we will end up with some C code (either written by ourselves or provided by a dependency) to properly detach the watcher, or we will have some service manager (like systemd) do it for us...

@blakerouse (Contributor) commented:
> The reason release() not making a system call is interesting is because this means that the watcher is still child of the agent process, meaning it dies when the agent dies.

Why would the watcher die if the parent dies? https://github.com/elastic/elastic-agent/blob/main/internal/pkg/agent/application/upgrade/rollback_linux.go#L41 has Pdeathsig set to SIGINT. That means that when the starting process dies, SIGINT is sent to the child (I believe that is the meaning). The watcher would then exit? If that is the case, we should not set that and it would remain. @ycombinator made a lot of adjustments with the watcher and systemd as well to ensure that the watcher stays running, so understanding the exact scenario you are speaking of, and ensuring that we fix it, is important.
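
For reference, here is a minimal Linux-only sketch (not the actual rollback_linux.go contents) of how Pdeathsig is typically attached to a child via SysProcAttr. With it set, the kernel sends the configured signal to the watcher when the thread that started it terminates, which is why the watcher would exit with its parent.

package watcher

import (
	"os/exec"
	"syscall"
)

// startWithDeathSignal starts a child that receives SIGINT when the parent
// thread that started it dies (Linux-only behaviour of Pdeathsig).
func startWithDeathSignal(path string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(path, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Ask the kernel to deliver SIGINT to the child if the parent dies.
		Pdeathsig: syscall.SIGINT,
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}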

@blakerouse (Contributor) left a comment:

Agree this improves the situation but doesn't solve the re-exec problem.

@cmacknz cmacknz merged commit e2b816b into elastic:main Jun 7, 2024
11 of 14 checks passed
@cmacknz cmacknz deleted the wait-on-watcher-at-startup branch June 7, 2024 21:35