
dotnet-stack hangs up when trying to get the stack frames of a stuck process #4826

Open
jhudsoncedaron opened this issue Jul 31, 2024 · 2 comments · May be fixed by #4996
@jhudsoncedaron

jhudsoncedaron commented Jul 31, 2024

If I had enough information to file this as a bug report, I would. It feels very much like a bug, but it might be a bug in the runtime, or something else entirely. Either way, this behavior is very bad and very unexpected.

Background:

We have a network listener process that has been getting stuck every week or so; the process runs on our server and receives (encrypted) data from a process on the customer's server. Our own internal status check on the stuck process also gets stuck, and the symptoms of the stuck state make no sense from an application codebase perspective. (Thankfully this process doesn't use async code, so the stack traces ought to make sense.)

So I said OK, let's get a stack trace next time. We looked up how to do this, found dotnet-stack, copied the standalone binary (from https://aka.ms/dotnet-stack/win-x64, a week and a half ago) to the server (it's a Server Core machine), and waited for the next time our process got stuck.

So it got stuck, as expected. I then ran dotnet-stack report --process-id 4860 and it got stuck too. In fact it got stuck so badly that ^C didn't get the command prompt back. I tried a second time, running dotnet-stack report --process-id 4860 > stack.txt and just leaving it running with the remote desktop window shoved in the background. After waiting at least 14 minutes I found it was still stuck; only this time ^C was able to get the command prompt back. As expected, the output file was empty.

The target process is an x64 .NET 8 process; its working set was 63 MB.

We have a full memory dump of the process; the managed runtime is deadlocked.

Summary:

It's possible for dotnet-stack to get stuck trying to dump the stack of a stuck process. This seems like it should not occur.

Environment:

Windows Server Core: probably Server Core 2022 but might be 2019
Hosting environment: Azure (Central)
dotnet-stack: win-x64 standalone binary
Target process: .NET 8 win-x64 process, shipped with the framework included (dotnet publish -r win-x64)

Reproducibility:

At this rate I get one attempt a week.

The stuck state does not appear to be data-related. On restarting the process, it recovers where it left off, successfully processing the very message it hung up in the middle of.

@tommcdon added this to the 10.0.0 milestone Sep 23, 2024
@noahfalk
Member

noahfalk commented Oct 9, 2024

It's possible for dotnet-stack to get stuck trying to dump the stack of a stuck process. This seems like it should not occur.

I think you talked about two different kinds of 'stuck':

  1. When you run dotnet-stack initially, it sounded like you were waiting to see it print a stack trace to the console and it wasn't doing so. dotnet-stack is a cooperative tool: it sends the .NET runtime a message over a named pipe and then waits to receive a reply. There is a dedicated thread inside the runtime that is expected to process and reply to these messages, but if the process is in a sufficiently bad state then dotnet-stack might never get a reply. So depending on the state of the process this part may not be a bug, just a consequence of the tool being cooperative rather than preemptive (a sketch of that handshake, with a bounded wait, follows below this list). If you want something that can more reliably get the state of the process even when the runtime's private message-reply thread is blocked, a debugger is a good choice.

  2. When dotnet-stack didn't print anything for a while, you used Ctrl-C, which you said also wasn't responding. I simulated a non-responsive target process and I think I have reproduced the Ctrl-C-not-aborting portion of the issue. I'm investigating a fix for that.
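To make the cooperative nature concrete, here is a minimal sketch (not dotnet-stack's actual source) of that handshake using the Microsoft.Diagnostics.NETCore.Client library, with the wait bounded by a timeout so a wedged runtime cannot hang the tool forever. The sample-profiler provider is what dotnet-stack relies on; the 30-second timeout is an arbitrary choice for illustration.

```csharp
// Minimal sketch, assuming the Microsoft.Diagnostics.NETCore.Client package.
using System;
using System.Diagnostics.Tracing;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;

class AttachSketch
{
    static async Task Main(string[] args)
    {
        // The client talks to the target over the runtime's diagnostics named pipe.
        var client = new DiagnosticsClient(int.Parse(args[0]));

        // dotnet-stack gathers stacks via the EventPipe sample profiler provider.
        var providers = new[]
        {
            new EventPipeProvider("Microsoft-DotNETCore-SampleProfiler", EventLevel.Informational)
        };

        // The runtime's dedicated diagnostics thread has to reply to this request;
        // if the process is wedged, no reply ever comes, so bound the wait.
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));
        try
        {
            using EventPipeSession session =
                await client.StartEventPipeSessionAsync(providers, true, 256, cts.Token);
            Console.WriteLine("Runtime replied; session started.");
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("No reply within 30s; the target runtime may be deadlocked.");
        }
    }
}
```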

@jhudsoncedaron
Author

"if the process was in a sufficiently bad state then dotnet-stack might never get a reply": turns out you are correct; the process in question was deadlocked in the GC (which is another discussion thread).

At least the ^C not working can be fixed.

noahfalk added a commit to noahfalk/diagnostics that referenced this issue Oct 11, 2024
Fixes dotnet#4826

When running dotnet-stack against an unresponsive target process, there were various points where dotnet-stack wouldn't correctly cancel when Ctrl-C was pressed. There were several underlying issues:
- Cancellation caused EventPipeSession.Dispose() to run, which attempted to send a Stop IPC command that might block indefinitely
- Several of the async operations dotnet-stack performed did not pass a cancellation token, and so ignored Ctrl-C when it was pressed
- The calls to start and stop the session were still using the synchronous API, which both ignored the cancellation token and created the standard async-over-sync issues.

The change in behavior for EventPipeSession.Dispose() is strictly speaking a breaking change, although callers would need to employ some dubious code patterns to observe the difference. The most likely way code could observe the difference is if thread 1 is reading from the EventStream at the same time thread 2 calls Dispose(). Previously this would have caused thread 1 to start receiving rundown events, although it was also a race condition between thread 1 reading from the stream and thread 2 disposing it. It's possible some tool could have worked successfully if thread 1 always won the race in practice. If any code was using that pattern, thread 1 will now observe that the stream is disposed without seeing the rundown events first. The proper way to ensure seeing all the rundown events is to explicitly call EventPipeSession.Stop(), then read all the remaining data until reaching the end-of-stream marker, then Dispose() the session (sketched below).

I looked through all the usage of EventPipeSession in our existing tools and it looked like all of them were already using Stop() properly.
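A minimal sketch of that recommended ordering, not code from the linked PR: wire Ctrl-C to a CancellationToken, pass the token to the async start/stop calls, and only Dispose() the session after the stream has been drained. It assumes the Microsoft.Diagnostics.NETCore.Client package plus Microsoft.Diagnostics.Tracing.TraceEvent for EventPipeEventSource.

```csharp
// Illustrative sketch of the Stop -> drain -> Dispose ordering described above.
using System;
using System.Diagnostics.Tracing;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;
using Microsoft.Diagnostics.Tracing;

class SessionShutdownSketch
{
    static async Task Main(string[] args)
    {
        // Ctrl-C cancels the token instead of killing the process outright, so the
        // IPC calls below can observe it rather than blocking indefinitely.
        using var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };

        var client = new DiagnosticsClient(int.Parse(args[0]));
        var providers = new[]
        {
            new EventPipeProvider("Microsoft-DotNETCore-SampleProfiler", EventLevel.Informational)
        };

        // The async start API accepts the cancellation token; the synchronous
        // overload would ignore it.
        using EventPipeSession session =
            await client.StartEventPipeSessionAsync(providers, true, 256, cts.Token);

        // Parse events on another task; Process() returns once it reaches the
        // end-of-stream marker that follows the rundown events.
        var source = new EventPipeEventSource(session.EventStream);
        Task reader = Task.Run(() => source.Process());

        // 1. Explicitly stop the session (this is what triggers rundown in the target).
        await session.StopAsync(cts.Token);

        // 2. Drain the stream to the end so all rundown events are observed.
        await reader;

        // 3. Only now let Dispose() run (via the using above), so it does not need
        //    to issue its own, potentially blocking, Stop command.
    }
}
```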
@noahfalk linked a pull request Oct 11, 2024 that will close this issue