Kitty on gentoo segfaulting with nvidia opengl #5662

Closed
hyegeek opened this issue Nov 12, 2022 · 11 comments
@hyegeek

hyegeek commented Nov 12, 2022

Describe the bug
After some recent updates to my system, kitty started segfaulting. After a lot of frustrating debugging, I've narrowed it down to the systems with nvidia cards; similar systems with an ATI card do not show the problem. The segfault occurs shortly after the window is created, i.e. you see a window flash up and then disappear.

To further complicate the debugging, running kitty under strace or gdb results in kitty working fine, so I can't use those to figure out what is causing the segfault.

I don't know which package updates set things off, but I have one system where I have rolled back to the weekend's snapshots to get things working again and another system that I've left broken to try to debug things and get to the bottom of it.

To Reproduce
Steps to reproduce the behavior:

  1. run kitty

Screenshots
NA

Environment details
gentoo
kitty 0.26.4 and 0.26.5 have both been tried
mesa 22.1.7 and 22.2.3 have both been tried
what else can I provide to help to get to the bottom of this?

Additional context
Same result with kitty --config NONE

@hyegeek hyegeek added the bug label Nov 12, 2022
@hyegeek
Author

hyegeek commented Nov 12, 2022

A bit more information. The kernel log shows

kitty[30393]: segfault at 0 ip 0000000000000000 sp 00007fff4f3792e8 error 14

That appears to be

error 14 : attempt to execute code from an unmapped area.

If I could figure out what library it was in when this happens, I could probably get to the bottom of things, but as I mentioned before, any kinds of debugging seems to cause things to work without error.
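As a rough illustration of how the kernel line above can be read, here is a minimal Python sketch that decodes it. It assumes the x86 page-fault error-code bit layout and that the kernel prints the error code in hexadecimal, so `error 14` is 0x14. The names `PF_BITS`, `SEGFAULT_RE` and `decode_segfault` are made up for this example.

```python
import re

# x86 page-fault error-code bits: (meaning when clear, meaning when set).
# None means "nothing worth reporting" for that state.
PF_BITS = {
    0: ("page not present", "protection violation"),
    1: ("not a write", "write"),
    2: ("kernel mode", "user mode"),
    3: (None, "reserved bit set in page table"),
    4: (None, "instruction fetch"),
}

# Matches the "segfault at ... ip ... sp ... error ..." part of the log line.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Return the faulting address, instruction pointer and decoded flags."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    err = int(m["err"], 16)  # assumption: the kernel prints this field in hex
    flags = []
    for bit, (when_clear, when_set) in PF_BITS.items():
        text = when_set if err & (1 << bit) else when_clear
        if text:
            flags.append(text)
    return {
        "addr": int(m["addr"], 16),
        "ip": int(m["ip"], 16),
        "error": err,
        "flags": flags,
    }
```

Fed the line from this report, the sketch reports a user-mode instruction fetch from a page that is not present, with both the faulting address and the instruction pointer equal to 0. That fits the "execute code from an unmapped area" reading: something jumped to address 0.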

@kovidgoyal
Owner

Simply roll back your nvidia GPU drivers to find the problem version.

And you don't need to run with strace or gdb to get a stack trace; you can
use coredumpctl for it. For best results, though, you should build kitty
from source with

make debug

Although I am 99% certain this crash will not be in kitty code but in
the GPU drivers, so you would really need to build those with debug
symbols as well.

Unfortunately I don't own any nvidia hardware, so I can't help you with it.

@hyegeek
Author

hyegeek commented Nov 13, 2022

My distro does not currently have the older version available since they had security issues. Also, from working to not working, no nvidia upgrade was done, so it is some other interaction. I've been going through a list of the software that was updated and I have yet to find one that I can downgrade to fix the issue.

I don't have coredumpctl on my system, but you are right, I should be able to tell the system to dump a core and debug from there. Thanks.
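For readers in the same situation (no coredumpctl), a minimal Python sketch of the idea follows: raise the core-file size limit for a child process and report whether it was killed by a signal. The helper names `run_with_core_dumps` and `describe_exit` are hypothetical. Where the core file ends up, and whether one is written at all, depends on the system's `core_pattern` setting and the hard limit, neither of which this sketch changes.

```python
import resource
import signal
import subprocess
import sys


def _allow_core_dumps():
    # Runs in the child just before exec: raise the soft core-size
    # limit as far as the hard limit allows.
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(argv):
    """Run argv with core dumps enabled and return its return code."""
    return subprocess.run(argv, preexec_fn=_allow_core_dumps).returncode


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable string."""
    if returncode < 0:
        # subprocess reports death by signal N as return code -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"
```

Calling `describe_exit(run_with_core_dumps(["kitty"]))` would then show whether kitty died from SIGSEGV, and any resulting core file can be opened in gdb together with the kitty binary.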

I agree it is most likely not in kitty code directly, but I'm having trouble finding a suspect to go after.

@hyegeek
Author

hyegeek commented Nov 13, 2022

Argh! It is in /usr/lib64/libnvidia-tls.so.390.154, but that is both working and not working depending on other updates.

The thing right before it on the stack is kitty/fast_data_types.so. Any idea on what it would be trying to do? That could help me identify the culprit.

@kovidgoyal
Owner

Build kitty from source with make debug and the stack trace will tell
you exactly what it was doing.

@hyegeek
Author

hyegeek commented Nov 13, 2022

Thanks for helping. The debug shows it running

ret = pthread_create(&self->io_thread, NULL, io_loop, self);

I don't think I'm understanding what I'm seeing. Does the following suggest any suspects to you?

#0 0x0000000000000000 in ?? ()
#1 0x00007f2a0aa011e7 in ?? () from /usr/lib64/libnvidia-tls.so.390.154
#2 0x00007f2a0d1ece2b in start (s=<fast_data_types.ChildMonitor at remote 0x7f2a0dbe2fb0>, a=) at kitty/child-monitor.c:217
#3 0x00007f2a1b4ba3cb in ?? () from /usr/lib64/libpython3.10.so.1.0
#4 0x00007f2a1b45748b in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#5 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#6 0x00007f2a1b45748b in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#7 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#8 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#9 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#10 0x00007f2a1b4af717 in _PyObject_FastCallDictTstate () from /usr/lib64/libpython3.10.so.1.0
#11 0x00007f2a1b4afa50 in _PyObject_Call_Prepend () from /usr/lib64/libpython3.10.so.1.0
#12 0x00007f2a1b5183e1 in ?? () from /usr/lib64/libpython3.10.so.1.0
#13 0x00007f2a1b4af5b1 in _PyObject_MakeTpCall () from /usr/lib64/libpython3.10.so.1.0
#14 0x00007f2a1b45a881 in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#15 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#16 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#17 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#18 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#19 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#20 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#21 0x00007f2a1b5a0276 in PyEval_EvalCode () from /usr/lib64/libpython3.10.so.1.0
#22 0x00007f2a1b59a61d in ?? () from /usr/lib64/libpython3.10.so.1.0
#23 0x00007f2a1b4f7f0f in ?? () from /usr/lib64/libpython3.10.so.1.0
#24 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#25 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#26 0x00007f2a1b4588fd in _PyEval_EvalFrameDefault () from /usr/lib64/libpython3.10.so.1.0
#27 0x00007f2a1b5a03f0 in ?? () from /usr/lib64/libpython3.10.so.1.0
#28 0x00007f2a1b606557 in ?? () from /usr/lib64/libpython3.10.so.1.0
#29 0x00007f2a1b606ef7 in Py_RunMain () from /usr/lib64/libpython3.10.so.1.0
#30 0x0000562144259c9b in run_embedded (run_data=0x7ffef45a99c0) at kitty/launcher/main.c:203
#31 0x000056214425a213 in main (argc=1, argv=0x7ffef45acb48, envp=0x7ffef45acb58) at kitty/launcher/main.c:338
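A backtrace like the one above can be summarized mechanically. The following Python sketch parses gdb-style frame lines and picks out the innermost frame that gdb could attribute to a shared library or a source file. It only handles the line shapes seen in this trace (every frame has an address), and `Frame`, `parse_backtrace` and `innermost_located` are names invented for the example.

```python
import re
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    index: int
    address: int
    function: str
    library: Optional[str]  # from the "from <path>" suffix, if any
    source: Optional[str]   # from the "at <file>:<line>" suffix, if any


# "#N 0xADDR in FUNC (ARGS)" with an optional " from LIB" or " at FILE:LINE".
FRAME_RE = re.compile(
    r"^#(\d+)\s+(0x[0-9a-f]+) in (\S+) \(.*\)"
    r"(?: from (\S+))?(?: at (\S+))?\s*$"
)


def parse_backtrace(text):
    """Parse gdb backtrace text into a list of Frame tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:
            frames.append(Frame(int(m[1]), int(m[2], 16), m[3], m[4], m[5]))
    return frames


def innermost_located(frames):
    """Return the first frame that names a library or source file."""
    for frame in frames:
        if frame.library or frame.source:
            return frame
    return None


# The top three frames from the trace in this comment.
SAMPLE = """\
#0 0x0000000000000000 in ?? ()
#1 0x00007f2a0aa011e7 in ?? () from /usr/lib64/libnvidia-tls.so.390.154
#2 0x00007f2a0d1ece2b in start (s=<fast_data_types.ChildMonitor at remote 0x7f2a0dbe2fb0>, a=) at kitty/child-monitor.c:217
"""
```

On the sample, frame #0 sits at address 0 with no known origin, and the innermost frame with a known origin is #1 in libnvidia-tls, called while kitty was in `start` at kitty/child-monitor.c:217.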

@kovidgoyal
Owner

libnvidia-tls almost certainly does some management of thread local
state (tls is likely an acronym for Thread Local Storage). pthread_create()
creates a new thread. The connection seems fairly straightforward. Some
bit of state somewhere in the nvidia driver is incorrect and
that is causing the crash when a new thread is created because some
invariant libnvidia-tls expects is not satisfied, or there is some bug in
libnvidia-tls that some other change in the nvidia driver stack is
exposing. I don't use nvidia so I can't offer you more specific insight
than that.
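For readers unfamiliar with the term, thread-local state means that each thread gets its own private copy of some data, which has to be set up when the thread starts. The Python sketch below is only an analogy using `threading.local`; it says nothing about how libnvidia-tls is actually implemented, and `worker` and `run_workers` are illustrative names.

```python
import threading

# Each thread that touches this object sees its own independent attributes.
tls = threading.local()


def worker(name, seen):
    # This assignment is visible only inside the current thread.
    tls.owner = name
    seen[name] = tls.owner


def run_workers(names):
    """Start one thread per name; return what each saw and whether
    the calling thread ended up with an 'owner' attribute."""
    seen = {}
    threads = [threading.Thread(target=worker, args=(n, seen)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen, hasattr(tls, "owner")
```

Every worker reads back its own name, while the calling thread never sees an `owner` attribute. A native library doing this kind of per-thread setup at thread creation is where a broken assumption could surface as a crash inside pthread_create().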

@hyegeek
Author

hyegeek commented Nov 13, 2022

Thanks again for your help, you've given me a few more things to look into.

However, it looks like I'm probably going to have to wait for an nvidia update. At least my work machine uses something else. Just another reason to continue replacing the nvidia cards I've been using. I miss the day when they were a sure bet.

@ionenwks

ionenwks commented Nov 15, 2022

> However, it looks like I'm probably going to have to wait for an nvidia update. At least my work machine uses something else. Just another reason to continue replacing the nvidia cards I've been using. I miss the day when they were a sure bet.

That's unlikely, and even if there is one last release it probably won't address much: NVIDIA is dropping support for the 390.x branch that you're using next month. It's also the only branch with the libnvidia-tls double version nonsense (Edit: one breaks xorg drivers, the other breaks other things).

In Gentoo the 390 drivers will be masked with a security notice sometime next year (I can say that because I'm the maintainer for it), albeit still kept for as long as they kinda work. I really wouldn't expect much support for 390, though; your only real options are new hardware or nouveau.

@hyegeek
Copy link
Author

hyegeek commented Nov 15, 2022

Thanks. That's good to know. I guess it's time for that ATI card I've been looking at.

@merrittlj

merrittlj commented Mar 5, 2024

I've been experiencing a similar fate with the most recent kitty build in Portage (Gentoo). I use the same 390.x nvidia drivers and can confirm that the segmentation fault does not happen when running under gdb. My backtrace looked similar to what has been shown in this thread but, surprisingly, I was able to just build kitty from source (not in debug mode), and the resulting binary now runs fine without any segmentation faults.
