Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preempt-rt: dmesg causes device hang and reboot #1165

Closed
lfdmn opened this issue Feb 7, 2023 · 12 comments
Closed

preempt-rt: dmesg causes device hang and reboot #1165

lfdmn opened this issue Feb 7, 2023 · 12 comments

Comments

@lfdmn
Copy link
Contributor

lfdmn commented Feb 7, 2023

Describe the bug
I'm running on dunfell branch (R32.7.3 JetPack release: 4.6.3) with premp-rt patches applied. I disabled all the fragments except for CONFIG_OVERLAY_FS=y

The dmsg command hangs the kernel and the device reboots.

journalctl -k works without problem.

I see the same behavior with:

  • TX2 with custom carrier
  • Xavier AGX devkit

To Reproduce
Steps to reproduce the behavior:

  1. clone dunfell branch and setup the bitbake environment
  2. devtool modify linux-tegra
  3. go to the workspace
  4. cd scripts
  5. ./rt-patch.sh apply-patches
  6. build a minimal image
  7. Deploy to hardware
  8. type dmsg
  9. The kernel hangs with no prints in the console
  10. The device reboot

Does anyone see the same behavior?

@ichergui
Copy link
Member

ichergui commented Feb 7, 2023

Hey @lfdmn

I didn't try that myself. I can do that and get back to you.
I will investigate the issue if I can reproduce it.
I will keep you updated

@madisongh
Copy link
Member

Sounds like a locking problem. The systemd journal code uses the /dev/kmsg interface, whereas dmesg uses the syslog/klogctl system call interface, which might explain why one works and the other doesn't. It looks like there were some changes in the kernel/printk code (where the system call interface lives) between the 32.7.2 and 32.7.3 kernels, and the RT patches also change quite a bit in that area. Maybe one of the updated code paths got missed getting patched?

@ichergui
Copy link
Member

ichergui commented Feb 7, 2023

Thanks @madisongh for the insights

@lfdmn
Copy link
Contributor Author

lfdmn commented Feb 7, 2023

Thanks for the details! I will drop a message to the forum about this.

@lfdmn
Copy link
Contributor Author

lfdmn commented Mar 29, 2023

Hi, small update on this issue.

Stepping through the code with gdbserver is enough to make the device hang and reboot. So it is not just related to dmesg as I originally thought.

Any hints where I could start looking for this kind of locking problems? I'm not familiar with kernel development :)

@dwalkes
Copy link
Member

dwalkes commented Mar 29, 2023

I've had some success with lock debugging in the kernel before: https://www.kernelconfig.io/config_prove_locking?q=&kernelversion=4.9&arch=arm64

However it works by spitting out prints to dmesg so I'm not sure how it will do if dmesg is related to the locking issue. You might want to experiment with redirecting prints to the console and see if that helps.

@lfdmn
Copy link
Contributor Author

lfdmn commented Mar 31, 2023

I have been trying different things but I'm out of luck redirecting the printk to console.

Here are some console prints when stepping with gdbserver if that would talk to someone.

[  416.446799] Internal error: Accessing user space memory outside uaccess.h routines: 96000005 [#1] PREEMPT SMP
[  416.446809] Modules linked in: 8021q garp mrp userspace_alert nvgpu
[  416.446812] CPU: 0 PID: 8426 Comm: gdbserver Tainted: P        W  O    4.9.299-rt191-l4t-r32.7.3+g62789adae3eb #1
[  416.446814] Hardware name: quill (DT)
[  416.446815] task: ffffffc1c6430000 task.stack: ffffffc15be9c000
[  416.446822] PC is at kmem_cache_free+0x200/0x428
[  416.446825] LR is at free_uid+0x78/0x88
[  416.446827] pc : [<ffffff80083edfb8>] lr : [<ffffff80082438b8>] pstate: a0400045
[  416.446828] sp : ffffffc15be9fbd0
[  416.446830] x29: ffffffc15be9fbd0 x28: ffffffc1e375da00 
[  416.446832] x27: ffffff8008f62000 x26: 00000000400004d8 
[  416.446835] x25: ffffffc1e1115b48 x24: ffffffc15be9fe08 
[  416.446837] x23: ffffffc15be9fcdf x22: ffffff80082438b8 
[  416.446838] x21: ffffff800a0b2a40 x20: ffffffc1e9001d00 
[  416.446840] x19: 0000000000000000 x18: 0000000000000001 
[  416.446842] x17: 0000000000000000 x16: ffffffc1c6430000 
[  416.446844] x15: ffffffffffffffff x14: 0000000000000000 
[  416.446845] x13: 0000000000000000 x12: 0000000000000000 
[  416.446847] x11: 0000000000000000 x10: 0000000000000000 
[  416.446849] x9 : 0000000000000000 x8 : 0000000000000000 
[  416.446851] x7 : ffffff80082ab62c x6 : 0000000000000000 
[  416.446852] x5 : 0000000000000000 x4 : 0000000000000001 
[  416.446854] x3 : 0000000000000000 x2 : ffffffbf00282c9f 
[  416.446856] x1 : ffffffbf00282ca0 x0 : ffffffbf00282c80 
[  416.446856] 
[  416.446858] Process gdbserver (pid: 8426, stack limit = 0xffffffc15be9c000)
[  416.446859] Call trace:
[  416.446862] [<ffffff80083edfb8>] kmem_cache_free+0x200/0x428
[  416.446864] [<ffffff80082438b8>] free_uid+0x78/0x88
[  416.446867] [<ffffff8008244a88>] __sigqueue_free.part.2+0x58/0x78
[  416.446869] [<ffffff8008244c6c>] collect_signal+0x10c/0x1c0
[  416.446870] [<ffffff8008245178>] dequeue_signal+0x130/0x210
[  416.446872] [<ffffff8008248e00>] get_signal+0x198/0xc10
[  416.446875] [<ffffff800820aff0>] do_signal+0xe0/0x5c8
[  416.446878] [<ffffff800820b690>] do_notify_resume+0xc8/0xe0
[  416.446880] [<ffffff8008202ff0>] work_pending+0x8/0x14
[  416.643337] ---[ end trace 0000000000000003 ]---
[  416.650377] ------------[ cut here ]------------
[  416.650379] kernel BUG at kernel/locking/rtmutex.c:1070!
[  416.650382] Internal error: Oops - BUG: 0 [#2] PREEMPT SMP
[  416.650392] Modules linked in: 8021q garp mrp userspace_alert nvgpu
[  416.650395] CPU: 0 PID: 8426 Comm: gdbserver Tainted: P      D W  O    4.9.299-rt191-l4t-r32.7.3+g62789adae3eb #1
[  416.650397] Hardware name: quill (DT)
[  416.650398] task: ffffffc1c6430000 task.stack: ffffffc15be9c000
[  416.650406] PC is at rt_spin_lock_slowlock+0x314/0x340
[  416.650408] LR is at rt_spin_lock_slowlock+0x68/0x340
[  416.650410] pc : [<ffffff8008f4f53c>] lr : [<ffffff8008f4f290>] pstate: 604000c5
[  416.650411] sp : ffffffc15be9f7a0
[  416.650413] x29: ffffffc15be9f7a0 x28: ffffffc1c6430000 
[  416.650415] x27: 0000000000000040 x26: 0000000000000007 
[  416.650417] x25: 0000000000000025 x24: 0000000000000001 
[  416.650419] x23: 0000000000000140 x22: 000000000000000b 
[  416.650421] x21: ffffffc1c6430000 x20: ffffffc15be9f800 
[  416.650422] x19: ffffffc1e1115b48 x18: 0000000000000008 
[  416.650424] x17: 0000000000000000 x16: ffffffc1c6430000 
[  416.650426] x15: 0000000000001fef x14: 000000000288b8db 
[  416.650428] x13: 000000000288b8db x12: 0000000005d3a0cf 
[  416.650429] x11: 0000000000000001 x10: 00000000000014a0 
[  416.650431] x9 : ffffffc15be9f830 x8 : 000000000000004b 
[  416.650433] x7 : ffffffc1c6430000 x6 : ffffff8008f4f274 
[  416.650434] x5 : 0000000000000000 x4 : 0000000000000001 
[  416.650436] x3 : ffffffc1e1115b90 x2 : ffffffc1c6430001 
[  416.650437] x1 : ffffffc1c6430000 x0 : ffffffc1c6430000 
[  416.650438] 
[  416.650440] Process gdbserver (pid: 8426, stack limit = 0xffffffc15be9c000)
[  416.650440] Call trace:
[  416.650444] [<ffffff8008f4f53c>] rt_spin_lock_slowlock+0x314/0x340
[  416.650447] [<ffffff8008f517d4>] rt_spin_lock+0x74/0x78
[  416.650452] [<ffffff80082f9fc4>] acct_collect+0x8c/0x170
[  416.650455] [<ffffff80082378b0>] do_exit+0x290/0xba0
[  416.650458] [<ffffff800820c58c>] die+0x194/0x198
[  416.650461] [<ffffff8008223bbc>] do_page_fault+0x15c/0x598
[  416.650463] [<ffffff8008224064>] do_translation_fault+0x6c/0x80
[  416.650465] [<ffffff8008200954>] do_mem_abort+0x54/0xa8
[  416.650466] [<ffffff8008202104>] el1_da+0x24/0xbc
[  416.650469] [<ffffff80082438b8>] free_uid+0x78/0x88
[  416.650471] [<ffffff8008244a88>] __sigqueue_free.part.2+0x58/0x78
[  416.650472] [<ffffff8008244c6c>] collect_signal+0x10c/0x1c0
[  416.650474] [<ffffff8008245178>] dequeue_signal+0x130/0x210
[  416.650476] [<ffffff8008248e00>] get_signal+0x198/0xc10
[  416.650478] [<ffffff800820aff0>] do_signal+0xe0/0x5c8
[  416.650479] [<ffffff800820b690>] do_notify_resume+0xc8/0xe0
[  416.650481] [<ffffff8008202ff0>] work_pending+0x8/0x14
[  416.900627] ---[ end trace 0000000000000004 ]---
[  416.906089] note: gdbserver[8426] exited with preempt_count 1
[  416.906091] Fixing recursive fault but reboot is needed!

rtmutex.c:1070 points to some mutex ownership issue:

/*
 * Slow path lock function spin_lock style: this variant is very
 * careful not to miss any non-lock wakeups.
 *
 * We store the current state under p->pi_lock in p->saved_state and
 * the try_to_wake_up() code handles this accordingly.
 */
static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock,
						    bool mg_off)
{
	struct task_struct *lock_owner, *self = current;
	struct rt_mutex_waiter waiter, *top_waiter;
	unsigned long flags;
	int ret;

	rt_mutex_init_waiter(&waiter, true);

	raw_spin_lock_irqsave(&lock->wait_lock, flags);

	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
		return;
	}

	BUG_ON(rt_mutex_owner(lock) == self); // <------------------------

@madisongh
Copy link
Member

@lfdmn Can you try with the R32.7.4 kernel? I see NVIDIA added another patch in the printk code to work around a locking issue, maybe that will help?

@lfdmn
Copy link
Contributor Author

lfdmn commented Jul 6, 2023

Thanks @madisongh.

I gave a shot at compiling the demo distro demo-image-base image and now systemd gets blocked at boot and I never get to the shell.

If someone is interrested, here are the logs:
systemd_blocked.txt

For the dmesg case NVidia has made a hack to disable do_syslog SYSLOG_ACTION_READ_ALL https://forums.developer.nvidia.com/t/jetpack-4-6-3-dmesg-hangs-the-kernel-and-the-device-reboots/241954/20?u=damien.lefevre
This ack is applied now in R32.7.4 /* WAR: Bug 3991287 */

I had tried it with meta-tegra and dmesg was reporting nothing. With the Ubuntu distro there is an output, I'm not sure why would that be.

With R32.7.3 I found other locks with gdbserver (reboot) and smartcl (no reboot but losing disk access).

Here is the current topic on the forum https://forums.developer.nvidia.com/t/jetpack-4-6-3-preempt-rt-patkernel-reboot-loop/257804/

@ichergui
Copy link
Member

Hey @lfdmn
Any update on this ?

@lfdmn
Copy link
Contributor Author

lfdmn commented Jan 17, 2024

Hi,

Reviving an old issue

The lock in tegra demo distro is related the nvpmodel deadlocking during the boot. One workaround is to set nvpmodel to MAXN
NVPMODEL_CONFIG_DEFAULT = 0

Hot plugging CPU isn't stable https://forums.developer.nvidia.com/t/r32-7-1-4-9-253-rt168-info-possible-circular-locking-dependency-detected-nvpmodel-all-q-mutex-hp-lock/221391/9

But the lock may still appear in different places depending on timings and scheduled work.

The kernel has been fine up to 4.6.1. This is something new that came with the introduction of the hynix memory type.

I've tried using the latest JetPack and revert to 4.6.1 kernel but this is not possible. Loading the gpu kernel module fails. More than the memory has changed

I wonder if I have been flagged on the forums but I don't get any answers anymore :)

Does anyone have a way to reach out to NVidia to get their attention or would be able or know someone able to help debugging the patches?

@lfdmn
Copy link
Contributor Author

lfdmn commented Feb 14, 2024

This PR OE4T/linux-tegra-4.9#43 addresses the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants