preempt-rt: dmesg causes device hang and reboot #1165

lfdmn · 2023-02-07T09:38:20Z

Describe the bug
I'm running on dunfell branch (R32.7.3 JetPack release: 4.6.3) with premp-rt patches applied. I disabled all the fragments except for CONFIG_OVERLAY_FS=y

The dmsg command hangs the kernel and the device reboots.

journalctl -k works without problem.

I see the same behavior with:

TX2 with custom carrier
Xavier AGX devkit

To Reproduce
Steps to reproduce the behavior:

clone dunfell branch and setup the bitbake environment
devtool modify linux-tegra
go to the workspace
cd scripts
./rt-patch.sh apply-patches
build a minimal image
Deploy to hardware
type dmsg
The kernel hangs with no prints in the console
The device reboot

Does anyone see the same behavior?

The text was updated successfully, but these errors were encountered:

ichergui · 2023-02-07T11:07:14Z

Hey @lfdmn

I didn't try that myself. I can do that and get back to you.
I will investigate the issue if I can reproduce it.
I will keep you updated

madisongh · 2023-02-07T11:16:13Z

Sounds like a locking problem. The systemd journal code uses the /dev/kmsg interface, whereas dmesg uses the syslog/klogctl system call interface, which might explain why one works and the other doesn't. It looks like there were some changes in the kernel/printk code (where the system call interface lives) between the 32.7.2 and 32.7.3 kernels, and the RT patches also change quite a bit in that area. Maybe one of the updated code paths got missed getting patched?

ichergui · 2023-02-07T11:17:30Z

Thanks @madisongh for the insights

lfdmn · 2023-02-07T11:53:32Z

Thanks for the details! I will drop a message to the forum about this.

lfdmn · 2023-03-29T13:05:00Z

Hi, small update on this issue.

Stepping through the code with gdbserver is enough to make the device hang and reboot. So it is not just related to dmesg as I originally thought.

Any hints where I could start looking for this kind of locking problems? I'm not familiar with kernel development :)

dwalkes · 2023-03-29T13:19:16Z

I've had some success with lock debugging in the kernel before: https://www.kernelconfig.io/config_prove_locking?q=&kernelversion=4.9&arch=arm64

However it works by spitting out prints to dmesg so I'm not sure how it will do if dmesg is related to the locking issue. You might want to experiment with redirecting prints to the console and see if that helps.

lfdmn · 2023-03-31T08:09:47Z

I have been trying different things but I'm out of luck redirecting the printk to console.

Here are some console prints when stepping with gdbserver if that would talk to someone.

[  416.446799] Internal error: Accessing user space memory outside uaccess.h routines: 96000005 [#1] PREEMPT SMP
[  416.446809] Modules linked in: 8021q garp mrp userspace_alert nvgpu
[  416.446812] CPU: 0 PID: 8426 Comm: gdbserver Tainted: P        W  O    4.9.299-rt191-l4t-r32.7.3+g62789adae3eb #1
[  416.446814] Hardware name: quill (DT)
[  416.446815] task: ffffffc1c6430000 task.stack: ffffffc15be9c000
[  416.446822] PC is at kmem_cache_free+0x200/0x428
[  416.446825] LR is at free_uid+0x78/0x88
[  416.446827] pc : [<ffffff80083edfb8>] lr : [<ffffff80082438b8>] pstate: a0400045
[  416.446828] sp : ffffffc15be9fbd0
[  416.446830] x29: ffffffc15be9fbd0 x28: ffffffc1e375da00 
[  416.446832] x27: ffffff8008f62000 x26: 00000000400004d8 
[  416.446835] x25: ffffffc1e1115b48 x24: ffffffc15be9fe08 
[  416.446837] x23: ffffffc15be9fcdf x22: ffffff80082438b8 
[  416.446838] x21: ffffff800a0b2a40 x20: ffffffc1e9001d00 
[  416.446840] x19: 0000000000000000 x18: 0000000000000001 
[  416.446842] x17: 0000000000000000 x16: ffffffc1c6430000 
[  416.446844] x15: ffffffffffffffff x14: 0000000000000000 
[  416.446845] x13: 0000000000000000 x12: 0000000000000000 
[  416.446847] x11: 0000000000000000 x10: 0000000000000000 
[  416.446849] x9 : 0000000000000000 x8 : 0000000000000000 
[  416.446851] x7 : ffffff80082ab62c x6 : 0000000000000000 
[  416.446852] x5 : 0000000000000000 x4 : 0000000000000001 
[  416.446854] x3 : 0000000000000000 x2 : ffffffbf00282c9f 
[  416.446856] x1 : ffffffbf00282ca0 x0 : ffffffbf00282c80 
[  416.446856] 
[  416.446858] Process gdbserver (pid: 8426, stack limit = 0xffffffc15be9c000)
[  416.446859] Call trace:
[  416.446862] [<ffffff80083edfb8>] kmem_cache_free+0x200/0x428
[  416.446864] [<ffffff80082438b8>] free_uid+0x78/0x88
[  416.446867] [<ffffff8008244a88>] __sigqueue_free.part.2+0x58/0x78
[  416.446869] [<ffffff8008244c6c>] collect_signal+0x10c/0x1c0
[  416.446870] [<ffffff8008245178>] dequeue_signal+0x130/0x210
[  416.446872] [<ffffff8008248e00>] get_signal+0x198/0xc10
[  416.446875] [<ffffff800820aff0>] do_signal+0xe0/0x5c8
[  416.446878] [<ffffff800820b690>] do_notify_resume+0xc8/0xe0
[  416.446880] [<ffffff8008202ff0>] work_pending+0x8/0x14
[  416.643337] ---[ end trace 0000000000000003 ]---
[  416.650377] ------------[ cut here ]------------
[  416.650379] kernel BUG at kernel/locking/rtmutex.c:1070!
[  416.650382] Internal error: Oops - BUG: 0 [#2] PREEMPT SMP
[  416.650392] Modules linked in: 8021q garp mrp userspace_alert nvgpu
[  416.650395] CPU: 0 PID: 8426 Comm: gdbserver Tainted: P      D W  O    4.9.299-rt191-l4t-r32.7.3+g62789adae3eb #1
[  416.650397] Hardware name: quill (DT)
[  416.650398] task: ffffffc1c6430000 task.stack: ffffffc15be9c000
[  416.650406] PC is at rt_spin_lock_slowlock+0x314/0x340
[  416.650408] LR is at rt_spin_lock_slowlock+0x68/0x340
[  416.650410] pc : [<ffffff8008f4f53c>] lr : [<ffffff8008f4f290>] pstate: 604000c5
[  416.650411] sp : ffffffc15be9f7a0
[  416.650413] x29: ffffffc15be9f7a0 x28: ffffffc1c6430000 
[  416.650415] x27: 0000000000000040 x26: 0000000000000007 
[  416.650417] x25: 0000000000000025 x24: 0000000000000001 
[  416.650419] x23: 0000000000000140 x22: 000000000000000b 
[  416.650421] x21: ffffffc1c6430000 x20: ffffffc15be9f800 
[  416.650422] x19: ffffffc1e1115b48 x18: 0000000000000008 
[  416.650424] x17: 0000000000000000 x16: ffffffc1c6430000 
[  416.650426] x15: 0000000000001fef x14: 000000000288b8db 
[  416.650428] x13: 000000000288b8db x12: 0000000005d3a0cf 
[  416.650429] x11: 0000000000000001 x10: 00000000000014a0 
[  416.650431] x9 : ffffffc15be9f830 x8 : 000000000000004b 
[  416.650433] x7 : ffffffc1c6430000 x6 : ffffff8008f4f274 
[  416.650434] x5 : 0000000000000000 x4 : 0000000000000001 
[  416.650436] x3 : ffffffc1e1115b90 x2 : ffffffc1c6430001 
[  416.650437] x1 : ffffffc1c6430000 x0 : ffffffc1c6430000 
[  416.650438] 
[  416.650440] Process gdbserver (pid: 8426, stack limit = 0xffffffc15be9c000)
[  416.650440] Call trace:
[  416.650444] [<ffffff8008f4f53c>] rt_spin_lock_slowlock+0x314/0x340
[  416.650447] [<ffffff8008f517d4>] rt_spin_lock+0x74/0x78
[  416.650452] [<ffffff80082f9fc4>] acct_collect+0x8c/0x170
[  416.650455] [<ffffff80082378b0>] do_exit+0x290/0xba0
[  416.650458] [<ffffff800820c58c>] die+0x194/0x198
[  416.650461] [<ffffff8008223bbc>] do_page_fault+0x15c/0x598
[  416.650463] [<ffffff8008224064>] do_translation_fault+0x6c/0x80
[  416.650465] [<ffffff8008200954>] do_mem_abort+0x54/0xa8
[  416.650466] [<ffffff8008202104>] el1_da+0x24/0xbc
[  416.650469] [<ffffff80082438b8>] free_uid+0x78/0x88
[  416.650471] [<ffffff8008244a88>] __sigqueue_free.part.2+0x58/0x78
[  416.650472] [<ffffff8008244c6c>] collect_signal+0x10c/0x1c0
[  416.650474] [<ffffff8008245178>] dequeue_signal+0x130/0x210
[  416.650476] [<ffffff8008248e00>] get_signal+0x198/0xc10
[  416.650478] [<ffffff800820aff0>] do_signal+0xe0/0x5c8
[  416.650479] [<ffffff800820b690>] do_notify_resume+0xc8/0xe0
[  416.650481] [<ffffff8008202ff0>] work_pending+0x8/0x14
[  416.900627] ---[ end trace 0000000000000004 ]---
[  416.906089] note: gdbserver[8426] exited with preempt_count 1
[  416.906091] Fixing recursive fault but reboot is needed!

rtmutex.c:1070 points to some mutex ownership issue:

/*
 * Slow path lock function spin_lock style: this variant is very
 * careful not to miss any non-lock wakeups.
 *
 * We store the current state under p->pi_lock in p->saved_state and
 * the try_to_wake_up() code handles this accordingly.
 */
static void  noinline __sched rt_spin_lock_slowlock(struct rt_mutex *lock,
						    bool mg_off)
{
	struct task_struct *lock_owner, *self = current;
	struct rt_mutex_waiter waiter, *top_waiter;
	unsigned long flags;
	int ret;

	rt_mutex_init_waiter(&waiter, true);

	raw_spin_lock_irqsave(&lock->wait_lock, flags);

	if (__try_to_take_rt_mutex(lock, self, NULL, STEAL_LATERAL)) {
		raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
		return;
	}

	BUG_ON(rt_mutex_owner(lock) == self); // <------------------------

madisongh · 2023-07-05T11:29:21Z

@lfdmn Can you try with the R32.7.4 kernel? I see NVIDIA added another patch in the printk code to work around a locking issue, maybe that will help?

lfdmn · 2023-07-06T10:17:54Z

Thanks @madisongh.

I gave a shot at compiling the demo distro demo-image-base image and now systemd gets blocked at boot and I never get to the shell.

If someone is interrested, here are the logs:
systemd_blocked.txt

For the dmesg case NVidia has made a hack to disable do_syslog SYSLOG_ACTION_READ_ALL https://forums.developer.nvidia.com/t/jetpack-4-6-3-dmesg-hangs-the-kernel-and-the-device-reboots/241954/20?u=damien.lefevre
This ack is applied now in R32.7.4 /* WAR: Bug 3991287 */

I had tried it with meta-tegra and dmesg was reporting nothing. With the Ubuntu distro there is an output, I'm not sure why would that be.

With R32.7.3 I found other locks with gdbserver (reboot) and smartcl (no reboot but losing disk access).

Here is the current topic on the forum https://forums.developer.nvidia.com/t/jetpack-4-6-3-preempt-rt-patkernel-reboot-loop/257804/

ichergui · 2023-08-27T20:09:08Z

Hey @lfdmn
Any update on this ?

lfdmn · 2024-01-17T13:46:44Z

Hi,

Reviving an old issue

The lock in tegra demo distro is related the nvpmodel deadlocking during the boot. One workaround is to set nvpmodel to MAXN
NVPMODEL_CONFIG_DEFAULT = 0

Hot plugging CPU isn't stable https://forums.developer.nvidia.com/t/r32-7-1-4-9-253-rt168-info-possible-circular-locking-dependency-detected-nvpmodel-all-q-mutex-hp-lock/221391/9

But the lock may still appear in different places depending on timings and scheduled work.

The kernel has been fine up to 4.6.1. This is something new that came with the introduction of the hynix memory type.

I've tried using the latest JetPack and revert to 4.6.1 kernel but this is not possible. Loading the gpu kernel module fails. More than the memory has changed

I wonder if I have been flagged on the forums but I don't get any answers anymore :)

Does anyone have a way to reach out to NVidia to get their attention or would be able or know someone able to help debugging the patches?

lfdmn · 2024-02-14T10:37:58Z

This PR OE4T/linux-tegra-4.9#43 addresses the issue.

madisongh closed this as completed Mar 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

preempt-rt: dmesg causes device hang and reboot #1165

preempt-rt: dmesg causes device hang and reboot #1165

lfdmn commented Feb 7, 2023

ichergui commented Feb 7, 2023

madisongh commented Feb 7, 2023

ichergui commented Feb 7, 2023

lfdmn commented Feb 7, 2023

lfdmn commented Mar 29, 2023

dwalkes commented Mar 29, 2023

lfdmn commented Mar 31, 2023

madisongh commented Jul 5, 2023

lfdmn commented Jul 6, 2023

ichergui commented Aug 27, 2023

lfdmn commented Jan 17, 2024

lfdmn commented Feb 14, 2024

preempt-rt: dmesg causes device hang and reboot #1165

preempt-rt: dmesg causes device hang and reboot #1165

Comments

lfdmn commented Feb 7, 2023

ichergui commented Feb 7, 2023

madisongh commented Feb 7, 2023

ichergui commented Feb 7, 2023

lfdmn commented Feb 7, 2023

lfdmn commented Mar 29, 2023

dwalkes commented Mar 29, 2023

lfdmn commented Mar 31, 2023

madisongh commented Jul 5, 2023

lfdmn commented Jul 6, 2023

ichergui commented Aug 27, 2023

lfdmn commented Jan 17, 2024

lfdmn commented Feb 14, 2024