zebra: Skip route table lookup if zvrf is NULL #16858

jyuan-panw · 2024-09-18T17:28:00Z

A crash was observed in zebra 8.4.4 while FRR processes were being shut down by an automated script:

Thread 1 (LWP 16051):
#0 0x00007f63f449bb8f in raise () from /lib64/libpthread.so.0
#1 0x00007f63f5c68300 in core_handler (signo=11, siginfo=0x7ffec6322cb0, context=<optimized out>) at lib/sigevent.c:261
#2 <signal handler called>
#3 zebra_router_get_table (zvrf=zvrf@entry=0x0, tableid=tableid@entry=254, afi=afi@entry=AFI_IP, safi=safi@entry=SAFI_UNICAST) at /usr/include/bits/string_fortified.h:71
#4 0x0000560d74192e6d in zebra_vrf_get_table_with_table_id (afi=AFI_IP, safi=SAFI_UNICAST, vrf_id=<optimized out>, table_id=254) at zebra/zebra_vrf.c:335
#5 0x0000560d74186d20 in process_subq_early_route_add (ere=<optimized out>) at zebra/zebra_rib.c:2649
#6 process_subq_early_route (lnode=0x560d7783c5f0) at zebra/zebra_rib.c:3127
#7 process_subq (qindex=META_QUEUE_EARLY_ROUTE, subq=0x560d75cb1e40) at zebra/zebra_rib.c:3150
#8 meta_queue_process (dummy=<optimized out>, data=0x560d75cba680) at zebra/zebra_rib.c:3202
#9 0x00007f63f5c84550 in work_queue_run (thread=0x7ffec63233d0) at lib/workqueue.c:285
#10 0x00007f63f5c7a4c1 in thread_call (thread=thread@entry=0x7ffec63233d0) at lib/thread.c:2008
#11 0x00007f63f5c32088 in frr_run (master=0x560d75acbf50) at lib/libfrr.c:1216
#12 0x0000560d7411c8f7 in main (argc=<optimized out>, argv=0x7ffec63237a8) at zebra/main.c:499

Below is an analysis of the sequence of events that led to the zebra crash:

- Configs, including VRF configs, were deleted in zebra.
- Some route messages for the deleted VRF were still in the zebra metaq waiting to be processed.
- When such a route message was dequeued for processing, the VRF had already been deleted.
- The zvrf lookup for the vrf_id in the route entry failed, but the NULL return was not checked, resulting in a SIGSEGV when it was dereferenced later.
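The guard named in the PR title can be illustrated with a minimal, self-contained sketch. This is not the actual diff from this PR: every `sketch_` type and function below is a hypothetical stand-in for zebra's real structures, and the toy registry only models "the VRF was deleted before the queued route was processed".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for zebra's VRF and route-table types. */
struct sketch_table {
	uint32_t table_id;
};

struct sketch_zvrf {
	uint32_t vrf_id;
	struct sketch_table main_table;
};

#define SKETCH_MAX_VRFS 4

/* Toy VRF registry: a NULL slot means the VRF has been deleted. */
static struct sketch_zvrf *sketch_vrfs[SKETCH_MAX_VRFS];

/* Returns NULL when the VRF does not (or no longer) exist. */
static struct sketch_zvrf *sketch_vrf_lookup(uint32_t vrf_id)
{
	if (vrf_id >= SKETCH_MAX_VRFS)
		return NULL;
	return sketch_vrfs[vrf_id];
}

/* Dereferences zvrf unconditionally; passing NULL here is the crash. */
static struct sketch_table *sketch_router_get_table(struct sketch_zvrf *zvrf,
						    uint32_t table_id)
{
	zvrf->main_table.table_id = table_id;
	return &zvrf->main_table;
}

/* Guarded variant: skip the table lookup when the VRF is gone. */
static struct sketch_table *sketch_get_table_with_table_id(uint32_t vrf_id,
							   uint32_t table_id)
{
	struct sketch_zvrf *zvrf = sketch_vrf_lookup(vrf_id);

	if (zvrf == NULL)
		return NULL;

	return sketch_router_get_table(zvrf, table_id);
}
```

With this shape, a caller that dequeues a stale route message gets NULL back and can drop the message instead of crashing; the caller still has to handle that NULL itself.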

Signed-off-by: Jenny Yuan <[email protected]>

mjstapp

Thanks - just had one question

zebra/zebra_vrf.c

donaldsharp · 2024-09-18T18:58:10Z

I'm pretty sure that this crash has already been solved via a different methodology. Look at the current VRF shutdown code; I am pretty sure we now iterate the early route queue and remove the routes that match that VRF.

donaldsharp · 2024-09-18T18:58:33Z

In other words, I would like to see this crash recreated on the latest master before we accept this, and I think removing the early route entry from the metaQ should be the actual fix instead.
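The alternative described in these two comments, purging queued entries for a VRF when that VRF goes away, could look roughly like the sketch below. It is only an illustration under stated assumptions: the singly linked queue and the `sketch_` names are hypothetical, and this is not the code from zebra's actual VRF shutdown path or from the referenced fix.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical queued route message; not zebra's real early-route type. */
struct sketch_ere {
	uint32_t vrf_id;
	struct sketch_ere *next;
};

/* Push a new entry at the head of the queue. Returns the new head,
 * or the unchanged head if allocation fails. */
static struct sketch_ere *sketch_queue_push(struct sketch_ere *head,
					    uint32_t vrf_id)
{
	struct sketch_ere *e = malloc(sizeof(*e));

	if (e == NULL)
		return head;

	e->vrf_id = vrf_id;
	e->next = head;
	return e;
}

/* Unlink and free every queued entry that belongs to vrf_id.
 * Returns the number of entries removed. */
static size_t sketch_queue_purge_vrf(struct sketch_ere **head, uint32_t vrf_id)
{
	size_t removed = 0;
	struct sketch_ere **link = head;

	while (*link != NULL) {
		struct sketch_ere *cur = *link;

		if (cur->vrf_id == vrf_id) {
			*link = cur->next; /* unlink before freeing */
			free(cur);
			removed++;
		} else {
			link = &cur->next;
		}
	}

	return removed;
}
```

Calling the purge before the VRF is torn down means no stale entry is ever dequeued, so the lookup path never sees a deleted VRF in the first place.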

jyuan-panw · 2024-09-20T04:57:24Z

Hi @donaldsharp, thanks for your review and comments. It looks like this fix (e53fa58) would address the crash we saw in zebra.
We will re-validate zebra with this fix after we upgrade, and will close this PR for now. Thanks!

frrbot bot added the zebra label Sep 18, 2024

github-actions bot added master size/XS labels Sep 18, 2024

mjstapp reviewed Sep 18, 2024

jyuan-panw closed this Sep 20, 2024