Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zebra: Skip route table lookup if zvrf is NULL #16858

Closed

Conversation

jyuan-panw
Copy link

A crash was observed in 8.4.4 zebra when FRR processes were being shutting down by automated script:

Thread 1 (LWP 16051):
#0 0x00007f63f449bb8f in raise () from /lib64/libpthread.so.0
#1 0x00007f63f5c68300 in core_handler (signo=11, siginfo=0x7ffec6322cb0, context=) at lib/sigevent.c:261
#2
#3 zebra_router_get_table (zvrf=zvrf@entry=0x0, tableid=tableid@entry=254, afi=afi@entry=AFI_IP, safi=safi@entry=SAFI_UNICAST) at /usr/include/bits/string_fortified.h:71
#4 0x0000560d74192e6d in zebra_vrf_get_table_with_table_id (afi=AFI_IP, safi=SAFI_UNICAST, vrf_id=, table_id=254) at zebra/zebra_vrf.c:335
#5 0x0000560d74186d20 in process_subq_early_route_add (ere=) at zebra/zebra_rib.c:2649
#6 process_subq_early_route (lnode=0x560d7783c5f0) at zebra/zebra_rib.c:3127
#7 process_subq (qindex=META_QUEUE_EARLY_ROUTE, subq=0x560d75cb1e40) at zebra/zebra_rib.c:3150
#8 meta_queue_process (dummy=, data=0x560d75cba680) at zebra/zebra_rib.c:3202
#9 0x00007f63f5c84550 in work_queue_run (thread=0x7ffec63233d0) at lib/workqueue.c:285
#10 0x00007f63f5c7a4c1 in thread_call (thread=thread@entry=0x7ffec63233d0) at lib/thread.c:2008
#11 0x00007f63f5c32088 in frr_run (master=0x560d75acbf50) at lib/libfrr.c:1216
#12 0x0000560d7411c8f7 in main (argc=, argv=0x7ffec63237a8) at zebra/main.c:499

Below is analysis for the sequence of events which led to zebra crash:

  • configs including VRF configs were deleted in zebra
  • Some route messages for the deleted VRF were still in the zebra metaq waiting to be processed
  • when the route message was dequeued for processing, the VRF was already deleted
  • lookup of zvrf failed for the route vrf_id in route-entry, but the NULL return was not checked, resulted in SIGSEGV crash when it was dereferenced later

A crash was observed in 8.4.4 zebra when FRR processes were being shutting down by
automated script:

  Thread 1 (LWP 16051):
  #0  0x00007f63f449bb8f in raise () from /lib64/libpthread.so.0
  FRRouting#1  0x00007f63f5c68300 in core_handler (signo=11, siginfo=0x7ffec6322cb0, context=<optimized out>) at lib/sigevent.c:261
  FRRouting#2  <signal handler called>
  FRRouting#3  zebra_router_get_table (zvrf=zvrf@entry=0x0, tableid=tableid@entry=254, afi=afi@entry=AFI_IP, safi=safi@entry=SAFI_UNICAST) at /usr/include/bits/string_fortified.h:71
  FRRouting#4  0x0000560d74192e6d in zebra_vrf_get_table_with_table_id (afi=AFI_IP, safi=SAFI_UNICAST, vrf_id=<optimized out>, table_id=254) at zebra/zebra_vrf.c:335
  FRRouting#5  0x0000560d74186d20 in process_subq_early_route_add (ere=<optimized out>) at zebra/zebra_rib.c:2649
  FRRouting#6  process_subq_early_route (lnode=0x560d7783c5f0) at zebra/zebra_rib.c:3127
  FRRouting#7  process_subq (qindex=META_QUEUE_EARLY_ROUTE, subq=0x560d75cb1e40) at zebra/zebra_rib.c:3150
  FRRouting#8  meta_queue_process (dummy=<optimized out>, data=0x560d75cba680) at zebra/zebra_rib.c:3202
  FRRouting#9  0x00007f63f5c84550 in work_queue_run (thread=0x7ffec63233d0) at lib/workqueue.c:285
  FRRouting#10 0x00007f63f5c7a4c1 in thread_call (thread=thread@entry=0x7ffec63233d0) at lib/thread.c:2008
  FRRouting#11 0x00007f63f5c32088 in frr_run (master=0x560d75acbf50) at lib/libfrr.c:1216
  FRRouting#12 0x0000560d7411c8f7 in main (argc=<optimized out>, argv=0x7ffec63237a8) at zebra/main.c:499

Below is analysis for the sequence of events which led to zebra crash:

 - configs including VRF configs were deleted in zebra
 - Some route messages for the deleted VRF were still in the zebra metaq waiting
   to be processed
 - when the route message was dequeued for processing, the VRF was already deleted
 - lookup of zvrf failed for the route vrf_id in route-entry, but the NULL return
   was not checked, resulted in SIGSEGV crash when it was dereferenced later

Signed-off-by: Jenny Yuan <[email protected]>
Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - just had one question

zebra/zebra_vrf.c Show resolved Hide resolved
@donaldsharp
Copy link
Member

I'm pretty sure that this crash has already been solved via a different methodology. Look in current vrf shutdown code, I am pretty sure we now iterate the early route queue and remove those routes that match that vrf.

@donaldsharp
Copy link
Member

donaldsharp commented Sep 18, 2024

In other words I would like to see a recreate of this crash in latest master before we accept this and I think the early route entry should be removed from the metaQ as the actual fix instead.

@jyuan-panw
Copy link
Author

Hi @donaldsharp , thanks for your review and comments. looks like this fix (e53fa58) would address the crash we saw in zebra.
We will re-validate zebra with this fix after we upgrade later, will close this PR for now. Thanks!

@jyuan-panw jyuan-panw closed this Sep 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants