Enhanced supervisord to exit immediately if one of its managed processes crashes, which stops the respective Docker container. The container is then restarted gracefully. #2208

Closed
wants to merge 1 commit into from

samaity
Collaborator

@samaity samaity commented Oct 29, 2018

- What I did

  • Enhanced supervisord to exit immediately if one of its managed processes crashes, which causes
    the respective Docker container to stop. The container is then restarted gracefully.

- Little Background

Supervisor provides a way for a specially written program (which it runs as a subprocess) called an “event listener” to subscribe to “event notifications”. An event notification implies that something happened related to a subprocess controlled by supervisord or to supervisord itself.
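For readers unfamiliar with the notification format: supervisord writes each event to the listener's stdin as one header line of space-separated `key:value` tokens, followed by a payload whose length in bytes is given by the header's `len` token. The sketch below parses one such notification. The helper names and the sample values are illustrative assumptions, not taken from this PR, and it assumes an ASCII payload made only of `key:value` tokens, as is the case for process state events.

```python
import io


def parse_tokens(text):
    """Split 'key:value key:value ...' text into a dict of strings."""
    return dict(token.split(":", 1) for token in text.split())


def read_event(stream):
    """Read one event notification (header line plus payload) from a text stream."""
    header = parse_tokens(stream.readline())
    # 'len' is the payload size; for ASCII payloads bytes and characters coincide.
    payload = parse_tokens(stream.read(int(header["len"])))
    return header, payload


# Hypothetical notification, shaped like the one a crashed portmgrd might produce.
sample_payload = "processname:portmgrd groupname:portmgrd from_state:RUNNING expected:0 pid:168"
sample_header = (
    "ver:3.0 server:supervisor serial:21 pool:kill_supervisor "
    "poolserial:10 eventname:PROCESS_STATE_EXITED len:%d\n" % len(sample_payload)
)
header, payload = read_event(io.StringIO(sample_header + sample_payload))
print(header["eventname"], payload["processname"], payload["expected"])
```

An `expected:0` token in a `PROCESS_STATE_EXITED` payload indicates that the exit code was not one of the process's configured expected exit codes.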

- How I did it

supervisord now runs an event listener process, kill_supervisor. It is a Python implementation of a “long-running” event listener that accepts event notifications and kills supervisord if one of its managed processes crashes.
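A minimal sketch of what such a long-running listener could look like is given here. It is not the script from this PR: the function names, the choice of SIGTERM, and the rule for what counts as a crash (`PROCESS_STATE_FATAL`, or `PROCESS_STATE_EXITED` with `expected:0`) are assumptions made for illustration. The `READY` and `RESULT 2\nOK` handshake follows the supervisor event listener protocol.

```python
import os
import signal
import sys


def parse_tokens(text):
    """Split 'key:value key:value ...' text into a dict of strings."""
    return dict(token.split(":", 1) for token in text.split())


def is_crash(eventname, payload):
    """Assumed policy: FATAL, or an exit that supervisord did not expect."""
    if eventname == "PROCESS_STATE_FATAL":
        return True
    return eventname == "PROCESS_STATE_EXITED" and payload.get("expected") == "0"


def kill_supervisord():
    """The listener is a child of supervisord, so the parent PID is supervisord."""
    os.kill(os.getppid(), signal.SIGTERM)


def run_listener(stdin=sys.stdin, stdout=sys.stdout, on_crash=kill_supervisord):
    """Handle notifications until stdin is closed."""
    while True:
        # Tell supervisord this listener can accept the next event.
        stdout.write("READY\n")
        stdout.flush()
        line = stdin.readline()
        if not line:
            return  # supervisord closed the pipe
        header = parse_tokens(line)
        payload = parse_tokens(stdin.read(int(header["len"])))
        if is_crash(header["eventname"], payload):
            on_crash()
        # Acknowledge the event so supervisord does not buffer and resend it.
        stdout.write("RESULT 2\nOK")
        stdout.flush()
```

To run this under supervisord, the script would end with a call to `run_listener()`, and the `[eventlistener:kill_supervisor]` section would point its `command` at the script and subscribe to the relevant events. The stream and callback parameters exist only so the loop can be exercised without a real supervisord.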

- Disclaimer

  • Modified supervisord.conf under the syncd docker folder for every platform, but tested only on the Broadcom platform.

- How to verify it

Killed "portmgrd". As a result, the swss docker container stopped.

admin@lnos-x1-a-asw01:~$ docker exec -it swss bash
root@lnos-x1-a-asw01:/# supervisorctl  status
arp_update                       STOPPED   Not started
buffermgrd                       RUNNING   pid 338, uptime 5:54:06
enable_counters                  EXITED    Oct 29 05:13 PM
intfmgrd                         RUNNING   pid 157, uptime 5:54:09
intfsyncd                        RUNNING   pid 128, uptime 5:54:14
kill_supervisor                  RUNNING   pid 18, uptime 5:54:20
neighsyncd                       RUNNING   pid 131, uptime 5:54:12
orchagent                        RUNNING   pid 47, uptime 5:54:16
portmgrd                         RUNNING   pid 168, uptime 5:54:07
portsyncd                        RUNNING   pid 59, uptime 5:54:15
rsyslogd                         RUNNING   pid 42, uptime 5:54:17
start.sh                         EXITED    Oct 29 05:12 PM
swssconfig                       EXITED    Oct 29 05:12 PM
vlanmgrd                         RUNNING   pid 145, uptime 5:54:10
vrfmgrd                          RUNNING   pid 369, uptime 5:54:04
root@lnos-x1-a-asw01:/# kill 168
root@lnos-x1-a-asw01:/# admin@lnos-x1-a-asw01:~$ 
admin@lnos-x1-a-asw01:~$ docker ps -a
CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS                     PORTS               NAMES
42ff33678e91        docker-snmp-sv2:latest            "/usr/bin/supervisord"   3 days ago          Up 5 hours                                     snmp
bb88aacbaa88        docker-orchagent-brcm:latest      "/usr/bin/supervisord"   3 days ago          Exited (0) 5 seconds ago                       swss
4070498c4996        docker-syncd-brcm:latest          "/usr/bin/supervisord"   3 days ago          Exited (0) 3 seconds ago                       syncd
b52ea3733065        docker-dhcp-relay:latest          "/usr/bin/docker_init"   3 days ago          Exited (0) 2 seconds ago                       dhcp_relay
41c9bad718e0        docker-router-advertiser:latest   "/usr/bin/supervisord"   3 days ago          Exited (0) 1 seconds ago                       radv
19eefe013033        docker-sonic-telemetry:latest     "/usr/bin/supervisord"   3 days ago          Exited (0) 3 seconds ago                       telemetry
9ef5a032d189        docker-fpm-quagga:latest          "/usr/bin/supervisord"   3 days ago          Up 3 days                                      bgp
ca18b2d768f0        docker-lldp-sv2:latest            "/usr/bin/supervisord"   3 days ago          Up 3 days                                      lldp
5235acdf79d8        docker-platform-monitor:latest    "/usr/bin/supervisord"   3 days ago          Up 3 days                                      pmon
b2ccf545ba86        docker-teamd:latest               "/usr/bin/supervisord"   3 days ago          Up 3 days                                      teamd
b6da9b2fe970        docker-database:latest            "/usr/bin/supervisord"   3 days ago          Up 3 days                                      database

With the "restart" feature in every service, the swss container was restarted gracefully.


admin@lnos-x1-a-asw01:~$ docker ps -a
CONTAINER ID        IMAGE                             COMMAND                  CREATED             STATUS              PORTS               NAMES
42ff33678e91        docker-snmp-sv2:latest            "/usr/bin/supervisord"   3 days ago          Up 47 seconds                           snmp
bb88aacbaa88        docker-orchagent-brcm:latest      "/usr/bin/supervisord"   3 days ago          Up 45 seconds                           swss
4070498c4996        docker-syncd-brcm:latest          "/usr/bin/supervisord"   3 days ago          Up 42 seconds                           syncd
b52ea3733065        docker-dhcp-relay:latest          "/usr/bin/docker_init"   3 days ago          Up 47 seconds                           dhcp_relay
41c9bad718e0        docker-router-advertiser:latest   "/usr/bin/supervisord"   3 days ago          Up 47 seconds                           radv
19eefe013033        docker-sonic-telemetry:latest     "/usr/bin/supervisord"   3 days ago          Up 47 seconds                           telemetry
9ef5a032d189        docker-fpm-quagga:latest          "/usr/bin/supervisord"   3 days ago          Up 3 days                               bgp
ca18b2d768f0        docker-lldp-sv2:latest            "/usr/bin/supervisord"   3 days ago          Up 3 days                               lldp
5235acdf79d8        docker-platform-monitor:latest    "/usr/bin/supervisord"   3 days ago          Up 3 days                               pmon
b2ccf545ba86        docker-teamd:latest               "/usr/bin/supervisord"   3 days ago          Up 3 days                               teamd
b6da9b2fe970        docker-database:latest            "/usr/bin/supervisord"   3 days ago          Up 3 days                               database

@stcheng
Contributor

stcheng commented Oct 30, 2018

Could you shorten the title of the commit? The descriptive text could go in the commit message body instead of the title.

@@ -65,3 +65,9 @@ stderr_logfile=syslog
{% endfor %}
{% endif %}
{% endif %}

[eventlistener:kill_supervisor]
Contributor

@jleveque jleveque Oct 30, 2018

I think the [eventlistener:...] section(s) should be listed immediately after the [supervisord] section; before any [program:...] sections. This applies to all supervisor config files. This way, the [eventlistener:...] section will be in a consistent position in the file for all containers.
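As one way to keep that ordering consistent across containers, a small check could scan each supervisor config for section headers and confirm that every `[eventlistener:...]` section comes after `[supervisord]` and before the first `[program:...]` section. This is a hypothetical helper, not part of the PR, and it checks only the relative order rather than strict adjacency. It scans lines instead of using an INI parser so that Jinja2 template lines such as `{% endfor %}` are simply skipped.

```python
import re

# Matches a line that consists only of an INI section header, e.g. [program:portmgrd].
SECTION_RE = re.compile(r"^\[([^\]]+)\]$")


def section_names(conf_text):
    """Return section names in file order, ignoring non-header lines."""
    names = []
    for line in conf_text.splitlines():
        match = SECTION_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def listeners_well_placed(conf_text):
    """True if all eventlistener sections sit after [supervisord] and before any program."""
    names = section_names(conf_text)
    supervisord_at = names.index("supervisord") if "supervisord" in names else -1
    program_at = [i for i, n in enumerate(names) if n.startswith("program:")]
    first_program = program_at[0] if program_at else len(names)
    return all(
        supervisord_at < i < first_program
        for i, n in enumerate(names)
        if n.startswith("eventlistener:")
    )
```

A config with the listener section appended at the end of the file, as in the hunk above, would fail this check whenever any `[program:...]` section precedes it.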

@lguohan
Collaborator

lguohan commented May 2, 2019

superseded by #2845

@lguohan lguohan closed this May 2, 2019
prsunny pushed a commit that referenced this pull request Mar 30, 2022
* [202012] - sonic-swss submodule update to include following commits:

fca407a (HEAD) [VNET]Fixing nexthop group delete during route change (#2198)
a9b6b47 [vxlan] Remove tunnel map objects on VNET tunnel removal (#2208)
74e9b9f [FdbOrch] SAI_FDB_EVENT_MOVE generates update with empty update.entry.port_name (#2201)
0a99445 [202012][BFD]Registering BFD state change callback during session creation (#2203)
aebe4a1 [VS test] skip dpb flaky test (#2195) (#2207)