diff --git a/doc/warm-reboot/SONiC_Warmboot.md b/doc/warm-reboot/SONiC_Warmboot.md new file mode 100644 index 0000000000..324a096b3b --- /dev/null +++ b/doc/warm-reboot/SONiC_Warmboot.md @@ -0,0 +1,392 @@ +# SONiC Warm Reboot + +Table of Contents +================= +* [Overview](#overview) +* [Use cases](#use-cases) + * [In\-Service restart](#in-service-restart) + * [Un\-Planned restart](#un-planned-restart) + * [BGP docker restart](#bgp-docker-restart) + * [SWSS docker restart](#swss-docker-restart) + * [Syncd docker restart](#syncd-docker-restart) + * [Teamd docker restart](#teamd-docker-restart) + * [In\-Service upgrade](#in-service-upgrade) + * [Case 1: without SAI api call change](#case-1-without-sai-api-call-change) + * [Case 2: with SAI api call change](#case-2-with-sai-api-call-change) + * [Case 2\.1 attribute change with SET](#case-21-attribute-change-with-set) + * [Case 2\.2 Object change with REMOVE](#case-22-object-change-with-remove) + * [Case 2\.3 Object change with CREATE](#case-23-object-change-with-create) + * [case 2\.3\.1 New SAI object](#case-231-new-sai-object) + * [case 2\.3\.2 Old object in previous version to be replaced with new object in new software version](#case-232-old-object-in-previous-version-to-be-replaced-with-new-object-in-new-software-version) + * [Cold restart fallback](#cold-restart-fallback) +* [Proposal 1: Reconciliation at Orchagent](#proposal-1-reconciliation-at-orchagent) + * [Key steps](#key-steps) + * [Restore to original state](#restore-to-original-state) + * [Remove stale date and perform new update](#remove-stale-date-and-perform-new-update) + * [States of ASIC/LibSAI, syncd, orchagent and applications become consistent and up to date\.](#states-of-asiclibsai-syncd-orchagent-and-applications-become-consistent-and-up-to-date) + * [Questions](#questions) + * [How syncd restores to the state of pre\-shutdown](#how-syncd-restores-to-the-state-of-pre-shutdown) + * [How Orchagent manages data dependencies during state restore](#how-orchagent-manages-data-dependencies-during-state-restore) + * [What is missing in Orchagent for it to restore to the state of pre\-shutdown](#what-is-missing-in-orchagent-for-it-to-restore-to-the-state-of-pre-shutdown) + * [How Orchagent gets the OID information](#how-orchagent-gets-the-oid-information) + * [How to handle the cases of SAI api call change during restore phase\.](#how-to-handle-the-cases-of-sai-api-call-change-during-restore-phase) + * [How to deal with the missing notification during the reboot/restart window](#how-to-deal-with-the-missing-notification-during-the-rebootrestart-window) + * [Requirements on LibSAI and ASIC](#requirements-on-libsai-and-asic) + * [Requirements on syncd](#requirements-on-syncd) + * [Requirement on network applications and orch data](#requirement-on-network-applications-and-orch-data) + * [General requirements](#general-requirements) + * [Port](#port) + * [Lag/teamd](#lagteamd) + * [Interface](#interface) + * [Fdb](#fdb) + * [Arp](#arp) + * [Route](#route) + * [Acl](#acl) + * [Buffer](#buffer) + * [Qos](#qos) + * [Summary](#summary) + * [Approach evaluation](#approach-evaluation) + * [Advantages](#advantages) + * [Concerns/Issues with this approach](#concernsissues-with-this-approach) +* [Proposal 2: Reconciliation at syncd](#proposal-2-reconciliation-at-syncd) + * [The existing syncd INIT/APPLY view framework](#the-existing-syncd-initapply-view-framework) + * [Invariants for view comparison](#invariants-for-view-comparison) + * [Switch internal objects discovered vis SAI get 
operation\.](#switch-internal-objects-discovered-vis-sai-get-operation) + * [Configured attribute values like VLAN id, interface IP and etc\.](#configured-attribute-values-like-vlan-id-interface-ip-and-etc) + * [View comparison logic](#view-comparison-logic) + * [Orchagent and network application layer processing](#orchagent-and-network-application-layer-processing) + * [Approach evaluation](#approach-evaluation-1) + * [Advantages](#advantages-1) + * [Concerns/Issues with this approach](#concernsissues-with-this-approach-1) +* [Open issues](#open-issues) + * [How to do version control for software upgrade at docker level?](#how-to-do-version-control-for-software-upgrade-at-docker-level) + * [Rollback support in SONiC](#rollback-support-in-sonic) + * [What is the requirement on control plane down time?](#what-is-the-requirement-on-control-plane-down-time) + * [Upgrade path with warm reboot support](#upgrade-path-with-warm-reboot-support) + * [Latency requirement on LibSAI/SDK warm restart](#latency-requirement-on-libsaisdk-warm-restart) + * [Backward compatibility requirement on SAI/LibSAI/SDK?](#backward-compatibility-requirement-on-sailibsaisdk) + * [What is the requirment on LibSAI/SDK with regards to data plane traffic during warm reboot? Could FDB be flushed?](#what-is-the-requirment-on-libsaisdk-with-regards-to-data-plane-traffic-during-warm-reboot-could-fdb-be-flushed) + * [What are the the principles of warm reboot support for SONiC?](#what-are-the-the-principles-of-warm-reboot-support-for-sonic) +* [References](#references) + + +# Overview +The goal of SONiC warm reboot is to be able restart and upgrade SONiC software without impacting the data plane. +Warm restart of each individual process/docker is also part of the goal. Except for syncd and database docker, it is desired for all other network applications and dockers to support un-planned warm restart. + +For restart processing, SONiC may be roughly divided into three layers: + +__Network applications and Orchagent__: Each application will experience similar processing flow. Application and corresponding orchagent sub modules need to work together to restore the orginal data and populate the delta for warm start. Take route as example, upon restart operation, network application BGP performs graceful restart and gets synchronized with the latest routing state via talking with peers, fpmsyncd uses the input from BGP to program appDB and it also deals with any stale/new routes besides those routes without change. RouteOrch responds to the operation requests from fpmsyncd and propagates any change down to syncd. + +__Syncd__: syncd should dump ASICDB before restart, and restore to the same state as pre-reboot. The restore of SONiC syncd itself should not disturb the state of ASIC. It takes changes from Orchagent and pass them down to LibSAI/ASIC after necessary transformation. + +__LibSAI/ASIC__: ASIC vendor needs to ensure the state of ASIC and libSAI restores to the same state as pre-reboot. + +![warm start layers](img/warm_start_layers.png) + +# Use cases + +## In-Service restart +The mechanism of restarting a component without impact to service. This assumes that the software version of the component has not changed after the restart. +There could be data changes like new/stale route, port state change, fdb change during restart window. + +Component here could be the whole SONiC system or just one or multiple of the dockers running in SONiC. 
+ +### Un-Planned restart +It is desired for all network applications and orchagent to be able to handle unplanned restart, and restore gracefully. It is not a requirement on syncd and ASIC/LibSAI due to dependency on ASIC processing. + +### BGP docker restart +After BGP docker restart, new routes may be learned from BGP peers and some routes which had been pushed down to APPDB and ASIC may be gone. The system should be able to clear the stale route from APPDB down to ASIC and program the new route. + +### SWSS docker restart +After swss docker restart, all the port/LAG, vlan, interface, arp and route data should be restored from configDB, APPDB, Linux Kernel and other reliable sources. +There could be port state, ARP, FDB changes during the restart window, proper sync processing should be performed. + +### Syncd docker restart +The restart of syncd docker should leave data plane intact. After restart, syncd resumes control of ASIC/LibSAI and communication with swss docker. All other functions which run in syncd docker should be restored too like flexcounter processing. + +### Teamd docker restart +The restart of teamd docker should not cause link flapping or any traffic loss. All lags at data plane should remain the same. + +## In-Service upgrade +The mechanism of upgrading to a newer version of a component without impacting service. + +Component here could be the whole SONiC system or just one or multiple of the dockers running in SONiC. + +### Case 1: without SAI api call change +There are software changes in network applications like BGP, neighsyncd, portsyncd and even orchagent, but the changes don’t have impact on the interface with syncd as to the organization of existing data (meta data and dependency graph). +There could be data changes like new/stale route, port state change, fdb change during restart window. + +All the processing for [In-Service Restart](#in-service-restart) applies here too. + +### Case 2: with SAI api call change +#### Case 2.1 attribute change with SET +New version of orchagent may cause SET api to use a different value for certain attribute compared with previous version. Or a new attribute SET will be called. +#### Case 2.2 Object change with REMOVE +Object that existed in previous version may be deleted by default in new software version. +#### Case 2.3 Object change with CREATE +Two scenarios: +##### case 2.3.1 New SAI object +This is the new object defined at SAI layer and CREATE call is triggered at orchagent in new version of software. + +##### case 2.3.2 Old object in previous version to be replaced with new object in new software version +Ex. Object will be created with more or less attributes or different attribute value, or multiple instance objects will be replaced with an aggregated object. +This is the most complex scenario, all other objects which have dependency on the old object should be cleaned up properly if the old object is not a leaf object. + +## Cold restart fallback +An option to do cold restart or warm restart through configuration for swss, syncd and teamd dockers should be provided. +Upon failure of warm restart, fallback mechanism to cold restart should be available. + + +# Proposal 1: Reconciliation at Orchagent +## Key steps +### Restore to original state +`a.` LibSAI/ASIC is able to restore to the state of pre-reboot without interrupting upper layer. +`b.` Syncd is able to restore to the state of pre-reboot without interrupting ASIC and upper layer. 
+`c.` Syncd state is driven by Orchagent (with exception of FDB), once it is restored, no need to perform reconciliation by itself. + +### Remove stale date and perform new update + +`a.` Based on the individual behavior of each network application, it either reads data from configDB, or get data from other sources like Linux kernel( ex. for port, ARP) and BGP protocal, then programs APPDB again. It keeps track of any stale data for removal. +Orchagent consumes the request from APPDB. + +`b.` Orchagent restores data from APPDB for applications running in other dockers like BGP and teamd to be able to handle the case of swss only restart, and ACL data from configDB. Orchagent ensures idempotent operation at LibSaiRedis interface via not passing down any create/remove/set operations on objects that had been performed before. + +Please note that, to reduce the dependency wait time in orchagent, loose order control is helpful. Ex. the restore of route may be done after port, lag, interface and ARP data is (mostly) processed. + +Each application is responsible for gathering any delta between pre and after restart, and performs create(new object), set, or remove(stale object) operations for the delta data. + +`c.` Syncd processes the request from Orchagent as in normal boot. +### States of ASIC/LibSAI, syncd, orchagent and applications become consistent and up to date. + + +## Questions +### How syncd restores to the state of pre-shutdown +In this approach syncd only needs to save and restore the mapping between object RID and VID. + +### How Orchagent manages data dependencies during state restore +The constructor of each orchagent subroutine may work as normal startup. + +Each application reads configDB data or restores data from Linux kernel or re-populate data via network protocols uppon restart, and progams appDB accordingly. Each network application and orchagent subroutine handle the dependency accordingly, which means some operation may be delayed until all required objects are ready. The dependency check has been part of existing implementation in orchagent, but new issues may pop up with this new scenario. + +To be able to handle the case of swss only restart, orchagent also restores route (for BGP docker) and portchannel data (for teamd docker) from APPDB directly besides subscribing to appDB consumer channnel. Loose order control for the data restore helps speed up the processing. + +### What is missing in Orchagent for it to restore to the state of pre-shutdown +Orchagent and application may get data from configDB and APPDB as normal startup, but to be able to in sync and communicate with syncd, it also needs OID for each object with key type of sai_object_id_t. + +``` +typedef struct _sai_object_key_t +{ + union _object_key { + sai_object_id_t object_id; + sai_fdb_entry_t fdb_entry; + sai_neighbor_entry_t neighbor_entry; + sai_route_entry_t route_entry; + sai_mcast_fdb_entry_t mcast_fdb_entry; + sai_l2mc_entry_t l2mc_entry; + sai_ipmc_entry_t ipmc_entry; + sai_inseg_entry_t inseg_entry; + } key; +} sai_object_key_t; +``` + +### How Orchagent gets the OID information +For SAI redis create operation of those objects with object key type of sai_object_id_t, Orchagent must be able to use the exact same OID as before shutdown, otherwise it will be out of sync with syncd. But current Orchagent implementation save OID in running time data struct only. + +For object ID previously fetched via sai redis get operation, the same method still works. 
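+
+To illustrate that point, below is a minimal sketch (not the actual orchagent code) of how OIDs for switch-discovered objects, ports in this example, can simply be re-fetched with the same SAI get operation after a warm restart; `gSwitchId` and `sai_switch_api` are assumed to be initialized by the normal startup path:
+
+```
+// Sketch: OIDs of objects created by the switch itself (ports, queues, ...)
+// do not need to be saved by orchagent; the get operation used at cold start
+// returns the same OIDs after a warm restart.
+#include <vector>
+extern "C" {
+#include "sai.h"
+}
+
+extern sai_object_id_t  gSwitchId;       // assumed initialized at startup
+extern sai_switch_api_t *sai_switch_api; // assumed initialized at startup
+
+std::vector<sai_object_id_t> rediscoverPortOids(uint32_t maxPorts)
+{
+    std::vector<sai_object_id_t> ports(maxPorts);
+
+    sai_attribute_t attr;
+    attr.id = SAI_SWITCH_ATTR_PORT_LIST;
+    attr.value.objlist.count = maxPorts;
+    attr.value.objlist.list  = ports.data();
+
+    if (sai_switch_api->get_switch_attribute(gSwitchId, 1, &attr) != SAI_STATUS_SUCCESS)
+    {
+        return {};
+    }
+
+    ports.resize(attr.value.objlist.count);  // actual number of ports reported
+    return ports;
+}
+```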
+ +One possible solution is to save the mapping between OID and attr_list at redis_generic_create(). This assumes that during restore, exact same attr_list will be used for object create, so same OID may be found and returned. + +When there is attribute change for the first time, the original default mapping could be saved in DEFAULT_ATTR2OID_ and DEFAULT_OID2ATTR_ tables. This is because during restore, object create may use the default attributes instead of current attribues. + +All new changes will be applied on the regular ATTR2OID_ and OID2ATTR_ mapping tables. + +For the case of multiple objects created for the same set of attributes, an extra owner identifier may be assigned for the mapping from attributes to OID, so each object is uniquely identifieable based on the owner context. One prominent example is using lag_alias as the lag owner so each lag may retrieve the the same OID during restart though NULL attribute is provided for lag create. + +``` ++ SET_OBJ_OWNER(lag_alias); + sai_status_t status = sai_lag_api->create_lag(&lag_id, gSwitchId, 0, NULL); ++ UNSET_OBJ_OWNER(); +``` + +Virtual OID should not be necessary in this solution. But it doesn’t hurt either if the virtual OID layer is kept. + +Idempotency is required for LibSaiRedis interface. + +### How to handle the cases of SAI api call change during restore phase. +[Case 2\.1 attribute change with SET](#case-21-attribute-change-with-set) : at the sai_redis_generic_set layer, based on the object key, compare attribute value and apply the change directly down to syncd/libsai/ASIC. + +[Case 2\.2 Object change with REMOVE](#case-22-object-change-with-remove): at the sai_redis_gereric_remove layer, if the object key found in restoreDB, apply remove SAI api call directly down to syncd/libsai/ASIC. Dependency has been guaranteed at orchagent. + +[Case 2\.3 Object change with CREATE](#case-23-object-change-with-create): + +[case 2\.3\.1 New SAI object](#case-231-new-sai-object): Just apply the SAI API create operation down to syncd/libsai/ASIC. Dependency has been guaranteed at orchagent. But if it is not a leaf object, there will be cascading effect on other objects which has dependency on it when being created, which will be handled in next used case scenario. If the new SAI object is only used as an attribute in SET call for other objects, it could be handled in Case 2.1 attribute change with SET. + +[case 2\.3\.2 Old object in previous version to be replaced with new object in new software version](#case-232-old-object-in-previous-version-to-be-replaced-with-new-object-in-new-software-version): If this is a leaf object like route entry, neighbor entry, or fdb entry, just add version specific logic to remove it and create the new one. +Otherwise if there are other objects which have to use this object as one of the attributes during create call, those objects should be deleted first before deleting this old object. Version specific logic is needed here. + +### How to deal with the missing notification during the reboot/restart window +Port/fdb may have new state notification during reboot window? Probably the corresponding orchagent subroutine should perform get operation for the objects? + +## Requirements on LibSAI and ASIC +LibSAI and ASIC should be able to save all necessary state upon shutdown request with warm restart option. +Upon create_switch() request, LibSAI/ASIC should restore to the exact state of pre-shutdown. +Data plane should not be affected during the whole restore process. 
+Once restore is finished, LibSAI/ASIC works in normal operation state, they are agnostic of any warm restart processing happening in upper layer. +It is desired to support idempotency for create/remove/set in LibSAI, but may not be absolutely necessary for warm reboot solution. + +## Requirements on syncd +Syncd should be able to save all necessary state upon shutdown request with warm restart option. +At the restart syncd should restore to the exact state of pre-shutdown. +Once restore is finished, syncd works in normal operation state, it is agnostic of any warm restart processing happening in upper layer. + +## Requirement on network applications and orch data +### General requirements + +Each application should be able to restore to the state of pre-shutdown. + +Orchagent must be able to save and restore OID for objects created by Orchagent and with object key type of sai_object_id_t. Other objects not created by Orchagent may restore OID via get operation of libsairedis interfaces. + +The orchagent sub-routine of each application could use existing normal constructor and producerstate/consumerstate handling flow to ensure dependency and populate internal data structure. + +In case docker restart of swss only, it should be able to restore route and lag data from appDB directly since bgp docker and teamd docker wouldn't provision the whole set of data again to appDB in this scenario. + +After state restore, each application should be able to remove any stale object/state and perform any needed create/set, orchagent process the request as normal. + +### Port +### Lag/teamd +### Interface +### Fdb +### Arp +### Route +### Acl +### Buffer +### Qos +### … + +## Summary + +| Layer | Restore |Reconciliation |Idempotency |Dependency management +| ---- | ---- | ---- | ---- | ---- | +| Application/Orchagent| Y | Y |Y for LibSaiRedis interface | Y +| Syncd | Y | N |Good to have |Good to have +| LibSAI/ASIC | Y | N |Good to have |Good to have + + +## Approach evaluation +### Advantages +* straightforward logic, simple to implement for most upgrade/restart cases. +* Layer/applications decoupled, easy to divide and conquer. +* Each docker self contained, is prepared for unplanned warm restart of swss process and other network applications. + +### Concerns/Issues with this approach +* Orchagent software upgrade could be handy, especially for the cases of SAI object replace which requires Orchagent to have use-once code to handle them for in service upgrade. + +# Proposal 2: Reconciliation at syncd +## The existing syncd INIT/APPLY view framework +Essentially there will be two views created for warm restart. The current view represents the ASIC state before shutdown, temp view represents the new intended ASIC state after restart. +Based on the SAI object data model, each view is a directed acyclic graph, all(?) objects are linked together. +### Invariants for view comparison +#### Switch internal objects discovered vis SAI get operation. +They include SAI_OBJECT_TYPE_PORT, SAI_OBJECT_TYPE_QUEUE, SAI_OBJECT_TYPE_SCHEDULER_GROUP, SAI_OBJECT_TYPE_SCHEDULER_GROUP and a few more. +It is assumed that the RID/VID for those objects keep the same. +`Question 1`: what if there is change with those discovered object after version change? + +`Question 2`: what if some of the discovered objects got changed? Like dynamic port breakout case. + +#### Configured attribute values like VLAN id, interface IP and etc. + There could be change to the configured value, those not being changed may work as invariants. 
+`Question 3`: could some virtual OIDs for created objects in tmp view coincidently match with object in current view, but the objects are different? matchOids(). + +### View comparison logic +Utilizing the meta data of object, with those invariants as anchor points, for each object in temp view, it starts as root of a tree and go down to all layer of children node until leaf to find best match. +If no match is found, the object in temp view should be created, it is object CREATE operation. +If best match is found, but there is attributes different between the object in temp view and current view, SET operation should be performed. +Exact match yields Temp VID to Current VID translation, which also paves the way for upper layer comparison. +All objects in current VIEW which have reference count 0 at the end should be deleted, REMOVE operation. + +`Question 4`: how to handle two objects with exactly same attributes? Example: overlay loopback RIF and underlay loopback RIF. VRF and possibly some other object in same situation? + +`Question 5`: New version of software call create() API with one extra attribute, how will that be handled? Old way of create() plus set() for the extra attribute, or delete the existing object then create a brand new one? + +`Question 6`: findCurrentBestMatchForGenericObject(), the method looks dynamic. What we need is deterministic processing which matches exactly what orchagent will do (if same operation is to be done there instead), no new unnecessary REMOVE/SET/CREATE, how to guarantee that? + +## Orchagent and network application layer processing +Except for the idempotency support of create/set/remove operation at libsairedis interface, this proposal requires the same processing as in proposal 1, like original data restore and appDB stale data removal by each individual applications as needed. + +One possible but kind of extreme solution is: Always flush all related appDB tables or even the whole appDB when there is application restart, and let each application re-populate new data from scratch. The new set of data is then pushed down to syncd. syncd does the comparison logic between the old data and new data. + +## Approach evaluation +### Advantages +* Generic processing based on SAI object model. +* No change to libsairedis library implementation, no need to restore OID at orchagent layer. + +### Concerns/Issues with this approach +* Highly complex logic in syncd +* Warm restart of upper layer applications closely coupled with syncd. +* Various corner cases from SAI object model and changes in SAI object model itself have to be handled. + +# Open issues + +## How to do version control for software upgrade at docker level? + +`Show version` command is able to retrieve the version data for each docker. Furher extention may be based on that. 
+ +``` +root@PSW-A2-16-A02.NA62:/home/admin# show version +SONiC Software Version: SONiC.130-14f14a1 +Distribution: Debian 8.1 +Kernel: 3.16.0-4-amd64 +Build commit: 14f14a1 +Build date: Wed May 23 09:12:22 UTC 2018 +Built by: jipan@ubuntu01 + +Docker images: +REPOSITORY TAG IMAGE ID SIZE +docker-fpm-quagga latest 0f631e0fb8d0 390.4 MB +docker-syncd-brcm 130-14f14a1 4941b40cc8e7 444.4 MB +docker-syncd-brcm latest 4941b40cc8e7 444.4 MB +docker-orchagent-brcm 130-14f14a1 40d4a1c08480 386.6 MB +docker-orchagent-brcm latest 40d4a1c08480 386.6 MB +docker-lldp-sv2 130-14f14a1 f32d15dd4b77 382.7 MB +docker-lldp-sv2 latest f32d15dd4b77 382.7 MB +docker-dhcp-relay 130-14f14a1 df7afef22fa0 378.2 MB +docker-dhcp-relay latest df7afef22fa0 378.2 MB +docker-database 130-14f14a1 a4a6ba6874c7 377.7 MB +docker-database latest a4a6ba6874c7 377.7 MB +docker-snmp-sv2 130-14f14a1 89d249faf6c4 444 MB +docker-snmp-sv2 latest 89d249faf6c4 444 MB +docker-teamd 130-14f14a1 b127b2dd582d 382.8 MB +docker-teamd latest b127b2dd582d 382.8 MB +docker-sonic-telemetry 130-14f14a1 89f4e1bb1ede 396.1 MB +docker-sonic-telemetry latest 89f4e1bb1ede 396.1 MB +docker-router-advertiser 130-14f14a1 6c90b2951c2c 375.4 MB +docker-router-advertiser latest 6c90b2951c2c 375.4 MB +docker-platform-monitor 130-14f14a1 29ef746feb5a 397 MB +docker-platform-monitor latest 29ef746feb5a 397 MB +docker-fpm-quagga 130-14f14a1 5e87d0ae9190 389.4 MB +``` +## Rollback support in SONiC +This is a general requirement not limited to warm reboot. Probably a separate design document should be prepared for this topic. + +## What is the requirement on control plane down time? +Currently there is no hard requirement on the down time of control plane during warm reboot. An appropriate number should be agreed on. + +## Upgrade path with warm reboot support +No clear requirement available yet. The general idea is to support warm reboot between consecutive SONiC releases. + +## Latency requirement on LibSAI/SDK warm restart +No strict requuirment on this layer yet. Probably in the order of seconds, say, 10 seconds? + +## Backward compatibility requirement on SAI/LibSAI/SDK? +Yes, Backward compatibility is mandatory for warm reboot support. + +## What is the requirment on LibSAI/SDK with regards to data plane traffic during warm reboot? Could FDB be flushed? +No packet loss at data plane for existing data flow. In general, FDB flush should be triggered by NOS instread of LibSAI/SDK. + +## What are the the principles of warm reboot support for SONiC? +One of the priciples talked about is have warm restart support at each layer/module/docker, each layer/module/docker is self contained as to warm restart. 
+ +# References +* [SAI Warmboot spec](https://github.com/opencomputeproject/SAI/blob/master/doc/SAI_Proposal_Warmboot.docx?raw=true) + diff --git a/doc/warm-reboot/SONiC_Warmboot_v0.4.docx b/doc/warm-reboot/SONiC_Warmboot_v0.4.docx deleted file mode 100644 index ff7eee6aa6..0000000000 Binary files a/doc/warm-reboot/SONiC_Warmboot_v0.4.docx and /dev/null differ diff --git a/doc/warm-reboot/code_implementation.md b/doc/warm-reboot/code_implementation.md new file mode 100644 index 0000000000..4bd451d284 --- /dev/null +++ b/doc/warm-reboot/code_implementation.md @@ -0,0 +1,222 @@ + +# SWSS docker warm restart code reference + +Table of Contents +================= + +* [SWSS docker warm restart code reference](#swss-docker-warm-restart-code-reference) +* [Table of Contents](#table-of-contents) +* [Basic testing](#basic-testing) + * [enable/disable swss warm upgrade](#enabledisable-swss-warm-upgrade) + * [swss docker upgrade](#swss-docker-upgrade) + * [Virtual switch test](#virtual-switch-test) +* [Separate syncd and swss services, and warm start configDB support](#separate-syncd-and-swss-services-and-warm-start-configdb-support) +* [swss\-flushdb script support](#swss-flushdb-script-support) +* [swss data restore](#swss-data-restore) +* [RedisClient hmset and hgetallordered library support\.](#redisclient--hmset-and-hgetallordered-library-support) +* [libsari redis API idempotency support](#libsari-redis-api-idempotency-support) + +** Note: This document is temporary. The code implementations are for reference only. Active development and testing is in progress ** + +# Basic testing + +## enable/disable swss warm upgrade +``` +root@sonic:~# config warm_restart enable swss + +root@sonic:~# show warm_restart +WARM_RESTART teamd enable false +WARM_RESTART swss neighbor_timer 5 +WARM_RESTART swss enable true +WARM_RESTART system enable false +``` + +## swss docker upgrade + +`sonic_installer upgrade_docker` command line may be used to upgrade swss docker to a new docker image without affect data plane traffic. + +``` +sonic_installer upgrade_docker --help +Usage: sonic_installer upgrade_docker [OPTIONS] URL + + Upgrade docker image from local binary or URL + +Options: + -y, --yes + --cleanup_image Clean up old docker images + --help Show this message and exit. 
+``` + +Upgrade example: +``` +root@sonic:~# docker images +REPOSITORY TAG IMAGE ID CREATED SIZE +docker-orchagent-brcm latest e322b31c1ad6 21 hours ago 296.9 MB +docker-orchagent-brcm test_v02 e322b31c1ad6 21 hours ago 296.9 MB +docker-fpm-quagga latest afcd2237e510 2 days ago 303.4 MB +docker-fpm-quagga warm-reboot.0-dirty-20180709.225823 afcd2237e510 2 days ago 303.4 MB +docker-teamd latest 54296354b8a1 2 days ago 296.6 MB +docker-teamd warm-reboot.0-dirty-20180709.225823 54296354b8a1 2 days ago 296.6 MB +docker-syncd-brcm latest 293b435f8f48 2 days ago 375.9 MB +docker-syncd-brcm warm-reboot.0-dirty-20180709.225823 293b435f8f48 2 days ago 375.9 MB +docker-snmp-sv2 latest 5a9965d51534 2 days ago 330.6 MB +docker-snmp-sv2 warm-reboot.0-dirty-20180709.225823 5a9965d51534 2 days ago 330.6 MB +docker-lldp-sv2 latest 7ed919240fb9 2 days ago 306 MB +docker-lldp-sv2 warm-reboot.0-dirty-20180709.225823 7ed919240fb9 2 days ago 306 MB +docker-platform-monitor latest cfb9af72dc57 2 days ago 317 MB +docker-platform-monitor warm-reboot.0-dirty-20180709.225823 cfb9af72dc57 2 days ago 317 MB +docker-database latest c61388ef5d4b 2 days ago 291.8 MB +docker-database warm-reboot.0-dirty-20180709.225823 c61388ef5d4b 2 days ago 291.8 MB +docker-dhcp-relay latest cf68d734ec21 2 days ago 293.1 MB +docker-dhcp-relay warm-reboot.0-dirty-20180709.225823 cf68d734ec21 2 days ago 293.1 MB +docker-router-advertiser latest 8e69dcfe794d 2 days ago 289.4 MB +docker-router-advertiser warm-reboot.0-dirty-20180709.225823 8e69dcfe794d 2 days ago 289.4 MB +root@sonic:~# sonic_installer upgrade_docker swss test_v03 docker-orchagent-brcm_v03.gz --cleanup_image +New docker image will be installed, continue? [y/N]: y +Command: systemctl stop swss + +Command: docker rm swss +swss + +Command: docker rmi docker-orchagent-brcm:latest +Untagged: docker-orchagent-brcm:latest + +Command: docker load < ./docker-orchagent-brcm_v03.gz + +Command: docker tag docker-orchagent-brcm:latest docker-orchagent-brcm:test_v03 + +Command: systemctl restart swss + +set(['e322b31c1ad6']) +Command: docker rmi -f e322b31c1ad6 +Untagged: docker-orchagent-brcm:test_v02 +Deleted: sha256:e322b31c1ad6e12b27a9683fc64e0ad9a63484127d06cddf277923e9d7c37419 +Deleted: sha256:bcf19d6c92edd7bcf63b529f341008532b9272d30de2f206ad47728d3393cad4 + +Command: sleep 5 + +Done +root@sonic:~# docker images +REPOSITORY TAG IMAGE ID CREATED SIZE +docker-orchagent-brcm latest 790e060184bb 21 hours ago 296.9 MB +docker-orchagent-brcm test_v03 790e060184bb 21 hours ago 296.9 MB +docker-fpm-quagga latest afcd2237e510 2 days ago 303.4 MB +docker-fpm-quagga warm-reboot.0-dirty-20180709.225823 afcd2237e510 2 days ago 303.4 MB +docker-teamd latest 54296354b8a1 2 days ago 296.6 MB +docker-teamd warm-reboot.0-dirty-20180709.225823 54296354b8a1 2 days ago 296.6 MB +docker-syncd-brcm latest 293b435f8f48 2 days ago 375.9 MB +docker-syncd-brcm warm-reboot.0-dirty-20180709.225823 293b435f8f48 2 days ago 375.9 MB +docker-snmp-sv2 latest 5a9965d51534 2 days ago 330.6 MB +docker-snmp-sv2 warm-reboot.0-dirty-20180709.225823 5a9965d51534 2 days ago 330.6 MB +docker-lldp-sv2 latest 7ed919240fb9 2 days ago 306 MB +docker-lldp-sv2 warm-reboot.0-dirty-20180709.225823 7ed919240fb9 2 days ago 306 MB +docker-platform-monitor latest cfb9af72dc57 2 days ago 317 MB +docker-platform-monitor warm-reboot.0-dirty-20180709.225823 cfb9af72dc57 2 days ago 317 MB +docker-database latest c61388ef5d4b 2 days ago 291.8 MB +docker-database warm-reboot.0-dirty-20180709.225823 c61388ef5d4b 2 days ago 291.8 MB 
+docker-dhcp-relay latest cf68d734ec21 2 days ago 293.1 MB +docker-dhcp-relay warm-reboot.0-dirty-20180709.225823 cf68d734ec21 2 days ago 293.1 MB +docker-router-advertiser latest 8e69dcfe794d 2 days ago 289.4 MB +docker-router-advertiser warm-reboot.0-dirty-20180709.225823 8e69dcfe794d 2 days ago 289.4 MB + +``` + + + +`systemctl restart swss` or ` sonic_installer upgrade_docker` won't affect data plane traffic and new provisioning works well. + +``` +127.0.0.1:6379> keys WAR* +1) "WARM_RESTART_TABLE:portsyncd" +2) "WARM_RESTART_TABLE:neighsyncd" +3) "WARM_RESTART_TABLE:vlanmgrd" +4) "WARM_RESTART_TABLE:orchagent" + +127.0.0.1:6379> hgetall "WARM_RESTART_TABLE:orchagent" +1) "restart_count" +2) "4" +127.0.0.1:6379> hgetall "WARM_RESTART_TABLE:neighsyncd" +1) "restart_count" +2) "4 + +``` + +## Virtual switch test + +``` +jipan@sonic-build:~/igbpatch/vs/sonic-buildimage/src/sonic-swss/tests$ sudo pytest -v --dvsname=vs --notempview +[sudo] password for jipan: +====================================================================== test session starts ======================================================================= +platform linux2 -- Python 2.7.12, pytest-3.3.0, py-1.5.2, pluggy-0.6.0 -- /usr/bin/python +cachedir: .cache +rootdir: /home/jipan/igbpatch/vs/sonic-buildimage/src/sonic-swss/tests, inifile: +collected 45 items + +test_acl.py::TestAcl::test_AclTableCreation PASSED [ 2%] +test_acl.py::TestAcl::test_AclRuleL4SrcPort PASSED [ 4%] +test_acl.py::TestAcl::test_AclTableDeletion PASSED [ 6%] +test_acl.py::TestAcl::test_V6AclTableCreation PASSED [ 8%] +test_acl.py::TestAcl::test_V6AclRuleIPv6Any PASSED [ 11%] +test_acl.py::TestAcl::test_V6AclRuleIPv6AnyDrop PASSED [ 13%] +test_acl.py::TestAcl::test_V6AclRuleIpProtocol PASSED [ 15%] +test_acl.py::TestAcl::test_V6AclRuleSrcIPv6 PASSED [ 17%] +test_acl.py::TestAcl::test_V6AclRuleDstIPv6 PASSED [ 20%] +test_acl.py::TestAcl::test_V6AclRuleL4SrcPort PASSED [ 22%] +test_acl.py::TestAcl::test_V6AclRuleL4DstPort PASSED [ 24%] +test_acl.py::TestAcl::test_V6AclRuleTCPFlags PASSED [ 26%] +test_acl.py::TestAcl::test_V6AclRuleL4SrcPortRange PASSED [ 28%] +test_acl.py::TestAcl::test_V6AclRuleL4DstPortRange PASSED [ 31%] +test_acl.py::TestAcl::test_V6AclTableDeletion PASSED [ 33%] +test_acl.py::TestAcl::test_InsertAclRuleBetweenPriorities PASSED [ 35%] +test_acl.py::TestAcl::test_AclTableCreationOnLAGMember PASSED [ 37%] +test_acl.py::TestAcl::test_AclTableCreationOnLAG PASSED [ 40%] +test_acl.py::TestAcl::test_AclTableCreationBeforeLAG PASSED [ 42%] +test_crm.py::test_CrmFdbEntry PASSED [ 44%] +test_crm.py::test_CrmIpv4Route PASSED [ 46%] +test_crm.py::test_CrmIpv6Route PASSED [ 48%] +test_crm.py::test_CrmIpv4Nexthop PASSED [ 51%] +test_crm.py::test_CrmIpv6Nexthop PASSED [ 53%] +test_crm.py::test_CrmIpv4Neighbor PASSED [ 55%] +test_crm.py::test_CrmIpv6Neighbor PASSED [ 57%] +test_crm.py::test_CrmNexthopGroup PASSED [ 60%] +test_crm.py::test_CrmNexthopGroupMember PASSED [ 62%] +test_crm.py::test_CrmAcl PASSED [ 64%] +test_dirbcast.py::test_DirectedBroadcast PASSED [ 66%] +test_fdb.py::test_FDBAddedAfterMemberCreated PASSED [ 68%] +test_interface.py::test_InterfaceIpChange PASSED [ 71%] +test_nhg.py::test_route_nhg PASSED [ 73%] +test_port.py::test_PortNotification PASSED [ 75%] +test_portchannel.py::test_PortChannel PASSED [ 77%] +test_route.py::test_RouteAdd PASSED [ 80%] +test_setro.py::test_SetReadOnlyAttribute PASSED [ 82%] +test_speed.py::TestSpeedSet::test_SpeedAndBufferSet PASSED [ 84%] +test_vlan.py::test_VlanMemberCreation PASSED [ 86%] 
+test_vrf.py::test_VRFOrch_Comprehensive PASSED [ 88%] +test_vrf.py::test_VRFOrch PASSED [ 91%] +test_vrf.py::test_VRFOrch_Update PASSED [ 93%] +test_warm_reboot.py::test_swss_warm_restore PASSED [ 95%] +test_warm_reboot.py::test_swss_port_state_syncup PASSED [ 97%] +test_warm_reboot.py::test_swss_fdb_syncup_and_crm PASSED [100%] + +================================================================== 45 passed in 630.11 seconds =================================================================== +jipan@sonic-build:~/igbpatch/vs/sonic-buildimage/src/sonic-swss/tests$ +``` + +# Separate syncd and swss services, and warm start configDB support +https://github.com/Azure/sonic-buildimage/compare/master...jipanyang:warm-reboot + +# swss-flushdb script support +https://github.com/Azure/sonic-utilities/compare/master...jipanyang:swss-warm-restart + +# swss data restore +https://github.com/Azure/sonic-swss/compare/master...jipanyang:idempotent + +# RedisClient hmset and hgetallordered library support. +https://github.com/Azure/sonic-swss-common/compare/master...jipanyang:idempotent + +# libsari redis API idempotency support +https://github.com/Azure/sonic-sairedis/compare/master...jipanyang:idempotent + + + + diff --git a/doc/warm-reboot/img/libsairedis_idempotent.png b/doc/warm-reboot/img/libsairedis_idempotent.png new file mode 100644 index 0000000000..b4efe19ec4 Binary files /dev/null and b/doc/warm-reboot/img/libsairedis_idempotent.png differ diff --git a/doc/warm-reboot/img/swss_data_source.png b/doc/warm-reboot/img/swss_data_source.png new file mode 100644 index 0000000000..3737a4065e Binary files /dev/null and b/doc/warm-reboot/img/swss_data_source.png differ diff --git a/doc/warm-reboot/img/warm_start_layers.png b/doc/warm-reboot/img/warm_start_layers.png new file mode 100644 index 0000000000..afb6a6c16c Binary files /dev/null and b/doc/warm-reboot/img/warm_start_layers.png differ diff --git a/doc/warm-reboot/sai_redis_api_idempotence.md b/doc/warm-reboot/sai_redis_api_idempotence.md new file mode 100644 index 0000000000..69f3b7235b --- /dev/null +++ b/doc/warm-reboot/sai_redis_api_idempotence.md @@ -0,0 +1,190 @@ + +# SONiC libsairedis API idempotence support + +Table of Contents +================= + +* [Overview](#overview) +* [Libsairedis API operations](#libsairedis-api-operations) +* [Cache of operation data](#cache-of-operation-data) + * [Attributes to OID mapping](#attributes-to-oid-mapping) + * [Current attributes to OID mapping](#current-attributes-to-oid-mapping) + * [Default attributes to OID mapping](#default-attributes-to-oid-mapping) + * [OID to attributes mapping](#oid-to-attributes-mapping) + * [Default objects mapping](#default-objects-mapping) +* [Performance tunning](#performance-tunning) + * [In memory cache of the mapping](#in-memory-cache-of-the-mapping) + * [Optimize current producer/consumer channel](#optimize-current-producerconsumer-channel) + * [Multiple redis instance support](#multiple-redis-instance-support) + * [Serialization/deserialization](#serializationdeserialization) + + +# Overview +Libsairedis API interface is used by orchagent to interact with syncd for ASIC programing and data retrieval. To support warm restart of swss docker, making the libsairedis API call idempotent will greatly facilitate the state restore of orchagent. + + +# Libsairedis API operations +There are four types of operations: create, set, remove and get. 
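+
+For objects keyed by sai_object_id_t these operations land in generic entry points inside libsairedis. The simplified prototypes below are only a sketch of where the idempotency handling described in this document hooks in; the real signatures in sonic-sairedis carry additional context:
+
+```
+/* Simplified sketch of the generic libsairedis entry points.
+ * Not the exact sonic-sairedis signatures. */
+sai_status_t redis_generic_create(sai_object_type_t object_type,
+                                  sai_object_id_t *object_id,   /* OID handed back to orchagent */
+                                  sai_object_id_t switch_id,
+                                  uint32_t attr_count,
+                                  const sai_attribute_t *attr_list);
+
+sai_status_t redis_generic_set(sai_object_type_t object_type,
+                               sai_object_id_t object_id,
+                               const sai_attribute_t *attr);
+
+sai_status_t redis_generic_remove(sai_object_type_t object_type,
+                                  sai_object_id_t object_id);
+
+sai_status_t redis_generic_get(sai_object_type_t object_type,
+                               sai_object_id_t object_id,
+                               uint32_t attr_count,
+                               sai_attribute_t *attr_list);
+```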
+ +Ideally the operations at libsai/sdk layer are idempotent, orchagent can make the same call repeatedlly without disrupting data plane services. While that will put strict requirement on libsai/sdk implementation and may not be possible in the near future, we'd like to restrain any duplicate create, set and remove operation from being pushed down to syncd/libsai/ASIC. In the perspective of orchagent, the libsairedis API is idempotent. Get operation towards libsai/ASIC is treated as harmless and duplicate get operation may pass through. + +Below is an example of idempotent libsairedis create operation: + +![idempotent libsairedis API](img/libsairedis_idempotent.png) + + +# Cache of operation data +In order to avoid propagating duplicate create/set/remove operations down to syncd, libsairedis needs to cache the very first operation data and filter out the following duplicates. Also to support warm restart, the cache is to be saved in redis db. + +There are five types of cache to support idempotent operations at libsairedis layer. + +## Attributes to OID mapping +Each sai object has a key to be uniquely identified. For objects which are created by orchagent and have key type of sai_object_id_t, we need to ensure that same set of attributes will yield the same OID at libsairedis create API call. + +### Current attributes to OID mapping +``` +#define ATTR2OID_PREFIX ("ATTR2OID_" + (g_objectOwner)) +``` +ATTR2OID_PREFIX is the prefix for mapping from attributes to OID. When a create request for oject key type of sai_object_id_t reaches libsairedis layer, the attributes provided will put together in order and a lookup key is formed using ATTR2OID_PREFIX as prefix. If there is an entry existing for the lookup key, corresponding OID value is returned directly without going down further. + +Note that g_objectOwner is a string value set by some application to distinguish objects which are created with same attributes. + +Ex. for underlay and overlay router interfaces created by orchagent, both of them use the same loopback router interface and virtural router as attributes, different owners are set so they may retrieve the original OIDs without confusing each other. + +In the example below, the owner of underlay router interface is "UNDERLAY_INTERFACE_" and it has OID of 0x6000000000939. While for overlay router interface, it is owned by "OVERLAY_INTERFACE_" and its OID is 0x6000000000996. 
+ +``` +127.0.0.1:6379[7]> keys ATTR2OID_*LAY_INTERFACE_* +1) "ATTR2OID_OVERLAY_INTERFACE_SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x300000000003a" +2) "ATTR2OID_UNDERLAY_INTERFACE_SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x300000000003a" + +127.0.0.1:6379[7]> +127.0.0.1:6379[7]> hgetall "ATTR2OID_OVERLAY_INTERFACE_SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x300000000003a" +1) "SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000996" +2) "NULL" +127.0.0.1:6379[7]> hgetall "ATTR2OID_UNDERLAY_INTERFACE_SAI_ROUTER_INTERFACE_ATTR_TYPE=SAI_ROUTER_INTERFACE_TYPE_LOOPBACK|SAI_ROUTER_INTERFACE_ATTR_VIRTUAL_ROUTER_ID=oid:0x300000000003a" +1) "SAI_OBJECT_TYPE_ROUTER_INTERFACE:oid:0x6000000000939" +2) "NULL" +``` + +### Default attributes to OID mapping +``` +#define DEFAULT_ATTR2OID_PREFIX ("DEFAULT_ATTR2OID_" + (g_objectOwner)) +#define DEFAULT_OID2ATTR_PREFIX "DEFAULT_OID2ATTR_" +``` +After an object with key type of sai_object_id_t is created, the attributes for it may be changed later. The current attributes to OID mapping will be updated to reflect this changes. For warm restart, the same original default attributes list may be used when the object is created again. To be able to handle such cases, whenever a new SET request which will cause the orginial default attributes list changed, a separate default attributes to OID mapping will be created. This happens up to one time for each object, any more attributes SET change will just update the current attributes to OID mapping, this is ensured by checking the existence "DEFAULT_OID2ATTR_" mapping using OID as lookup key. + +Take hostif object as example, hostif Ethernet18 was originally created with attributes SAI_HOSTIF_ATTR_NAME, SAI_HOSTIF_ATTR_OBJ_ID and SAI_HOSTIF_ATTR_TYPE. After vlan provisioning on Ethernet18, SAI_HOSTIF_ATTR_VLAN_TAG is set, a new default attributes to OID mapping is created to save the original mapping while current attributes to OID mapping will be updated to include SAI_HOSTIF_ATTR_VLAN_TAG attribute. + +During warm restart, if the lookup failed with the current attributes to OID mapping, default attributes to OID mapping will be checked then. + +``` +127.0.0.1:6379[7]> hgetall "DEFAULT_ATTR2OID_SAI_HOSTIF_ATTR_NAME=Ethernet18|SAI_HOSTIF_ATTR_OBJ_ID=oid:0x1000000000014|SAI_HOSTIF_ATTR_TYPE=SAI_HOSTIF_TYPE_NETDEV" +1) "SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000000952" +2) "NULL" + +127.0.0.1:6379[7]> hgetall DEFAULT_OID2ATTR_SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000000952 +1) "SAI_HOSTIF_ATTR_NAME" +2) "Ethernet18" +3) "SAI_HOSTIF_ATTR_OBJ_ID" +4) "oid:0x1000000000014" +5) "SAI_HOSTIF_ATTR_TYPE" +6) "SAI_HOSTIF_TYPE_NETDEV" + +127.0.0.1:6379[7]> hgetall "ATTR2OID_SAI_HOSTIF_ATTR_NAME=Ethernet18|SAI_HOSTIF_ATTR_OBJ_ID=oid:0x1000000000014|SAI_HOSTIF_ATTR_TYPE=SAI_HOSTIF_TYPE_NETDEV|SAI_HOSTIF_ATTR_VLAN_TAG=SAI_HOSTIF_VLAN_TAG_KEEP" +1) "SAI_OBJECT_TYPE_HOSTIF:oid:0xd000000000952" +2) "NULL" +``` + +## OID to attributes mapping +``` +#define OID2ATTR_PREFIX "OID2ATTR_" +``` +For SET and REMOVE operations, the OID to attributes mapping is used to find the existing data for the object. If no change is found, then it is a duplicate operation, just return success. + +Objects with key type of non sai_object_id_t like route entry, neighbor and fdb only have entries in this mapping. 
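+
+A minimal sketch of the duplicate-SET filter built on this mapping is shown below. Helper names such as `serializeKey()`, `serializeAttrValue()`, `g_restoreDb` and `forward_set_to_syncd()` are placeholders for whatever serialization and redis access the implementation actually uses, not real sonic-sairedis names:
+
+```
+// Sketch only: filter a duplicate SET by consulting the OID2ATTR_ hash in the
+// restore DB before forwarding the request to syncd/libsai/ASIC.
+sai_status_t redis_generic_set(sai_object_type_t object_type,
+                               sai_object_id_t object_id,
+                               const sai_attribute_t *attr)
+{
+    std::string key   = std::string(OID2ATTR_PREFIX) + serializeKey(object_type, object_id);
+    std::string field = serializeAttrId(object_type, attr->id);
+    std::string value = serializeAttrValue(object_type, attr);
+
+    std::string cached;
+    if (g_restoreDb->hget(key, field, cached) && cached == value)
+    {
+        // Duplicate SET: the attribute already holds the requested value,
+        // nothing is pushed further down.
+        return SAI_STATUS_SUCCESS;
+    }
+
+    // New or changed attribute: update the cache first, then forward.
+    g_restoreDb->hset(key, field, value);
+    return forward_set_to_syncd(object_type, object_id, attr);
+}
+```
+
+REMOVE can be filtered the same way: if the OID2ATTR_ entry is already gone, the remove has been applied before and is not forwarded again.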
+ +## Default objects mapping +There are default objects created by libsai/SDK. Orchagent may change the attributes for them. The default objects mapping will save the mapping from object key to latest attributes. During warm restart, the same attributes SET on default objects return directly. +``` +/* + * For objects created by asic SDK/Libsai and changed by orchagent. + */ +#define DEFAULT_OBJ_PREFIX "DEFAULT_OBJ_" +``` + +# Performance tunning + +(TODO:) + +## In memory cache of the mapping +To speed up the lookup performance of the mapping tables, an in memory cache may be implemented to avoid redis hget operation. + +## Optimize current producer/consumer channel + +Based on the redis benchmark data shown below, with one redis client, the number of write operations per second is around 10k. +Current implementation of multiple LPUSH/LRANGE/LRIM for key, op, value may be combined into one LPUSH/LRANGE/LRIM. +It is expected that with this change, the performance of libsairedis API may be better though redis pipeline already reduced the cost of single operation. + +``` +root@sonic:/# /etc/sonic/redis-benchmark -q -n 100000 -c 1 +PING_INLINE: 10279.61 requests per second +PING_BULK: 10702.05 requests per second +SET: 9086.78 requests per second +GET: 10779.35 requests per second +INCR: 9184.42 requests per second +LPUSH: 9257.54 requests per second +RPUSH: 9433.07 requests per second +LPOP: 8948.55 requests per second +RPOP: 9074.41 requests per second +SADD: 10519.67 requests per second +SPOP: 11586.14 requests per second +LPUSH (needed to benchmark LRANGE): 9450.90 requests per second +LRANGE_100 (first 100 elements): 4533.91 requests per second +LRANGE_300 (first 300 elements): 2608.14 +LRANGE_300 (first 300 elements): 2584.91 requests per second +LRANGE_500 (first 450 elements): 1850.24 requests per second +LRANGE_600 (first 600 elements): 1493.65 requests per second +MSET (10 keys): 4951.97 requests per second + +root@sonic:/# lscpu +Architecture: x86_64 +CPU op-mode(s): 32-bit, 64-bit +Byte Order: Little Endian +CPU(s): 4 +On-line CPU(s) list: 0-3 +Thread(s) per core: 1 +Core(s) per socket: 4 +Socket(s): 1 +NUMA node(s): 1 +Vendor ID: GenuineIntel +CPU family: 6 +Model: 77 +Model name: Intel(R) Atom(TM) CPU C2558 @ 2.40GHz +Stepping: 8 +CPU MHz: 1200.000 +CPU max MHz: 2400.0000 +CPU min MHz: 1200.0000 +BogoMIPS: 4787.75 +Virtualization: VT-x +L1d cache: 24K +L1i cache: 32K +L2 cache: 1024K +NUMA node0 CPU(s): 0-3 +``` + +## Multiple redis instance support +Considering the benchmark data, single redis instance may not sustain the extreme scenario like route flapping. For DB like countersDB, it should be better to run in a separate redis instance. + +All the mapping for libsairedis API idempotence will be located in restoreDB. If multiple redis instances is supported and in memory cache is ready, we may consider making the mapping DB write asynchronous so separate redis instance could be better utilized for this use case. +``` +#define RESTORE_DB 7 +``` + +## Serialization/deserialization +It looks a lot of resource has been consumed by the serialization/deserialization processing. There might be room for optimization in this area. 
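+
+Relating to the in-memory cache idea above, the following is a self-contained sketch of a write-through lookup cache that could sit in front of the ATTR2OID_/OID2ATTR_ redis hashes; the redis access is abstracted behind a callback so no particular client API is assumed:
+
+```
+#include <functional>
+#include <string>
+#include <unordered_map>
+#include <utility>
+
+// Write-through cache for the mapping tables: the first lookup for a key goes
+// to redis via the fetch callback, later lookups stay in process memory.
+class MappingCache
+{
+public:
+    using Fetch = std::function<bool(const std::string &key, std::string &value)>;
+
+    explicit MappingCache(Fetch fetch) : m_fetch(std::move(fetch)) {}
+
+    bool get(const std::string &key, std::string &value)
+    {
+        auto it = m_cache.find(key);
+        if (it != m_cache.end())
+        {
+            value = it->second;
+            return true;              // hit: no redis round trip
+        }
+        if (m_fetch(key, value))      // miss: read from redis and remember
+        {
+            m_cache.emplace(key, value);
+            return true;
+        }
+        return false;
+    }
+
+    void put(const std::string &key, const std::string &value)
+    {
+        m_cache[key] = value;         // keep the cache in sync on every write
+    }
+
+private:
+    Fetch m_fetch;
+    std::unordered_map<std::string, std::string> m_cache;
+};
+```
+
+Writes would still go to redis synchronously (write-through), so a process crash never leaves the redis copy behind the in-memory one.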
+ + + + + diff --git a/doc/warm-reboot/swss_warm_restart.md b/doc/warm-reboot/swss_warm_restart.md new file mode 100644 index 0000000000..7864a955c6 --- /dev/null +++ b/doc/warm-reboot/swss_warm_restart.md @@ -0,0 +1,164 @@ + +# SONiC SWSS docker warm restart + +Table of Contents +================= + +* [Overview](#overview) +* [Input Data for swss](#input-data-for-swss) + * [configDB](#configdb) + * [Linux Kernel](#linux-kernel) + * [Teamd and teamsyncd](#teamd-and-teamsyncd) + * [BGP and fpmsyncd](#bgp-and-fpmsyncd) + * [JSON files](#json-files) + * [Syncd](#syncd) +* [SWSS state restore](#swss-state-restore) + * [PORT, VLAN and INTF](#port-vlan-and-intf) + * [ARP, LAG and route data in orchagent](#arp-lag-and-route-data-in-orchagent) + * [QoS, Buffer, CRM, PFC WD and ACL data in orchagent](#qos-buffer-crm-pfc-wd-and-acl-data-in-orchagent) + * [COPP, Tunnel and Mirror data in orchagent](#copp-tunnel-and-mirror-data-in-orchagent) + * [FDB and port state in orchagent](#fdb-and-port-state-in-orchagent) + * [OID for switch default objects in orchagent\.](#oid-for-switch-default-objects-in-orchagent) +* [SWSS state consistency validation](#swss-state-consistency-validation) + * [Pre\-restart state validation](#pre-restart-state-validation) + * [Post\-restore state validation](#post-restore-state-validation) +* [SWSS state sync up](#swss-state-sync-up) + * [ARP sync up](#arp-sync-up) + * [port state sync up](#port-state-sync-up) + * [FDB sync up](#fdb-sync-up) + * [LAG sync up](#lag-sync-up) + * [Route sync up](#route-sync-up) +* [SWSS docker warm restart](#swss-docker-warm-restart) + * [Separate syncd docker from swss docker service](#separate-syncd-docker-from-swss-docker-service) + * [Manual switch between warm and cold start](#manual-switch-between-warm-and-cold-start) + * [Automatic fallback to system cold restart upon warm start failure](#automatic-fallback-to-system-cold-restart-upon-warm-start-failure) +* [Open issues](#open-issues) + * [How to handle those dockers which has dependency on swss docker in systemd service?](#how-to-handle-those-dockers-which-has-dependency-on-swss-docker-in-systemd-service) + * [APPDB data restore during system warm restart](#appdb-data-restore-during-system-warm-restart) + * [Should stateDB be flushed for warm restart?](#should-statedb-be-flushed-for-warm-restart) + +# Overview +The goal of SONiC swss docker warm restart is to be able restart and upgrade swss docker software without impacting the data plane. + +The swss docker warm restart should restore all necessary control plane data and get synchronized with current network and switch states. As the single point of interface with lower layer syncd, the state processing of orchagent is the key for swss docker warm restart. + + +# Input Data for swss + +The state of swss is driven by data from multiple sources. + +![swss data source](img/swss_data_source.png) + + +## configDB +configDB and together with fixed config files port_config.ini and pg_lookup.ini form the basic data source for port, VLAN, VLAN interface, buffer, QoS, CRM, PFC watchdog and ACL. Certain data will be consumed by orchagent directly like QoS, CRM, and ACL configurations. + +## Linux Kernel +The initial port config is parsed by portsyncd and passed down to orchagent, that again will trigger SAI api call to ask ASIC SDK to create host netdev interface on Linux. 
Portsyncd watches the kernel RTNLGRP_LINK netlink notifications about host interface creation, then signals PortInitDone to Orchagent via APPDB so that other processing in Orchagent may proceed.
+
+Intfsyncd listens on the RTNLGRP_IPV4_IFADDR and RTNLGRP_IPV6_IFADDR netlink groups and populates the appDB INTF_TABLE with the data.
+
+Neighsyncd gets ARP netlink notifications from the Linux kernel; the data is used by Orchagent to populate neighbor and nexthop objects.
+
+Teamsyncd runs in the teamd docker; it listens both on the RTNLGRP_LINK netlink group for LAG-related notifications and on teamd LAG member state messages. The data is used to program the APPDB LAG and LAG member tables.
+
+## Teamd and teamsyncd
+Teamd is the ultimate source of LAG-related state, including the kernel LAG messages. Teamsyncd takes instructions from both teamd and the Linux kernel.
+
+## BGP and fpmsyncd
+
+All the APPDB ROUTE_TABLE data comes from the BGP docker. Fpmsyncd, which programs the APPDB ROUTE_TABLE directly, runs in the BGP docker.
+
+## JSON files
+COPP, tunnel and mirror related configurations are loaded from JSON files into the corresponding APPDB tables.
+
+## Syncd
+FDB and port state notifications come from the ASIC; syncd relays the data to orchagent.
+Orchagent also gets information about the objects created by the ASIC by default, e.g. the port list, hw lanes and queues.
+
+# SWSS state restore
+During swss warm restart, the state of swss should be restored. It is assumed that all data in APPDB has either been restored or been kept intact.
+
+## PORT, VLAN and INTF
+Portsyncd has some internal state processing around PortConfigDone and PortInitDone. Port data is not restored from APPDB directly; instead, port config and a Linux RTM_GETLINK dump are used to populate port data again.
+
+VLAN also needs to set up the Linux host environment for the VLAN-aware bridge. For swss docker restart, the previous bridge setup is kept by Linux, so vlanmgrd should avoid deleting and re-creating the bridge.
+
+For APP INTF, netlink.dumpRequest(RTM_GETADDR) in intfsyncd will populate interface IPs from Linux.
+
+## ARP, LAG and route data in orchagent
+This data will be restored from APPDB directly by orchagent.
+
+## QoS, Buffer, CRM, PFC WD and ACL data in orchagent
+Orchagent fetches the existing data from configDB at startup.
+
+## COPP, Tunnel and Mirror data in orchagent
+These configurations are loaded into APPDB from JSON files and then received by orchagent at startup.
+
+## FDB and port state in orchagent
+Both the FDB and port state data are restored from APPDB by orchagent.
+
+## OID for switch default objects in orchagent.
+Orchagent relies on the SAI get API to fetch the OID data from syncd for switch default objects.
+
+# SWSS state consistency validation
+After swss state restore, the state of each swss process, especially orchagent, should be consistent with the state before restart.
+For now, it is assumed that there is no configDB change during the whole warm restart window. The state of orchagent is then mainly driven by APPDB data changes. The following basic pre-restart and post-restore validation could be applied.
+
+## Pre-restart state validation
+A "restart prepare" request is sent to orchagent. If there is no pending data in the SyncMap (m_toSync) of any application consumer in orchagent, OrchDaemon sets a flag to stop processing any further APPDB data changes and returns success for the "restart prepare" request. Otherwise, failure should be returned to indicate that there is an unfulfilled dependency in orchagent and it is not ready to do warm restart.
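+
+A rough sketch of this check is shown below; class, member and method names are illustrative, not the actual orchagent code:
+
+```
+// OrchDaemon only acknowledges "restart prepare" when every consumer has
+// drained its pending SyncMap (m_toSync); otherwise the caller may retry or
+// fall back to cold restart.
+bool OrchDaemon::prepareWarmRestart()
+{
+    for (Orch *orch : m_orchList)
+    {
+        for (Consumer *consumer : orch->getConsumers())
+        {
+            if (!consumer->m_toSync.empty())
+            {
+                // Unfulfilled dependency in orchagent: not safe to freeze state.
+                return false;
+            }
+        }
+    }
+
+    // Stop processing further APPDB changes so the saved state stays consistent.
+    m_warmRestartFrozen = true;
+    return true;
+}
+```
+
+If the check fails, the request can simply be retried after a short delay, or the operator can fall back to cold restart.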
+ +The existing ProducerStateTable/ConsumerStateTable implementation should be updated so that only consumer side modify the actual table. + +## Post-restore state validation +After swss state restore, same as that in pre-restart phase, no pending data in SyncMap (m_toSync) of all application consumers should exist. This should be done before swss state sync up. + + *More exhaustive validation beyond this is to be designed and implemented.* + +# SWSS state sync up +During the restart window, dynamic data like ARP, port state, FDB, LAG and route may be changed. Orchagent needs to sync up with the latest network state. + +It is assumed that no configDB data change happened. + +## ARP sync up +At startup of neighsyncd, netlink.dumpRequest(RTM_GETNEIGH) has been called. But there may be stale entries in APPDB NEIGH_TABLE and they should be removed. + +For the warm restart of whole system, the Linux arp data is gone too. All the neighbors should be pinged again to get their mac info. + +## port state sync up +Physical state of port may change during swss restart, orchagent should fetch the latest port state or syncd should queue all the unproccessed port state notification for orchagent to pick up. + +## FDB sync up +FDB entries may change during swss restart, orchagent should fetch the latest FDB entries and remove stale ones, or syncd should queue all the unproccessed FDB notification for orchagent to pick up. + +## LAG sync up +Upon swss restart, teamsyncd should have queued all the LAG changes in APPDB, orchagent may process those changes after restart. In case teamd docker restarted too, it is responsible for getting LAG_TABLE and LAG_MEMBER_TABLE up to date in APPDB. + +## Route sync up +Upon swss restart, fpmsyncd should have queued all the route changes in APPDB, orchagent may process those changes after restart. In case bgp docker restarted too, it is responsible for getting ROUTE_TABLE up to date in APPDB, which includes removing stale route entries and add new route entries. + +# SWSS docker warm restart +## Separate syncd docker from swss docker service +Syncd docker restart should not be triggered by swss docker restart. A systemd syncd service should be created independent of swss service. + +## Manual switch between warm and cold start +Upon detecting the existence of certain folder, ex. /etc/sonic/warm_restart, swss docker will assume warm restart has been requested. Otherwise swss will flush db and perform cold start. +A CLI and function in configDB enforcer should be implemented for enabling/disabling warm restart. + +## Automatic fallback to system cold restart upon warm start failure +Mechanism should be available to automatically fall back to system cold restart upon failure of warm start. + +# Open issues + +## How to handle those dockers which has dependency on swss docker in systemd service? +dhcp_relay, snmp and radv dockers will be restart too because of their systemd dependency on swss docker when swss docker is restarted with systemctl. + +## APPDB data restore during system warm restart +How to restore all APPDB data after system restart, is there tools available for dump and restore whole redis DB? + +## Should stateDB be flushed for warm restart? +For swss docker warm restart, keep stateDB. For system restart, since Linux host environment should be restored in this case, the stateDB needs to be flushed and populated again with Ethernet, LAG and VLAN data. + + +