[swss/syncd] race between orchagent removes RIF rate counters from DB and lua script fetching them #11621

Closed
stepanblyschak opened this issue Aug 4, 2022 · 6 comments
Labels: Issue for 202205, Triaged

Comments

@stepanblyschak
Collaborator

Description

Errors appear in the log from the RIF rate Lua script because orchagent removes RIF rates from COUNTERS DB.

Steps to reproduce the issue:

  1. Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
  2. Remove RIF
  3. Repeat until you see:
Aug  4 02:04:51.724424 r-ocelot-02 NOTICE swss#orchagent: :- cleanUpRifFromCounterDb: CleanUp interface PortChannel33 oid oid:0x6000000000823 from counter db
Aug  4 02:04:51.724465 r-ocelot-02 ERR syncd#SDK: :- guard: RedisReply catches system_error: command: *86#015#012$7#015#012EVALSHA#015#012$40#015#0125f7e8b14a9d450f29760700b318672020ca52eab#015#012$2#015#01279#015#012$19#015#012oid:0x6000000000800#015#012$19#015#012oid:0x6000000000801#015#012$19#015#012oid:0x6000000000802#015#012$19#015#012oid:0x6000000000803#015#012$19#015#012oid:0x6000000000804#015#012$19#015#012oid:0x6000000000805#015#012$19#015#012oid:0x6000000000806#015#012$19#015#012oid:0x6000000000807#015#012$19#015#012oid:0x6000000000808#015#012$19#015#012oid:0x6000000000809#015#012$19#015#012oid:0x600000000080b#015#012$19#015#012oid:0x6000000000823#015#012$19#015#012oid:0x6000000000824#015#012$19#015#012oid:0x6000000000825#015#012$19#015#012oid:0x6000000000826#015#012$19#015#012oid:0x6000000000827#015#012$19#015#012oid:0x6000000000828#015#012$19#015#012oid:0x6000000000829#015#012$19#015#012oid:0x600000000082b#015#012$19#015#012oid:0x600000000082c#015#012$19#015#012oid:0x600000000082d#015#012$19#015#012oid:0x600000000082e#015#012$19#015#012oid:0x600000000082f#015#012$19#015#012oid:0x6000000000830#015#012$19#015#012oid:0x6000000000831#015#012$19#015#012oid:0x6000000000832#015#012$19#015#012oid:0x6000000000833#015#012$19#015#012oid:0x6000000000834#015#012$19#015#012oid:0x6000000000836#015#012$19#015#012oid:0x6000000000837#015#012$19#015#012oid:0x6000000000838#015#012$19#015#012oid:0x6000000000839#015#012$19#015#012oid:0x600000000083a#015#012$19#015#012oid:0x600000000083b#015#012$19#015#012oid:0x600000000083c#015#012$19#015#012oid:0x600000000083d#015#012$19#015#012oid:0x600000000083e#015#012$19#015#012oid:0x600000000083f#015#012$19#015#012oid:0x6000000000841#015#012$19#015#012oid:0x6000000000842#015#012$19#015#012oid:0x6000000000843#015#012$19#015#012oid:0x6000000000844#015#012$19#015#012oid:0x6000000000845#015#012$19#015#012oid:0x6000000000846#015#012$19#015#012oid:0x6000000000847#015#012$19#015#012oid:0x6000000000848#015#012$19#015#012oid:0x6000000000849#015#
012$19#015#012oid:0x600000000084a#015#012$19#015#012oid:0x600000000084c#015#012$19#015#012oid:0x600000000084d#015#012$19#015#012oid:0x600000000084e#015#012$19#015#012oid:0x600000000084f#015#012$19#015#012oid:0x6000000000850#015#012$19#015#012oid:0x6000000000851#015#012$19#015#012oid:0x6000000000852#015#012$19#015#012oid:0x6000000000853#015#012$19#015#012oid:0x6000000000854#015#012$19#015#012oid:0x6000000000855#015#012$19#015#012oid:0x6000000000857#015#012$19#015#012oid:0x6000000000858#015#012$19#015#012oid:0x6000000000859#015#012$19#015#012oid:0x600000000085a#015#012$19#015#012oid:0x600000000085b#015#012$19#015#012oid:0x600000000085c#015#012$19#015#012oid:0x600000000085d#015#012$19#015#012oid:0x600000000085e#015#012$19#015#012oid:0x600000000085f#015#012$19#015#012oid:0x6000000000860#015#012$19#015#012oid:0x6000000000862#015#012$19#015#012oid:0x6000000000863#015#012$19#015#012oid:0x6000000000864#015#012$19#015#012oid:0x6000000000865#015#012$19#015#012oid:0x6000000000866#015#012$19#015#012oid:0x6000000000867#015#012$19#015#012oid:0x6000000000868#015#012$19#015#012oid:0x6000000000869#015#012$19#015#012oid:0x600000000086a#015#012$19#015#012oid:0x600000000086b#015#012$19#015#012oid:0x600000000086c#015#012$1#015#0122#015#012$8#015#012COUNTERS#015#012$4#015#0121000#015#012$2#015#012''#015#012, reason: ERR Error running script (call to f_5f7e8b14a9d450f29760700b318672020ca52eab): @user_script:48: user_script:48: attempt to perform arithmetic on local 'in_octets' (a boolean value): Input/output error
Aug  4 02:04:51.724465 r-ocelot-02 ERR syncd#SDK: :- runRedisScript: Caught exception while running Redis lua script: RedisReply catches system_error: command: *86#015#012$7#015#012EVALSHA#015#012$40#015#0125f7e8b14a9d450f29760700b318672020ca52eab#015#012$2#015#01279#015#012$19#015#012oid:0x6000000000800#015#012$19#015#012oid:0x6000000000801#015#012$19#015#012oid:0x6000000000802#015#012$19#015#012oid:0x6000000000803#015#012$19#015#012oid:0x6000000000804#015#012$19#015#012oid:0x6000000000805#015#012$19#015#012oid:0x6000000000806#015#012$19#015#012oid:0x6000000000807#015#012$19#015#012oid:0x6000000000808#015#012$19#015#012oid:0x6000000000809#015#012$19#015#012oid:0x600000000080b#015#012$19#015#012oid:0x6000000000823#015#012$19#015#012oid:0x6000000000824#015#012$19#015#012oid:0x6000000000825#015#012$19#015#012oid:0x6000000000826#015#012$19#015#012oid:0x6000000000827#015#012$19#015#012oid:0x6000000000828#015#012$19#015#012oid:0x6000000000829#015#012$19#015#012oid:0x600000000082b#015#012$19#015#012oid:0x600000000082c#015#012$19#015#012oid:0x600000000082d#015#012$19#015#012oid:0x600000000082e#015#012$19#015#012oid:0x600000000082f#015#012$19#015#012oid:0x6000000000830#015#012$19#015#012oid:0x6000000000831#015#012$19#015#012oid:0x6000000000832#015#012$19#015#012oid:0x6000000000833#015#012$19#015#012oid:0x6000000000834#015#012$19#015#012oid:0x6000000000836#015#012$19#015#012oid:0x6000000000837#015#012$19#015#012oid:0x6000000000838#015#012$19#015#012oid:0x6000000000839#015#012$19#015#012oid:0x600000000083a#015#012$19#015#012oid:0x600000000083b#015#012$19#015#012oid:0x600000000083c#015#012$19#015#012oid:0x600000000083d#015#012$19#015#012oid:0x600000000083e#015#012$19#015#012oid:0x600000000083f#015#012$19#015#012oid:0x6000000000841#015#012$19#015#012oid:0x6000000000842#015#012$19#015#012oid:0x6000000000843#015#012$19#015#012oid:0x6000000000844#015#012$19#015#012oid:0x6000000000845#015#012$19#015#012oid:0x6000000000846#015#012$19#015#012oid:0x6000000000847#015#012$19#015#012oid:
0x6000000000848#015#012$19#015#012oid:0x6000000000849#015#012$19#015#012oid:0x600000000084a#015#012$19#015#012oid:0x600000000084c#015#012$19#015#012oid:0x600000000084d#015#012$19#015#012oid:0x600000000084e#015#012$19#015#012oid:0x600000000084f#015#012$19#015#012oid:0x6000000000850#015#012$19#015#012oid:0x6000000000851#015#012$19#015#012oid:0x6000000000852#015#012$19#015#012oid:0x6000000000853#015#012$19#015#012oid:0x6000000000854#015#012$19#015#012oid:0x6000000000855#015#012$19#015#012oid:0x6000000000857#015#012$19#015#012oid:0x6000000000858#015#012$19#015#012oid:0x6000000000859#015#012$19#015#012oid:0x600000000085a#015#012$19#015#012oid:0x600000000085b#015#012$19#015#012oid:0x600000000085c#015#012$19#015#012oid:0x600000000085d#015#012$19#015#012oid:0x600000000085e#015#012$19#015#012oid:0x600000000085f#015#012$19#015#012oid:0x6000000000860#015#012$19#015#012oid:0x6000000000862#015#012$19#015#012oid:0x6000000000863#015#012$19#015#012oid:0x6000000000864#015#012$19#015#012oid:0x6000000000865#015#012$19#015#012oid:0x6000000000866#015#012$19#015#012oid:0x6000000000867#015#012$19#015#012oid:0x6000000000868#015#012$19#015#012oid:0x6000000000869#015#012$19#015#012oid:0x600000000086a#015#012$19#015#012oid:0x600000000086b#015#012$19#015#012oid:0x600000000086c#015#012$1#015#0122#015#012$8#015#012COUNTERS#015#012$4#015#0121000#015#012$2#015#012''#015#012, reason: ERR Error running script (call to f_5f7e8b14a9d450f29760700b318672020ca52eab): @user_script:48: user_script:48: attempt to perform arithmetic on local 'in_octets' (a boolean value): Input/output error: Input/output error

Describe the results you received:

The same errors as in the log excerpt under "Steps to reproduce the issue".

Describe the results you expected:

No errors

Output of show version:

SONiC Software Version: SONiC.202205.20-b1456ee1c_Internal
Distribution: Debian 11.4
Kernel: 5.10.0-12-2-amd64
Build commit: b1456ee1c
Build date: Mon Aug  1 12:23:14 UTC 2022
Built by: sw-r2d2-bot@r-build-sonic-ci03-241

Platform: x86_64-mlnx_msn4410-r0
HwSKU: ACS-MSN4410
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2039X06760
Model Number: MSN4410-WS2FO
Hardware Revision: A1
Uptime: 12:44:43 up  3:15,  2 users,  load average: 0.38, 0.92, 1.06
Date: Thu 04 Aug 2022 12:44:43

Docker images:
REPOSITORY                                         TAG                            IMAGE ID       SIZE
docker-platform-monitor                            202205.20-b1456ee1c_Internal   b92a53896431   993MB
docker-platform-monitor                            latest                         b92a53896431   993MB
docker-syncd-mlnx                                  202205.20-b1456ee1c_Internal   2ade340a46e5   990MB
docker-syncd-mlnx                                  latest                         2ade340a46e5   990MB
docker-orchagent                                   202205.20-b1456ee1c_Internal   761281d6826f   475MB
docker-orchagent                                   latest                         761281d6826f   475MB
docker-macsec                                      latest                         b8a3913eb251   458MB
docker-dhcp-relay                                  latest                         38c5beaa89c7   450MB
docker-sonic-telemetry                             202205.20-b1456ee1c_Internal   fc6f91872da8   520MB
docker-sonic-telemetry                             latest                         fc6f91872da8   520MB
docker-database                                    202205.20-b1456ee1c_Internal   640475ea9c81   440MB
docker-database                                    latest                         640475ea9c81   440MB
docker-router-advertiser                           202205.20-b1456ee1c_Internal   563f997a1fd8   440MB
docker-router-advertiser                           latest                         563f997a1fd8   440MB
docker-mux                                         202205.20-b1456ee1c_Internal   e2993f50f7d5   489MB
docker-mux                                         latest                         e2993f50f7d5   489MB
docker-fpm-frr                                     202205.20-b1456ee1c_Internal   922438d17944   454MB
docker-fpm-frr                                     latest                         922438d17944   454MB
docker-nat                                         202205.20-b1456ee1c_Internal   c38b7dad2664   428MB
docker-nat                                         latest                         c38b7dad2664   428MB
docker-sflow                                       202205.20-b1456ee1c_Internal   925816435673   426MB
docker-sflow                                       latest                         925816435673   426MB
docker-teamd                                       202205.20-b1456ee1c_Internal   aeee8e768530   425MB
docker-teamd                                       latest                         aeee8e768530   425MB
docker-snmp                                        202205.20-b1456ee1c_Internal   5f9f61d8d698   453MB
docker-snmp                                        latest                         5f9f61d8d698   453MB
docker-lldp                                        202205.20-b1456ee1c_Internal   5a56fc2cb5de   450MB
docker-lldp                                        latest                         5a56fc2cb5de   450MB
docker-sonic-mgmt-framework                        202205.20-b1456ee1c_Internal   f1d1439365c9   554MB
docker-sonic-mgmt-framework                        latest                         f1d1439365c9   554MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.3.0-202205-internal-11       215140d2cb02   494MB

Output of show techsupport:

[syslog.182.gz](https://github.com/sonic-net/sonic-buildimage/files/9260600/syslog.182.gz)

Additional information you deem important (e.g. issue happens only occasionally):

@stepanblyschak
Collaborator Author

@sumanbrcm Could you please handle this issue?

PR: sonic-net/sonic-swss#2199

@sumanbrcm

sumanbrcm commented Aug 16, 2022

@stepanblyschak Sure, I will check and work on the fix accordingly. As per the previous discussion in the PR (sonic-net/sonic-swss#2199), the proposal was to move the cleanup code to syncd. As per your suggestion: "the cleanup of COUNTERS RIF tables should be done in syncd while cleanup of the mapping should still be done in orchagent".
Once we have converged on the use case, I will fix along those lines.
Specific to the issue you reported, the likely reason is that fetching in_octets failed but we still use in_octets in the calculation in the line below:
local rx_bps_new = (in_octets - in_octets_last) / delta * 1000
This is because the table (counters_table_name) might not exist for the RIF.
A similar issue is handled in the port_rates.lua script in the following way:
if not in_ucast_pkts or not in_non_ucast_pkts or not out_ucast_pkts or
   not out_non_ucast_pkts or not in_octets or not out_octets then
    logit("Not found some counters on " .. port)
    return
end

We can fix this along similar lines, together with moving the cleanup code to syncd.
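
The guard described in this comment can be sketched outside Redis. The following is a minimal Python model, not the actual RIF rate script: the function name `compute_rx_rate`, the dict standing in for the counters table, the `"in_octets"` key, and the assumption that `delta` is in milliseconds are all illustrative. In a Redis Lua script, an `HGET` on a missing key or field yields `false`, which would explain the "arithmetic on a boolean value" message in the log; the model uses `None` in its place.

```python
from typing import Optional


def compute_rx_rate(counters: dict, oid: str, in_octets_last: int,
                    delta_ms: int) -> Optional[float]:
    """Return the receive rate, or None if the counter is no longer present."""
    # Stand-in for the script's HGET; a deleted RIF has no entry, so this is None.
    entry = counters.get(oid, {})
    in_octets = entry.get("in_octets")
    if in_octets is None:
        # Same idea as the port_rates.lua check: skip instead of doing arithmetic.
        return None
    # Mirrors the formula quoted above: (in_octets - in_octets_last) / delta * 1000
    return (in_octets - in_octets_last) / delta_ms * 1000


# One RIF still has counters, the other was already cleaned up.
counters = {"oid:0x6000000000824": {"in_octets": 5000}}
present = compute_rx_rate(counters, "oid:0x6000000000824", 1000, 1000)
removed = compute_rx_rate(counters, "oid:0x6000000000823", 1000, 1000)
```

With a guard like this, a RIF that disappears between polls is skipped for that interval rather than aborting the whole script run for every other RIF in the batch.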

@yxieca
Contributor

yxieca commented Aug 17, 2022

@sumanbrcm is actively working on this issue.

@yxieca added the Triaged label Aug 17, 2022
@liat-grozovik
Collaborator

@sumanbrcm could you please provide ETA when the fix can be in 202205?

@sumanbrcm

@liat-grozovik ETA for this fix is 10th Oct.

liat-grozovik pushed a commit to sonic-net/sonic-swss that referenced this issue Nov 8, 2022
The cleanup code for stale RIF counters is now moved to syncd. Earlier, cleanup for stale RIF counters was added as part of the fix for issue #2193.
But it could create a race condition between orchagent removing RIF rate counters from the DB and the Lua script fetching them.
So, as a fix, all such cleanup has been moved to syncd.

- What I did
Fix for sonic-net/sonic-buildimage#11621

A past fix aimed at removing stale RIF counters (#2199) introduced a possible race condition that leads to the Lua script reporting an error.
To handle this, the RIF counters cleanup code (handled in cleanUpRifFromCounterDb) is now called from syncd (removeCounter) to avoid the race condition.

- Why I did it

The operations in orchagent and syncd are not synchronous, so while orchagent deletes the RIF counters from COUNTERS DB, syncd can still access them. When the race occurs, the Lua script trying to fetch RIF counters logs errors to syslog because the counters were already deleted by orchagent. The cleanup code is removed from orchagent and added to syncd, which makes sure this race condition cannot be hit.
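
The ordering problem can be illustrated with a small single-process model. This is a sketch under assumptions, not the sairedis or swss implementation: the `FlexCounterModel` class, its method names, and the dict standing in for COUNTERS DB are hypothetical. It shows why deleting the DB entry from outside the poller can leave an OID in the poll list with no data behind it, while removing both in the poller keeps them consistent.

```python
class FlexCounterModel:
    """Hypothetical stand-in for the poller that owns the list of RIF OIDs."""

    def __init__(self, db: dict):
        self.db = db        # stand-in for COUNTERS DB
        self.polled = []    # OIDs handed to the rate script on each poll

    def add_counter(self, oid: str) -> None:
        self.polled.append(oid)
        self.db["COUNTERS:" + oid] = {"in_octets": 0}

    def remove_counter(self, oid: str) -> None:
        # Stop polling first, then delete, so a poll never sees a missing entry.
        self.polled.remove(oid)
        self.db.pop("COUNTERS:" + oid, None)

    def poll(self) -> list:
        """Return the OIDs whose counters are missing at poll time."""
        return [oid for oid in self.polled if "COUNTERS:" + oid not in self.db]


# Deletion by another component: the poll list still holds the OID.
db = {}
fc = FlexCounterModel(db)
fc.add_counter("oid:0x1")
del db["COUNTERS:oid:0x1"]
missing_when_external = fc.poll()

# Deletion by the poller itself: poll list and DB stay consistent.
db2 = {}
fc2 = FlexCounterModel(db2)
fc2.add_counter("oid:0x1")
fc2.remove_counter("oid:0x1")
missing_when_owned = fc2.poll()
```

In the first scenario the poll reports a missing entry, which corresponds to the script error in the issue; in the second it reports none.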

- How I verified it

Followed the steps in sonic-net/sonic-buildimage#11621:

  1. Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
  2. Remove RIF
  3. Repeat the steps multiple times and check whether any error syslog is seen (no error syslog was seen)

Also checked the cleanup of RIF counters.

After RIF creation, obtained the OID for the RIF from "COUNTERS_RIF_NAME_MAP":
127) "Vlan100"
128) "oid:0x6000000000aa5"

Checked all the tables in COUNTERS_DB that have the same OID in their keys
127.0.0.1:6379[2]> keys *6000000000aa5*
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>

Deleted the RIF by removing the IP address on the interface.

Checked COUNTERS_DB again with the same OID for stale entries. No stale entries exist now.
127.0.0.1:6379[2]> keys *6000000000aa5*
(empty array)
127.0.0.1:6379[2]>
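
The stale-entry check above can be modeled without a running Redis. This is a minimal sketch, assuming a plain list of key names in place of COUNTERS_DB and a hypothetical helper `keys_matching`; Python's `fnmatch` stands in for Redis glob matching, which behaves the same way for a plain `*` wildcard. Since `KEYS` takes a glob-style pattern, finding keys that merely contain the OID needs a wildcard on both sides of it.

```python
from fnmatch import fnmatchcase


def keys_matching(all_keys, pattern):
    """Return keys matching a glob pattern, similar in spirit to Redis KEYS."""
    return sorted(k for k in all_keys if fnmatchcase(k, pattern))


oid = "6000000000aa5"
before = [
    "RATES:oid:0x6000000000aa5:RIF",
    "COUNTERS:oid:0x6000000000aa5",
    "RATES:oid:0x6000000000aa5",
    "COUNTERS:oid:0x6000000000aa6",  # unrelated RIF (hypothetical)
]
after = ["COUNTERS:oid:0x6000000000aa6"]  # state after the RIF is deleted

stale_before = keys_matching(before, "*" + oid + "*")
stale_after = keys_matching(after, "*" + oid + "*")
exact_only = keys_matching(before, oid)  # no wildcards: nothing matches
```

An empty result for the wildcard pattern after deletion is the condition the verification relies on; the unrelated RIF's keys are left untouched.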

Signed-off-by: Suman Kumar <[email protected]>
yxieca pushed a commit to sonic-net/sonic-swss that referenced this issue Nov 10, 2022
@adyeung
Collaborator

adyeung commented Nov 11, 2022

Fix merged

@adyeung adyeung closed this as completed Nov 11, 2022
yxieca pushed a commit to sonic-net/sonic-sairedis that referenced this issue Feb 9, 2023
Changes for 202205 branch

Fixing issue #11621
All the details of the fix in the master branch are in these PRs:
sairedis: Fixing race condition for rif counters #1136
swss: Fixing race condition for rif counters sonic-swss#2488
This change is already merged to the master branch, but the sonic-sairedis change (sairedis: Fixing race condition for rif counters #1136) could not be cherry-picked to the 202205 branch because the base file FlexCounter.cpp differs between the two branches. Also, the API removeDataFromCountersDB is not available in the 202205 branch for the cleanup. Hence the current fix cleans up the stale RIF counters through the newly added API cleanUpRifFromCounterDb.
Unit tests:
The steps followed are the same as in sonic-net/sonic-buildimage#11621

  1. Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
  2. Remove RIF
  3. Repeat the steps multiple times and check whether any error syslog is seen (no error syslog was seen)

Also checked the cleanup of RIF counters.
After RIF creation, obtained the OID for the RIF from "COUNTERS_RIF_NAME_MAP":
127) "Vlan100"
128) "oid:0x6000000000aa5"

Checked all the tables in COUNTERS_DB that have the same OID in their keys
127.0.0.1:6379[2]> keys *6000000000aa5*
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>

Deleted the RIF by removing the IP address on the interface.

Checked COUNTERS_DB again with the same OID for stale entries. No stale entries exist now.
127.0.0.1:6379[2]> keys *6000000000aa5*
(empty array)
127.0.0.1:6379[2]>

Signed-off-by: Suman Kumar <[email protected]>