Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SHM Acks degrade over time #1194

Closed
FlorianReimold opened this issue Sep 6, 2023 · 3 comments · Fixed by #1207
Closed

SHM Acks degrade over time #1194

FlorianReimold opened this issue Sep 6, 2023 · 3 comments · Fixed by #1207
Assignees
Labels
bug Something isn't working

Comments

@FlorianReimold
Copy link
Member

FlorianReimold commented Sep 6, 2023

Problem Description

eCAL seems to not handle SHM Acknowledgements properly. Even in a system that has enough resources, everything seems to degrade over time and the publisher will run into SHM ack timouts.

  • A first analysis shows, that if the Subscriber is not able to get memfile access in 5ms (hardcoded!) It will omit sending the shm ack and let the publisher wait for the timeout.

    if(m_memfile.GetReadAccess(5))

  • Also, the list of events hold by the publisher is never cleaned up, so when subscribers go away, the publisher will still try to receive their events.

    • Hypothesis: The following TODO has something to do with that:
      bool CDataWriterSHM::RemLocConnection(const std::string& process_id_)
      {
      if (!m_created) return false;
      bool ret_state(true);
      for (auto& memory_file : m_memory_file_vec)
      {
      // This is not working correctly under POSIX for memory files that are read and written within the same process.
      //
      // The functions 'CSyncMemoryFile::Disconnect' and 'CDataWriterSHM::RemLocConnection' are now called
      // by the new Subscriber Unregistration event logic and were never called in any previous eCAL version.
      //
      // TODO: Fix this in 'CSyncMemoryFile::Disconnect' to handle event resources properly.

How to reproduce

  1. Set Ack timeout to 500 in ecal.ini
  2. Compile the ecal-perftool
  3. Start 1 publisher
    ecal_perftool pub topic 25 1024000
    
  4. Start 4 susbscribers
    ecal_perftool sub topic 10
    
  5. Start 5 subscribers that slow down the system a little bit, but not too much
    ecal_perftool sub topic 100
    

The publishing freq should be 10 Hz (due to the 100ms sub, that slows everything down), but after some time it will go down to 2Hz, which is the ack timeout:
image

How did you get eCAL?

Ubuntu PPA (apt-get)

Environment

eCAL 5.11.5
Ubuntu 20.04

eCAL System Information

------------------------- SYSTEM ---------------------------------
Version                  : v5.11.5 (2023-07-24 12:19:00 +0200)
Platform                 : linux

------------------------- CONFIGURATION --------------------------
Default INI              : /etc/ecal/ecal.ini

------------------------- NETWORK --------------------------------
Host name                : florian-ubuntu20
Network mode             : local
Network ttl              : 2
Network sndbuf           : 5 MByte
Network rcvbuf           : 5 MByte
Multicast group          : 239.0.0.1
Multicast mask           : 0.0.0.15
Multicast ports          : 14000 - 14010
Multicast join all IFs   : off
Bandwidth limit (udp)    : not limited

------------------------- TIME -----------------------------------
Synchronization realtime : "ecaltime-localtime"
Synchronization replay   :
State                    :  synchronized
Master / Slave           :  Master
Status (Code)            : "everything is fine." (0)

------------------------- PUBLISHER LAYER DEFAULTS ---------------
Layer Mode INPROC        : off
Layer Mode SHM           : auto
Layer Mode TCP           : off
Layer Mode UDP MC        : auto

------------------------- SUBSCRIPTION LAYER DEFAULTS ------------
Layer Mode INPROC        : on
Layer Mode SHM           : on
Layer Mode TCP           : on
Layer Mode UDP MC        : on
Npcap UDP Reciever       : off
@FlorianReimold FlorianReimold added the bug Something isn't working label Sep 6, 2023
@FlorianReimold
Copy link
Member Author

Current (workaround?) branch is here:
https:/eclipse-ecal/ecal/tree/hotfix/memfile_ack_timeout_5_11

Currently, this branch only increases the internal timeout making it more unlikely to miss it.

@FlorianReimold
Copy link
Member Author

FlorianReimold commented Sep 21, 2023

No idea why, but this is what has happened. Could be the testing tool ecal_perftool, not sure, yet. But for sure, the ack_timeout also doesn not what it is supposed to do.

  • For some reason, the limiting 100ms subscriber seems to miss every second frame, even though the ack_timeout is set to 500ms
  • For some reason, one of my 10ms subscribers gets all the messages packed together in a shorter time resulting in a higher computed frequency. You can see, that it receives ~13 messages a second, but the higher computed frequency shows that the time between the first and the last message must be shorter.

As always, the issue created itself after some time.

The test was performed with the current HEAD of https:/eclipse-ecal/ecal/tree/hotfix/shm_data_drop
This shows, that there still is an issue. We haven't fixed it, but we have relieved the effect.

1 x publisher

./ecal_perftool pub topic 25 1024000

[  4355.032] | cnt:      9 | mean_loop_dt (ms): 101.478 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.401
[  4356.033] | cnt:     10 | mean_loop_dt (ms): 101.202 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.153
[  4357.033] | cnt:     10 | mean_loop_dt (ms): 101.278 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.333
[  4358.034] | cnt:     10 | mean_loop_dt (ms): 101.408 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.473
[  4359.122] | cnt:      5 | mean_loop_dt (ms): 101.102 | loop_freq (Hz):     9.9 | mean_snd_dt (ms): 101.182
[  4360.123] | cnt:     14 | mean_loop_dt (ms): 121.973 | loop_freq (Hz):     8.2 | mean_snd_dt (ms):  92.370
[  4361.124] | cnt:     13 | mean_loop_dt (ms):  78.496 | loop_freq (Hz):    12.7 | mean_snd_dt (ms):  49.202
[  4362.124] | cnt:     14 | mean_loop_dt (ms):  75.455 | loop_freq (Hz):    13.3 | mean_snd_dt (ms):  46.485
[  4363.124] | cnt:     13 | mean_loop_dt (ms):  78.439 | loop_freq (Hz):    12.7 | mean_snd_dt (ms):  49.053
[  4364.125] | cnt:     14 | mean_loop_dt (ms):  75.445 | loop_freq (Hz):    13.3 | mean_snd_dt (ms):  46.402

4 x 10ms subscriber:

./ecal_perftool sub topic 10

[  4355.032] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.383 | msg_freq (Hz):     9.9
[  4356.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.252 | msg_freq (Hz):     9.9
[  4357.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.317 | msg_freq (Hz):     9.9
[  4358.034] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.439 | msg_freq (Hz):     9.9
[  4359.124] | cnt:      5 | lost:      0 | msg_dt_mean (ms): 101.091 | msg_freq (Hz):     9.9
[  4360.124] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.396 | msg_freq (Hz):    13.3
[  4361.124] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  78.415 | msg_freq (Hz):    12.8
[  4362.125] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.511 | msg_freq (Hz):    13.2
[  4363.125] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  78.357 | msg_freq (Hz):    12.8
[  4364.125] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  75.406 | msg_freq (Hz):    13.3

1 x 10ms subscriber that behaves substantially different and suddenly received more than what is published!

./ecal_perftool sub topic 10

[  4355.365] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.324 | msg_freq (Hz):     9.9
[  4356.365] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.035 | msg_freq (Hz):     9.9
[  4357.366] | cnt:      9 | lost:      0 | msg_dt_mean (ms): 101.414 | msg_freq (Hz):     9.9
[  4358.367] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.375 | msg_freq (Hz):     9.9
[  4359.367] | cnt:      4 | lost:      0 | msg_dt_mean (ms): 262.172 | msg_freq (Hz):     3.8
[  4360.368] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.071 | msg_freq (Hz):    25.0
[  4361.368] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.004 | msg_freq (Hz):    25.0
[  4362.369] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  40.013 | msg_freq (Hz):    25.0
[  4363.370] | cnt:     13 | lost:      0 | msg_dt_mean (ms):  40.063 | msg_freq (Hz):    25.0
[  4364.370] | cnt:     14 | lost:      0 | msg_dt_mean (ms):  40.060 | msg_freq (Hz):    25.0

1 x 100ms subscriber

./ecal_perftool sub topic 100

[  4355.032] | cnt:      9 | lost:      0 | msg_dt_mean (ms): 101.366 | msg_freq (Hz):     9.9
[  4356.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.189 | msg_freq (Hz):     9.9
[  4357.033] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.325 | msg_freq (Hz):     9.9
[  4358.034] | cnt:     10 | lost:      0 | msg_dt_mean (ms): 101.391 | msg_freq (Hz):     9.9
[  4359.124] | cnt:      6 | lost:      0 | msg_dt_mean (ms): 101.162 | msg_freq (Hz):     9.9
[  4360.124] | cnt:      6 | lost:      4 | msg_dt_mean (ms): 168.551 | msg_freq (Hz):     5.9
[  4361.124] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 176.479 | msg_freq (Hz):     5.7
[  4362.125] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 175.789 | msg_freq (Hz):     5.7
[  4363.125] | cnt:      7 | lost:      5 | msg_dt_mean (ms): 153.267 | msg_freq (Hz):     6.5
[  4364.125] | cnt:      6 | lost:      5 | msg_dt_mean (ms): 176.047 | msg_freq (Hz):     5.7

eCAL Mon shows a little different picture:

image

image

processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5279
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool pub topic 25 1024000"
  pmemory: 435613696
  usrptime: -1
  datawrite: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5290
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 585547776
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5303
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5315
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5327
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5482
  hname: "florian-ubuntu20"
  pid: 5353
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 10"
  pmemory: 510046208
  usrptime: -1
  dataread: 14336000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5373
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  pparam: "./ecal_perftool sub topic 100"
  pmemory: 510046208
  usrptime: -1
  dataread: 6144000
  state {
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 183
  component_init_info: "pub|sub|srv|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
processes {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5436
  pname: "/usr/bin/ecal_mon_gui-5.12.0"
  uname: "eCALMon"
  pparam: "/usr/bin/ecal_mon_gui"
  pmemory: 934182912
  usrptime: -1
  state {
    severity: proc_sev_healthy
    info: "Running"
  }
  tsync_state: tsync_realtime
  tsync_mod_name: "\"ecaltime-localtime\""
  component_init_state: 191
  component_init_info: "pub|sub|srv|mon|log|time"
  ecal_runtime_version: "v5.12.0-nightly2-71-gac103991e"
  hgname: "florian-ubuntu20"
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5279
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1624945319207"
  tname: "topic"
  direction: "publisher"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 6
  dclock: 65106
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5290
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1628043986590"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 65003
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5303
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1629142832302"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64976
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5315
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1630083352274"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64952
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5327
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1631091796732"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64943
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5482
  hname: "florian-ubuntu20"
  pid: 5353
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1645633128470"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  dclock: 64574
  dfreq: 14000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}
topics {
  rclock: 5483
  hname: "florian-ubuntu20"
  pid: 5373
  pname: "/home/florian/Projects/_build/release/ecal_perftool"
  uname: "ecal-perftool"
  tid: "1654095462610"
  tname: "topic"
  direction: "subscriber"
  tlayer {
    type: tl_ecal_shm
    confirmed: true
  }
  tsize: 1024000
  connections_loc: 1
  message_drops: 33478
  dclock: 44831
  dfreq: 6000
  hgname: "florian-ubuntu20"
  tdatatype {
  }
}

@FlorianReimold
Copy link
Member Author

FlorianReimold commented Sep 27, 2023

So I now have a way of immediately triggering the issue. The setup is as follows:

  • Ubuntu 20.04 (probably other Ubuntus as well)
  • eCAL Branch hotfix/shm_data-drop
  • ecal-perftool with --hickup support FlorianReimold/ecal-perftool@66bfce4
  • ecal.ini with ack timeout of 500

The only thing we have to do is make the subscriber miss the ack timeout once. After missing it once, the entire ack feature breaks down entirely and never recovers.

  1. ecal_perftool pub topic 25 1024
  2. ecal_perftool sub topic 100 --hickup 1500 550

Now watch the eCAL ack feature breaking. I think we can say, that SHM acks are pretty much useless on Ubuntu at the moment, as only 1 missed ack destroys the entire feature. I don't have a solution, yet, but on Windows it doesn't show the same effect, so there must be some solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants