Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix EntityComponentManager race condition #601

Merged
merged 4 commits into from
Feb 16, 2021
Merged

Fix EntityComponentManager race condition #601

merged 4 commits into from
Feb 16, 2021

Conversation

ddengster
Copy link
Contributor

@ddengster ddengster commented Feb 1, 2021

This PR attempts to fix an incredibly hard to reproduce race condition that results in a std::map::at() crash in the View::ComponentImplementation() function.

printouts show 2 threads trying to update the view.

[ign-13] sleeping in View::ComponentImplementation(), entity: 353
[ign-13] thread: 139693954283328
[ign-13] View::AddEntity entity added: 353
[ign-13] thread: 139693207320320
[ign-13] View::AddComponent entity: 353
[ign-13] thread: 139693207320320
[ign-13] View::AddComponent entity: 353
[ign-13] thread: 139693207320320
[ign-13] View::AddComponent entity: 353
[ign-13] thread: 139693207320320
[ign-13] View::AddComponent entity: 353

Gists for the stack trace of both threads:
https://gist.github.com/luca-della-vedova/44fc24aa3794c70ff814feccb05d4831
https://gist.github.com/ddengster/a77c7bf9ade79ceb51182c6296c053ae

From examining the gists it seems like there is one thread upon a plugin added event (GuiRunner::OnPluginAdded) creating a bunch of rendering entities plus accessing components, and the other thread is receiving a ignition::msgs::SerializedStepMap message and trying to rebuild views (clearing components map of the view). I've made sure both are not run in parallel in this PR and it seems to fix the issues (though my scene seems to be loading slower).

I've tried to push RebuildViews() to a delayed step similar to processing requests in the GuiRunner::OnState function but it seems that there are some dependencies on the results of that function, so I had to revert to this solution. It may be wise to re-think how threads access the ECM so as to prevent a whole lot of lockspam in the future.

@github-actions github-actions bot added the 🔮 dome Ignition Dome label Feb 1, 2021
@chapulina chapulina added the bug Something isn't working label Feb 1, 2021
Copy link
Contributor

@adlarkin adlarkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than my comment about API/ABI below, you mentioned that this change causes a decrease in performance. Do you have any numbers that measure the change in performance before/after this change?

include/ignition/gazebo/EntityComponentManager.hh Outdated Show resolved Hide resolved
@ddengster
Copy link
Contributor Author

ddengster commented Feb 5, 2021

Other than my comment about API/ABI below, you mentioned that this change causes a decrease in performance. Do you have any numbers that measure the change in performance before/after this change?

My apologies, I meant scene loading performance. I don't have any numbers; just that before this fix I would have the scene show up as a whole, but with this fix the models show up in stages; so it might not be such a bad thing after all.

cc @luca-della-vedova have you seen this stage by stage loading behaviour with the clinic demo?

@chapulina
Copy link
Contributor

it seems like there is one thread upon a plugin added event (GuiRunner::OnPluginAdded) creating a bunch of rendering entities plus accessing components

I believe this is the underlying problem. We shouldn't be calling plugin->Update from a different thread. All update calls should be thread safe. Some possible alternatives:

  • OnPluginAdded could call RequestState to make sure the inserted plugin gets a fresh update ASAP.
  • We could have a mutex internal to GuiRunner that prevents multiple plugin->Updates from being called at the same time.

If one of these suggestions work, I think it could be more reliable than the current proposal, which requires each plugin to protect its own updates. We should make the updates thread-safe by default so plugins don't need to worry about it.

@ddengster ddengster closed this Feb 11, 2021
@ddengster ddengster reopened this Feb 11, 2021
@ddengster
Copy link
Contributor Author

ddengster commented Feb 11, 2021

If one of these suggestions work, I think it could be more reliable than the current proposal, which requires each plugin to protect its own updates. We should make the updates thread-safe by default so plugins don't need to worry about it.

Thanks for the suggestions. I've attempted the 2 suggestions:

We could have a mutex internal to GuiRunner that prevents multiple plugin->Updates from being called at the same time.

This ends up in a deadlock. My guess is some plugin is likely waiting for some results in another plugin.

OnPluginAdded could call RequestState to make sure the inserted plugin gets a fresh update ASAP.

This works great, I've put it into the PR but it ends up with messages that the ign transport node is not ready, see below.

[ign-13] [GUI] [Msg] Added plugin [Entity tree] to main window
[ign-13] Node::Advertise(): Error advertising service [/425482/state_async]. Did you forget to start the discovery service?
[ign-13] [GUI] [Msg] Loaded plugin [EntityTree] from path [/home/ddeng/ignition_ws/install/lib/ign-gazebo-4/plugins/gui/libEntityTree.so]
[ign-13] [GUI] [Dbg] [Application.cc:305] Loading plugin [toggle_charging]
[ign-13] [GUI] [Msg] Added plugin [Toggle Charging] to main window
[ign-13] Node::Advertise(): Error advertising service [/425482/state_async]. Did you forget to start the discovery service?

I havent found any function that allows me to wait on the transport layer node (or check if it's ready) in RequestState() . Still the request buffers collapse into the GuiRunner::OnState call, so every plugin updates and it all works 👍 . Maybe this improvement on the transport node layer could be done in a future patch.

Signed-off-by: ddengster <[email protected]>
@adlarkin adlarkin self-requested a review February 11, 2021 15:22
Copy link
Contributor

@chapulina chapulina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it ends up with messages that the ign transport node is not ready,

I think that message is misleading, what's happening is that we're trying to advertise the same service several times. I took the liberty of pushing a fix in 8a3734f, let me know what you think. This is ready to merge if that works for you. Thanks!

@ddengster
Copy link
Contributor Author

I think that message is misleading, what's happening is that we're trying to advertise the same service several times. I took the liberty of pushing a fix in 8a3734f, let me know what you think. This is ready to merge if that works for you. Thanks!

You're welcome. The advertising error messages are gone now; let's merge it if there are no more outstanding problems.

@chapulina chapulina merged commit 215c8ed into gazebosim:ign-gazebo4 Feb 16, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working 🔮 dome Ignition Dome
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants