Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regression in Rolling: error "cannot publish data, at ./src/rmw_publish.cpp:66" during shutdown of binary #761

Closed
rhaschke opened this issue May 29, 2024 · 17 comments

Comments

@rhaschke
Copy link

Bug report

In Rolling, I observe weird segfaults during shutdown of the test binary:

2024-05-26T18:55:07.8227778Z 12: [moveit_task_constructor_core-test-move-relative-1] [==========] 3 tests from 1 test suite ran. (138 ms total)
2024-05-26T18:55:07.8228588Z 12: [moveit_task_constructor_core-test-move-relative-1] [  PASSED  ] 3 tests.
2024-05-26T18:55:07.8229520Z 12: [moveit_task_constructor_core-test-move-relative-1] cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
2024-05-26T18:55:07.8230659Z 12: [moveit_task_constructor_core-test-move-relative-1] Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
2024-05-26T18:55:07.8233380Z 12: [ERROR] [moveit_task_constructor_core-test-move-relative-1]: process has died [pid 5393, exit code -11, cmd '/home/runner/work/moveit_task_constructor/moveit_task_constructor/.work/target_ws/build/moveit_task_constructor_core/test/moveit_task_constructor_core-test-move-relative --ros-args --params-file /tmp/launch_params_ujuovdp5 --params-file /tmp/launch_params_htlg3d0o --params-file /tmp/launch_params_4g9zqee6 --params-file /tmp/launch_params_niamvcg5'].

Humble and Iron don't exhibit this issue.
WORKAROUND: Bypassing normal cleanup routines by calling quick_exit().

Required Info:

  • Operating System: Ubuntu 24.04
  • Installation type: deb package
  • Version or commit hash: 8.5.0
  • DDS implementation:fastrtps
  • Client library (if applicable): rclcpp

Steps to reproduce issue

git clone -b https:/ros-planning/moveit_task_constructor.git src/mtc
(cd src/mtc; git checkout bb047c8)
docker run --rm -it -v $PWD/src:/root/ws_moveit/src  moveit/moveit2:rolling-source

in docker:

colcon build --mixin rel-with-deb-info
python3 -m launch_testing.launch_test /root/ws_moveit/src/mtc/core/test/test.launch.py test_binary:=/root/ws_moveit/build/moveit_task_constructor_core/test/moveit_task_constructor_core-test-move-relative

Expected behavior

Binary shuts down cleanly after tests.

Actual behavior

Binary throws error message and segfaults:

cannot publish data, at ./src/rmw_publish.cpp:66 during '__function__'
Fail in delete datareader, at ./src/rmw_service.cpp:89 during '__function__'
@rhaschke
Copy link
Author

rhaschke commented May 29, 2024

Although this issue exhibits the same error message and behavior as #760, it seems to have a different source, as the corresponding Iron build works fine for me.

@fujitatomoya
Copy link
Collaborator

CC: @Barry-Xu-2018

@fujitatomoya
Copy link
Collaborator

@rhaschke quick question. do you have this problem only with rmw_fastrtps? not with export RMW_IMPLEMENTATION=rmw_connextdds nor export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp?

@Barry-Xu-2018
Copy link
Contributor

@fujitatomoya

In my environment, when I switched to CycloneDDS, no issues came up.

@rhaschke I wonder if the test results in your environment are the same.
Besides, I can produce this issue. But I found there's another phenomenon --- "execution freezing."

I have to use ctrl + c to terminate execution.
@rhaschke Have you ever experienced this issue in your environment ?

root@Test-Laptop:~/ws_moveit# python3 -m launch_testing.launch_test /root/ws_moveit/src/mtc/core/test/test.launch.py test_binary:=/root/ws_moveit/build/moveit_task_constructor_core/test/moveit_task_constructor_core-test-move-relative
[INFO] [launch]: All log files can be found below /root/.ros/log/2024-05-31-06-19-31-601646-Test-Laptop-262
[INFO] [launch]: Default logging verbosity is set to INFO
test_gtest_run_complete (test.launch.TestTerminatingProcessStops.test_gtest_run_complete) ... [INFO] [moveit_task_constructor_core-test-move-relative-1]: process started with pid [266]
[INFO] [python3-2]: process started with pid [267]
[moveit_task_constructor_core-test-move-relative-1] [==========] Running 3 tests from 1 test suite.
[moveit_task_constructor_core-test-move-relative-1] [----------] Global test environment set-up.
[moveit_task_constructor_core-test-move-relative-1] [----------] 3 tests from PandaMoveRelative
[moveit_task_constructor_core-test-move-relative-1] [ RUN      ] PandaMoveRelative.cartesianRotateEEF
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.782343353] [moveit_2320786753.moveit.ros.rdf_loader]: Loaded robot model in 0.00245061 seconds
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.782433490] [moveit_2320786753.moveit.core.robot_model]: Loading robot model 'panda'...
[moveit_task_constructor_core-test-move-relative-1] [WARN] [1717136371.790096255] [kdl_parser]: The root link panda_link0 has an inertia specified in the URDF, but KDL does not support a root link with an inertia.  As a workaround, you can add an extra dummy link to your URDF.
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.790162217] [moveit_2320786753.moveit.kinematics.kdl_kinematics_plugin]: Joint weights for group 'panda_arm': 1 1 1 1 1 1 1
[moveit_task_constructor_core-test-move-relative-1] [       OK ] PandaMoveRelative.cartesianRotateEEF (75 ms)
[moveit_task_constructor_core-test-move-relative-1] [ RUN      ] PandaMoveRelative.cartesianCircular
[moveit_task_constructor_core-test-move-relative-1] [WARN] [1717136371.802597297] [rcl.logging_rosout]: Publisher already registered for node name: 'introspection_Test_Laptop_266_100389300798272'. If this is due to multiple nodes with the same name then all logs for the logger named 'introspection_Test_Laptop_266_100389300798272' will go out over the existing publisher. As soon as any node with that name is destructed it will unregister the publisher, preventing any further logs for that name from being published on the rosout topic.
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.811545153] [moveit_2320786753.moveit.ros.rdf_loader]: Loaded robot model in 0.00260597 seconds
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.811592841] [moveit_2320786753.moveit.core.robot_model]: Loading robot model 'panda'...
[moveit_task_constructor_core-test-move-relative-1] [WARN] [1717136371.819243163] [kdl_parser]: The root link panda_link0 has an inertia specified in the URDF, but KDL does not support a root link with an inertia.  As a workaround, you can add an extra dummy link to your URDF.
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.819340954] [moveit_2320786753.moveit.kinematics.kdl_kinematics_plugin]: Joint weights for group 'panda_arm': 1 1 1 1 1 1 1
[moveit_task_constructor_core-test-move-relative-1] [       OK ] PandaMoveRelative.cartesianCircular (27 ms)
[moveit_task_constructor_core-test-move-relative-1] [ RUN      ] PandaMoveRelative.cartesianRotateAttachedIKFrame
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.837962508] [moveit_2320786753.moveit.ros.rdf_loader]: Loaded robot model in 0.00202064 seconds
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.837986683] [moveit_2320786753.moveit.core.robot_model]: Loading robot model 'panda'...
[moveit_task_constructor_core-test-move-relative-1] [WARN] [1717136371.844279933] [kdl_parser]: The root link panda_link0 has an inertia specified in the URDF, but KDL does not support a root link with an inertia.  As a workaround, you can add an extra dummy link to your URDF.
[moveit_task_constructor_core-test-move-relative-1] [INFO] [1717136371.844309908] [moveit_2320786753.moveit.kinematics.kdl_kinematics_plugin]: Joint weights for group 'panda_arm': 1 1 1 1 1 1 1
[moveit_task_constructor_core-test-move-relative-1] [       OK ] PandaMoveRelative.cartesianRotateAttachedIKFrame (24 ms)
[moveit_task_constructor_core-test-move-relative-1] [----------] 3 tests from PandaMoveRelative (127 ms total)
[moveit_task_constructor_core-test-move-relative-1] 
[moveit_task_constructor_core-test-move-relative-1] [----------] Global test environment tear-down
[moveit_task_constructor_core-test-move-relative-1] [==========] 3 tests from 1 test suite ran. (127 ms total)
[moveit_task_constructor_core-test-move-relative-1] [  PASSED  ] 3 tests.

@rhaschke
Copy link
Author

Do you have this problem only with rmw_fastrtps?

Yes. I don't face the problem with Cyclone. I was not able to test with Connext as the middleware implementation is not found for some reason. I tried both, RMW_IMPLEMENTATION=rmw_connextdds and RMW_IMPLEMENTATION=rmw_connext_cpp.

@rhaschke
Copy link
Author

I never observed execution freezing as reported by @Barry-Xu-2018.

@Barry-Xu-2018
Copy link
Contributor

After checking test code, I find rclcpp::shutdown() is never called. There weren't any issues before, so it might have just been luck.
https:/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/test/test_move_relative.cpp

To quickly verify this, I added rclcpp::shutdown() at the end of the last test cartesianRotateAttachedIKFrame.
I try to add rclcpp::shutdown()

https:/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/test/test_move_relative.cpp#L116-L130

I tested modified codes 10 times in my environment and the issue didn't come up again.
@rhaschke maybe you can try it.

@rhaschke
Copy link
Author

Adding rclcpp::shutdown() in the last test indeed solves the problem somehow.
I tried adding shutdown() before, but in main() (for better symmetry), which didn't change anything!
Why is it important where shutdown is called? I still consider this a bug in fastrtps.

@rhaschke
Copy link
Author

Calling shutdown() in the last test instead of main, means that the Node is still existing when shutdown is called. Maybe this gives a hint to the problem?

@Barry-Xu-2018
Copy link
Contributor

Adding rclcpp::shutdown() in the last test indeed solves the problem somehow.

I just wanted to quickly check an idea. Putting a shutdown in a test isn't really appropriate.

I tried adding shutdown() before, but in main() (for better symmetry), which didn't change anything!

Yes. This is a problem.

Why is it important where shutdown is called? I still consider this a bug in fastrtps.

It is used to gracefully shut down ROS 2 nodes and associated resources (The correct order of release).|

Calling shutdown() in the last test instead of main, means that the Node is still existing when shutdown is called. Maybe this gives a hint to the problem?

Yes, there are other reasons too. I'll continue to investigate this.

@Barry-Xu-2018
Copy link
Contributor

The failure is related to

https:/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/src/introspection.cpp#L160

Since I'm not very familiar with the relevant code, I don't quite understand the use of static here.

When the introspection node is removed, the introspection node resources are not released immediately. Instead, it is released when the static IntrospectionExecutor is destroyed (At this time, it seems that fastdds resources are already gone).

The following modification is a workaround.
https:/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/src/introspection.cpp#L89-L97

	void remove_node(const rclcpp::Node::SharedPtr& node) {
		std::lock_guard<std::mutex> lock(mutex_);
		executor_->remove_node(node);
		if (--nodes_count_ == 0) {
			executor_->cancel();
			if (executor_thread_.joinable())
				executor_thread_.join();
                                executor_ = rclcpp::executors::SingleThreadedExecutor::make_unique();
                                 // ^^^  Adding this line will also ensure that the introspection node resources are released immediately and correctly.
		}
	}

I will continue to investigate, after the introspection node is removed from the executor, and why the introspection node's resources are still being released when the executor's resources are released.

@rhaschke
Copy link
Author

rhaschke commented Jun 4, 2024

Thanks for tracking this down to the static executor, which is indeed bad practise. I'm still wondering, why this worked in Humble.

@Barry-Xu-2018
Copy link
Contributor

Barry-Xu-2018 commented Jun 4, 2024

Thanks for tracking this down to the static executor, which is indeed bad practise. I'm still wondering, why this worked in Humble.

I guess it is related to this change ros2/rclcpp#2143.
While removing node, it will be put into pending_removed_nodes_.
https:/ros2/rclcpp/blob/d8d83a0ee67b7b6427defc802005c382b218459d/rclcpp/src/rclcpp/executors/executor_entities_collector.cpp#L110-L133
In this case, it calls executor->cancel() at once after removing node.
This leads node isn't removed. While executor is destructed, this node will be removed.

@Barry-Xu-2018
Copy link
Contributor

For fastdds, when the program ends, fastdds resources start to release by Destructor.
Note that the destruction of the static executor happens after this process. This results in an exception when the executor tries to access fastdds resources during its destruction.

...
    at /root/ws_moveit/src/Fast-DDS/src/cpp/fastdds/domain/DomainParticipantFactory.cpp:73
#10 0x00007ffff08a8cb4 in eprosima::fastdds::dds::DomainParticipantFactory::~DomainParticipantFactory (this=0x5555557a0680, __in_chrg=<optimized out>)
    at /root/ws_moveit/src/Fast-DDS/src/cpp/fastdds/domain/DomainParticipantFactory.cpp:86
#11 0x00007ffff08a8d4c in operator() (__closure=0x5555557a13f0, p=0x5555557a0680) at /root/ws_moveit/src/Fast-DDS/src/cpp/fastdds/domain/DomainParticipantFactory.cpp:100
#12 0x00007ffff08ad3e2 in std::_Sp_counted_deleter<eprosima::fastdds::dds::DomainParticipantFactory*, eprosima::fastdds::dds::DomainParticipantFactory::get_shared_instance()::<lambda(eprosima::fastdds::dds::DomainParticipantFactory*)>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose(void) (this=0x5555557a13e0) at /usr/include/c++/13/bits/shared_ptr_base.h:527
#13 0x00007ffff0586833 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x5555557a13e0) at /usr/include/c++/13/bits/shared_ptr_base.h:346
#14 0x00007ffff0586d83 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffff1393548 <eprosima::fastdds::dds::DomainParticipantFactory::get_shared_instance()::instance+8>, 
    __in_chrg=<optimized out>) at /usr/include/c++/13/bits/shared_ptr_base.h:1071
#15 0x00007ffff08aea5c in std::__shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (
    this=0x7ffff1393540 <eprosima::fastdds::dds::DomainParticipantFactory::get_shared_instance()::instance>, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/shared_ptr_base.h:1524
#16 0x00007ffff08aea7c in std::shared_ptr<eprosima::fastdds::dds::DomainParticipantFactory>::~shared_ptr (
    this=0x7ffff1393540 <eprosima::fastdds::dds::DomainParticipantFactory::get_shared_instance()::instance>, __in_chrg=<optimized out>) at /usr/include/c++/13/bits/shared_ptr.h:175
#17 0x00007ffff628ea66 in __run_exit_handlers (status=0, listp=<optimized out>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:108
#18 0x00007ffff628ebae in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:138
#19 0x00007ffff62711d1 in __libc_start_call_main (main=main@entry=0x555555578ae9 <main(int, char**)>, argc=argc@entry=10, argv=argv@entry=0x7fffffffbfd8) at ../sysdeps/nptl/libc_start_call_main.h:74
#20 0x00007ffff627128b in __libc_start_main_impl (main=0x555555578ae9 <main(int, char**)>, argc=10, argv=0x7fffffffbfd8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, 
    stack_end=0x7fffffffbfc8) at ../csu/libc-start.c:360
#21 0x0000555555577fd5 in _start ()

As for why cyclonedds doesn't have this issue, I believe it's because cyclonedds doesn't release its resources when the program exits (fastdds use Destructor). Therefore, when the static executor is being destructed, it can still access those resources. These resources are eventually reclaimed by the operating system.

So I think this code should be changed.
https:/moveit/moveit_task_constructor/blob/bb047c894cd2a395a672abdd47ec76d709d1cb1e/core/src/introspection.cpp#L160

@Barry-Xu-2018
Copy link
Contributor

In addition, I found, after canceling, the Executor did not correctly release shared pointer of resources in time. In this case, executor is static. So fastdds resources is released by Destructor before Executor free them.

Before return, wait_result_.reset() should be called.
https:/ros2/rclcpp/blob/d8d83a0ee67b7b6427defc802005c382b218459d/rclcpp/src/rclcpp/executor.cpp#L888
The above modification can also fix this issue.

After carefully verifying the impact, I will submit the fix to the rclcpp repository.

@Barry-Xu-2018
Copy link
Contributor

I have created a PR ros2/rclcpp#2556 to fix the issue in Executor.

@rhaschke
Copy link
Author

rhaschke commented Jul 9, 2024

I only found time to come back to this issue now. Removing the static executor resolves the issue for me.
Thanks, for investigating!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants