Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove catchall-rethrow indirections in the Device state wrappers and as a consequence also the ERROR transition #372

Open
dennisklein opened this issue Jun 21, 2021 · 4 comments
Labels
feature New feature or request
Milestone

Comments

@dennisklein
Copy link
Member

dennisklein commented Jun 21, 2021

In an offline discussion we (@rbx, @davidrohr, and myself) agreed (correct me if I am wrong) that having a cooperative signaling about an imminent terminate inside the about-to-terminate process is redundant, because a controller has to monitor and handle terminate events anyways from outside. Removing the internal signaling logic (catchall-transition-to-ERROR-rethrow) will improve debuggability.

This likely opens the possibility to remove the FairMQ ERROR state completely AFAICS. If we do not find any surprises, I suggest to remove it together with the change above.

Once we have a PR ready, let us collect approval by all Alice O2 subsystems, that might potentially be affected, since I really lost overview who could be relying on the ERROR state): EPN, DD, DPL, QC, FLP. Anyone else?

@dennisklein dennisklein added the feature New feature or request label Jun 21, 2021
@rbx
Copy link
Member

rbx commented Jun 21, 2021

I think to remove it completely we should at least have something equivalent in the SDK that would notify of the termination. Something on the controller side or in between should switch the device state from whatever it was when error happened to an Error State or similar.

I think leaving the ErrorState here is the simplest, even if the device doesn't automatically go into it (which we anyway cannot guarantee for crashes/terminations). Even if only for a reason that some controller may already have it in use. But perhaps it needs to be extended with further infos, like exit code.

@dennisklein
Copy link
Member Author

dennisklein commented Jun 21, 2021

I think to remove it completely we should at least have something equivalent in the SDK that would notify of the termination. Something on the controller side or in between should switch the device state from whatever it was when error happened to an Error State or similar.

I think leaving the ErrorState here is the simplest, even if the device doesn't automatically go into it (which we anyway cannot guarantee for crashes/terminations). Even if only for a reason that some controller may already have it in use. But perhaps it needs to be extended with further infos, like exit code.

True, then let's revisit a possible removal of the ERROR state once we have proper DDS APIs to monitor the task termination from the SDK. (cc: @AnarManafov, @AndreyLebedev just FYI)

@AnarManafov
Copy link

AnarManafov commented Jun 21, 2021

@dennisklein , DDS ToolAPI now supports subscription on onTaskDone events. I have finished the implementation and pushed the code to the master last week.
Example:

// create a session
SOnTaskDoneRequest::request_t request;
SOnTaskDoneRequest::ptr_t requestPtr = SOnTaskDoneRequest::makeRequest(request);
int nTaskDoneCount{ 0 };
requestPtr->setResponseCallback(
               [&nTaskDoneCount](const SOnTaskDoneResponseData& _info)
               {
                 ++nTaskDoneCount;
                 cout << "Recieved onTaskDone event. TaskID: " << _info.m_taskID
                      << " ; ExitCode: " << _info.m_exitCode
                      << " ; Signal: " << _info.m_signal;
               });
 session.sendRequest<SOnTaskDoneRequest>(requestPtr);
  ... 

@dennisklein
Copy link
Member Author

for reference: AliceO2Group/AliceO2#6499

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants