-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PVF: Refactor execute worker errors, treat more as internal #2604
Conversation
Since the errors now derive thiserror::Error they also get `Display` automatically. 🤠
A bit unrelated to this PR, but was it worth opening a new one? 🤔
This lint doesn't work with multiple targets (in the case of prepare-worker, the bench-only dependencies were messing it up). See: - rust-lang/rust#95513 - rust-lang/rust#57274 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, brings much more structure into error processing. Left a comment with my doubts, but it's not a blocker.
@@ -149,10 +154,26 @@ pub fn worker_entrypoint( | |||
let worker_pid = process::id(); | |||
let artifact_path = worker_dir::execute_artifact(&worker_dir_path); | |||
|
|||
let Handshake { executor_params } = recv_execute_handshake(&mut stream)?; | |||
let Handshake { executor_params } = match recv_execute_handshake(&mut stream) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change (and the following similar changes) worth it? It's a complication, not a big one, but still, a couple of dozens of lines. Does it make things better? I believe not. Neither handshake nor request contains untrusted data, and an error would mean everything went terribly wrong. It doesn't make much sense to me to communicate that back to the host. If the communication is so much broken at this point, is it even possible to communicate it back?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I can revert the HostCommunication
errors here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more, these few lines can prevent disputes. And on the host-side we should also return an internal error. (That comment won't apply after the namespacing done in #2477.)
I know it's verbose but it seems like a low cost to prevent disputes. I just wonder if there's a way to make it less verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we can use map_err
to save one line. :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, one more thought, then. The worker main loop is a closure that returns io::Result<Never>
to the run_worker()
function. Maybe those errors should be handled on the run_worker()
level? That would still allow us to use ?
in the main loop and keep error handling simple. Not sure it makes total sense, but it probably is worth considering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, that's a great idea. Right now we quit the workers on io::Error
but continue looping (most of the time) on other kinds of errors. But it should be that if any error occurs, the worker always quits. Then we could make run_worker
generic over both ExecuteResult and PrepareResult.
The question is, can the worker always just quit if an error occurred?
- execute: yes, an
Err
response always means that the worker dies (at least, it will after this refactor) - prepare: no, it’s not so clear in the un-refactored code here but we return an error if compilation fails. We still want to keep the current behavior (keeping the worker alive in that case).
I’d like to refactor the prepare errors which would enable the run_worker
refactor. But it’s more work which is probably going to become obsolete soon. It would also create more merge conflicts with the other PRs. So I will drop this idea to refactor run_worker
as well, although it’s a good idea and I would like to do it.
We can raise a follow-up just in case, and I'll push what I already implemented. There is some duplication of code, but this code may not need to be maintained for much longer, anyway, and it should make things a bit more robust in the meantime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize that send_error
wrapping the error in an io::Error
is pretty weird. I thought about doing the refactor described above, but then I remembered that killing the prepare-worker on error causes the host to send a FromPool::Rip
signal. This may race with the actual error sent to the host from the worker before it dies, though I'm not sure what exactly happens. (Note that there is no race in the equivalent execute-worker code, so the code we have now works - internal errors should get reported.) Clearly this also requires a deeper look, so it should also be a follow-up. It can be addressed later if some of this code remains when PolkaVM is integrated.
bot fmt |
@mrcnski https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728 was started for your command Comment |
@mrcnski Command |
@s0me0ne-unkn0wn what is the status of this? |
@bkchr this is a useful PR, but it diverged a lot from the current master and needs a lot of love to push it to merge. I wonder if any of the external contributors want to sort it out. CC @eagr @jpserrat @maksimryndin |
Hi @s0me0ne-unkn0wn, I can take this issue. What is the best way to handle the divergence: start a new branch and bring all the changes or try to resolve conflicts in this one? |
@maksimryndin great, very much appreciated! I'll assign it to you then. Don't forget to publish your Kusama address! |
It's up to you, I don't have a strong opinion here, it would depend on how much effort is needed to resolve those conflicts. Try it, maybe you'll find out that a new branch is a good idea :) |
@s0me0ne-unkn0wn I think it is ready for review #4071. I wonder as we change an encoding of messages between the host and workers, it shouldn't be a problem as the upgrade is done both for the main binary and workers, right? |
follow up of #2604 closes #2604 - [x] take relevant changes from Marcin's PR - [x] extract common duplicate code for workers (low-hanging fruits) ~Some unpassed ci problems are more general and should be fixed in master (see #4074 Proposed labels: **T0-node**, **R0-silent**, **I4-refactor** ----- kusama address: FZXVQLqLbFV2otNXs6BMnNch54CFJ1idpWwjMb3Z8fTLQC6 --------- Co-authored-by: s0me0ne-unkn0wn <[email protected]>
Hey @s0me0ne-unkn0wn, I was on vacation this last month 😞 . Let me know if you need help with another issue! |
Addresses points (1) and (2) of #2195. Recommended to review individual commits rather than the whole diff.
Instead of addressing points (3) and (4), I plan to move on after this PR. Preparation is not so security-critical, and it's also not clear how preparation will change with PolkaVM.
@eagr: Heads up, in case you're working on the error refactor.