
PVF: Refactor execute worker errors, treat more as internal #2604

Closed
wants to merge 13 commits

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented Dec 4, 2023

Addresses points (1) and (2) of #2195. Recommended to review individual commits rather than the whole diff.

Instead of addressing points (3) and (4), I plan to move on after this PR. Preparation is not so security-critical, and it's also not clear how preparation will change with PolkaVM.

@eagr: Heads up, in case you're working on the error refactor.

@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Dec 4, 2023
@mrcnski mrcnski self-assigned this Dec 4, 2023
Since the errors now derive thiserror::Error they also get `Display`
automatically. 🤠
A bit unrelated to this PR, but was it worth opening a new one? 🤔
This lint doesn't work with multiple targets (in the case of prepare-worker, the
bench-only dependencies were messing it up). See:

- rust-lang/rust#95513
- rust-lang/rust#57274 (comment)
@mrcnski mrcnski added the R0-silent Changes should not be mentioned in any release notes label Dec 4, 2023
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks good to me, brings much more structure into error processing. Left a comment with my doubts, but it's not a blocker.

@@ -149,10 +154,26 @@ pub fn worker_entrypoint(
     let worker_pid = process::id();
     let artifact_path = worker_dir::execute_artifact(&worker_dir_path);

-    let Handshake { executor_params } = recv_execute_handshake(&mut stream)?;
+    let Handshake { executor_params } = match recv_execute_handshake(&mut stream) {
Contributor

Is this change (and the following similar changes) worth it? It's a complication, not a big one, but still a couple of dozen lines. Does it make things better? I believe not. Neither the handshake nor the request contains untrusted data, and an error here would mean everything went terribly wrong. It doesn't make much sense to me to communicate that back to the host. If communication is that badly broken at this point, is it even possible to communicate anything back?

Contributor Author

Good point! I can revert the HostCommunication errors here.

Contributor Author

Thinking about it more, these few lines can prevent disputes. And on the host-side we should also return an internal error. (That comment won't apply after the namespacing done in #2477.)

I know it's verbose but it seems like a low cost to prevent disputes. I just wonder if there's a way to make it less verbose.

Contributor Author

Looks like we can use map_err to save one line. :P
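For the record, a sketch of the `map_err` one-liner, assuming hypothetical stand-ins for the real PVF types (`Handshake`, `recv_execute_handshake`, and an `InternalError` variant are illustrative only):

```rust
use std::io;

// Stand-in types; the real handshake lives in the PVF worker crates.
struct Handshake {
    executor_params: Vec<u8>,
}

#[derive(Debug)]
enum InternalError {
    HostCommunication(String),
}

// Stand-in for the real receive function, which returns an io error on failure.
fn recv_execute_handshake() -> io::Result<Handshake> {
    Ok(Handshake { executor_params: vec![] })
}

fn main() -> Result<(), InternalError> {
    // One line with `map_err` + `?` instead of a multi-line `match`:
    let Handshake { executor_params } = recv_execute_handshake()
        .map_err(|e| InternalError::HostCommunication(e.to_string()))?;
    assert!(executor_params.is_empty());
    Ok(())
}
```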

Contributor

Okay, one more thought, then. The worker main loop is a closure that returns io::Result<Never> to the run_worker() function. Maybe those errors should be handled on the run_worker() level? That would still allow us to use ? in the main loop and keep error handling simple. Not sure it makes total sense, but it probably is worth considering.
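A sketch of that idea, under the stated assumptions (the `run_worker` and `Never` names follow the comment above; the body and signature are illustrative, not the actual implementation): the main loop returns `io::Result<Never>`, so it can use `?` freely, and all error handling collapses into one place in `run_worker`.

```rust
use std::io;

// An uninhabited type: the loop can never return `Ok`, only an error.
enum Never {}

// Hypothetical run_worker: runs the event loop and handles its error
// in one place. Returns the error here so the sketch is testable;
// the real code would log and exit the process instead.
fn run_worker<F>(mut event_loop: F) -> io::Error
where
    F: FnMut() -> io::Result<Never>,
{
    match event_loop() {
        // Unreachable: `Never` has no values, so this match is empty.
        Ok(never) => match never {},
        Err(e) => e,
    }
}

fn main() {
    let err = run_worker(|| {
        // Inside the loop, `?` would propagate any io::Error out here.
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "host disconnected"))
    });
    assert_eq!(err.kind(), io::ErrorKind::BrokenPipe);
}
```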

Contributor Author

Thanks, that's a great idea. Right now we quit the workers on io::Error but continue looping (most of the time) on other kinds of errors. Instead, the worker should always quit whenever any error occurs. Then we could make run_worker generic over both ExecuteResult and PrepareResult.

The question is, can the worker always just quit if an error occurred?

  • execute: yes, an Err response always means that the worker dies (at least, it will after this refactor)
  • prepare: no, it’s not so clear in the un-refactored code here but we return an error if compilation fails. We still want to keep the current behavior (keeping the worker alive in that case).

I’d like to refactor the prepare errors which would enable the run_worker refactor. But it’s more work which is probably going to become obsolete soon. It would also create more merge conflicts with the other PRs. So I will drop this idea to refactor run_worker as well, although it’s a good idea and I would like to do it.

We can raise a follow-up just in case, and I'll push what I already implemented. There is some duplication of code, but this code may not need to be maintained for much longer, anyway, and it should make things a bit more robust in the meantime.

Contributor Author

I realize that send_error wrapping the error in an io::Error is pretty weird. I thought about doing the refactor described above, but then I remembered that killing the prepare-worker on error causes the host to send a FromPool::Rip signal. This may race with the actual error sent to the host from the worker before it dies, though I'm not sure what exactly happens. (Note that there is no race in the equivalent execute-worker code, so the code we have now works - internal errors should get reported.) Clearly this also requires a deeper look, so it should also be a follow-up. It can be addressed later if some of this code remains when PolkaVM is integrated.

@mrcnski
Contributor Author

mrcnski commented Jan 4, 2024

bot fmt

@command-bot

command-bot bot commented Jan 4, 2024

@mrcnski https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-fe7ed8fd-f6c7-478e-96fb-f6187185f5f2 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Jan 4, 2024

@mrcnski Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728/artifacts/download.

@bkchr
Member

bkchr commented Apr 8, 2024

@s0me0ne-unkn0wn what is the status of this?

@s0me0ne-unkn0wn
Contributor

@bkchr this is a useful PR, but it diverged a lot from the current master and needs a lot of love to push it to merge. I wonder if any of the external contributors want to sort it out. CC @eagr @jpserrat @maksimryndin

@maksimryndin
Contributor

maksimryndin commented Apr 9, 2024

> @bkchr this is a useful PR, but it diverged a lot from the current master and needs a lot of love to push it to merge. I wonder if any of the external contributors want to sort it out. CC @eagr @jpserrat @maksimryndin

Hi @s0me0ne-unkn0wn, I can take this issue. What is the best way to handle the divergence: start a new branch and bring all the changes or try to resolve conflicts in this one?

@s0me0ne-unkn0wn
Contributor

@maksimryndin great, very much appreciated! I'll assign it to you then. Don't forget to publish your Kusama address!

@s0me0ne-unkn0wn
Contributor

> What is the best way to handle the divergence: start a new branch and bring all the changes or try to resolve conflicts in this one?

It's up to you, I don't have a strong opinion here, it would depend on how much effort is needed to resolve those conflicts. Try it, maybe you'll find out that a new branch is a good idea :)

@maksimryndin
Contributor

> @maksimryndin great, very much appreciated! I'll assign it to you then. Don't forget to publish your Kusama address!

@s0me0ne-unkn0wn I think it is ready for review: #4071. One thing I wonder about: since we change the encoding of messages between the host and workers, it shouldn't be a problem because the upgrade is applied to both the main binary and the workers, right?

github-merge-queue bot pushed a commit that referenced this pull request Apr 19, 2024
follow up of #2604
closes #2604

- [x] take relevant changes from Marcin's PR 
- [x] extract common duplicate code for workers (low-hanging fruits)

~~Some failing CI problems are more general and should be fixed in master (see #4074).~~

Proposed labels: **T0-node**, **R0-silent**, **I4-refactor**

-----

kusama address: FZXVQLqLbFV2otNXs6BMnNch54CFJ1idpWwjMb3Z8fTLQC6

---------

Co-authored-by: s0me0ne-unkn0wn <[email protected]>
@jpserrat
Contributor

Hey @s0me0ne-unkn0wn, I was on vacation this last month 😞 . Let me know if you need help with another issue!
