
PVF: Refactor execute worker errors, treat more as internal #2604

Closed
wants to merge 13 commits

Conversation

mrcnski
Contributor

@mrcnski mrcnski commented Dec 4, 2023

Addresses points (1) and (2) of #2195. Recommended to review individual commits rather than the whole diff.

Instead of addressing points (3) and (4), I plan to move on after this PR. Preparation is not so security-critical, and it's also not clear how preparation will change with PolkaVM.

@eagr: Heads up, in case you're working on the error refactor.

@mrcnski mrcnski added the T0-node This PR/Issue is related to the topic “node”. label Dec 4, 2023
@mrcnski mrcnski self-assigned this Dec 4, 2023
Since the errors now derive thiserror::Error they also get `Display`
automatically. 🤠
A bit unrelated to this PR, but was it worth opening a new one? 🤔
This lint doesn't work with multiple targets (in the case of prepare-worker, the
bench-only dependencies were messing it up). See:

- rust-lang/rust#95513
- rust-lang/rust#57274 (comment)
@mrcnski mrcnski added the R0-silent Changes should not be mentioned in any release notes label Dec 4, 2023
Contributor

@s0me0ne-unkn0wn s0me0ne-unkn0wn left a comment


Looks good to me, brings much more structure into error processing. Left a comment with my doubts, but it's not a blocker.

@@ -149,10 +154,26 @@ pub fn worker_entrypoint(
     let worker_pid = process::id();
     let artifact_path = worker_dir::execute_artifact(&worker_dir_path);

-    let Handshake { executor_params } = recv_execute_handshake(&mut stream)?;
+    let Handshake { executor_params } = match recv_execute_handshake(&mut stream) {
Contributor

Is this change (and the following similar changes) worth it? It's a complication, not a big one, but still a couple of dozen lines. Does it make things better? I believe not. Neither the handshake nor the request contains untrusted data, and an error here would mean everything went terribly wrong. It doesn't make much sense to me to communicate that back to the host. If communication is that badly broken at this point, is it even possible to communicate anything back?

Contributor Author

Good point! I can revert the HostCommunication errors here.

Contributor Author

Thinking about it more, these few lines can prevent disputes. And on the host-side we should also return an internal error. (That comment won't apply after the namespacing done in #2477.)

I know it's verbose but it seems like a low cost to prevent disputes. I just wonder if there's a way to make it less verbose.

Contributor Author

Looks like we can use map_err to save one line. :P
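For the record, a sketch of the `map_err` one-liner, assuming hypothetical stand-ins for the real PVF types (`Handshake`, `recv_execute_handshake`, and an `InternalError` variant are illustrative only):

```rust
use std::io;

// Stand-in types; the real handshake lives in the PVF worker crates.
struct Handshake {
    executor_params: Vec<u8>,
}

#[derive(Debug)]
enum InternalError {
    HostCommunication(String),
}

// Stand-in for the real receive function, which returns an io error on failure.
fn recv_execute_handshake() -> io::Result<Handshake> {
    Ok(Handshake { executor_params: vec![] })
}

fn main() -> Result<(), InternalError> {
    // One line with `map_err` + `?` instead of a multi-line `match`:
    let Handshake { executor_params } = recv_execute_handshake()
        .map_err(|e| InternalError::HostCommunication(e.to_string()))?;
    assert!(executor_params.is_empty());
    Ok(())
}
```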

Contributor

Okay, one more thought, then. The worker main loop is a closure that returns io::Result<Never> to the run_worker() function. Maybe those errors should be handled on the run_worker() level? That would still allow us to use ? in the main loop and keep error handling simple. Not sure it makes total sense, but it probably is worth considering.
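A sketch of that idea, under the stated assumptions (the `run_worker` and `Never` names follow the comment above; the body and signature are illustrative, not the actual implementation): the main loop returns `io::Result<Never>`, so it can use `?` freely, and all error handling collapses into one place in `run_worker`.

```rust
use std::io;

// An uninhabited type: the loop can never return `Ok`, only an error.
enum Never {}

// Hypothetical run_worker: runs the event loop and handles its error
// in one place. Returns the error here so the sketch is testable;
// the real code would log and exit the process instead.
fn run_worker<F>(mut event_loop: F) -> io::Error
where
    F: FnMut() -> io::Result<Never>,
{
    match event_loop() {
        // Unreachable: `Never` has no values, so this match is empty.
        Ok(never) => match never {},
        Err(e) => e,
    }
}

fn main() {
    let err = run_worker(|| {
        // Inside the loop, `?` would propagate any io::Error out here.
        Err(io::Error::new(io::ErrorKind::BrokenPipe, "host disconnected"))
    });
    assert_eq!(err.kind(), io::ErrorKind::BrokenPipe);
}
```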

Contributor Author

Thanks, that's a great idea. Right now we quit the workers on io::Error but continue looping (most of the time) on other kinds of errors. Instead, the worker should always quit whenever any error occurs. Then we could make run_worker generic over both ExecuteResult and PrepareResult.

The question is, can the worker always just quit if an error occurred?

  • execute: yes, an Err response always means that the worker dies (at least, it will after this refactor)
  • prepare: no, it’s not so clear in the un-refactored code here but we return an error if compilation fails. We still want to keep the current behavior (keeping the worker alive in that case).

I’d like to refactor the prepare errors which would enable the run_worker refactor. But it’s more work which is probably going to become obsolete soon. It would also create more merge conflicts with the other PRs. So I will drop this idea to refactor run_worker as well, although it’s a good idea and I would like to do it.

We can raise a follow-up just in case, and I'll push what I already implemented. There is some duplication of code, but this code may not need to be maintained for much longer, anyway, and it should make things a bit more robust in the meantime.

Contributor Author

I realize that send_error wrapping the error in an io::Error is pretty weird. I thought about doing the refactor described above, but then I remembered that killing the prepare-worker on error causes the host to send a FromPool::Rip signal. This may race with the actual error sent to the host from the worker before it dies, though I'm not sure what exactly happens. (Note that there is no race in the equivalent execute-worker code, so the code we have now works - internal errors should get reported.) Clearly this also requires a deeper look, so it should also be a follow-up. It can be addressed later if some of this code remains when PolkaVM is integrated.

@mrcnski
Contributor Author

mrcnski commented Jan 4, 2024

bot fmt

@command-bot

command-bot bot commented Jan 4, 2024

@mrcnski https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728 was started for your command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh". Check out https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/pipelines?page=1&scope=all&username=group_605_bot to know what else is being executed currently.

Comment bot cancel 1-fe7ed8fd-f6c7-478e-96fb-f6187185f5f2 to cancel this command or bot cancel to cancel all commands in this pull request.

@command-bot

command-bot bot commented Jan 4, 2024

@mrcnski Command "$PIPELINE_SCRIPTS_DIR/commands/fmt/fmt.sh" has finished. Result: https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728 has finished. If any artifacts were generated, you can download them from https://gitlab.parity.io/parity/mirrors/polkadot-sdk/-/jobs/4837728/artifacts/download.

@bkchr
Member

bkchr commented Apr 8, 2024

@s0me0ne-unkn0wn what is the status of this?

@s0me0ne-unkn0wn
Contributor

@bkchr this is a useful PR, but it diverged a lot from the current master and needs a lot of love to push it to merge. I wonder if any of the external contributors want to sort it out. CC @eagr @jpserrat @maksimryndin

@maksimryndin
Contributor

maksimryndin commented Apr 9, 2024

> @bkchr this is a useful PR, but it diverged a lot from the current master and needs a lot of love to push it to merge. I wonder if any of the external contributors want to sort it out. CC @eagr @jpserrat @maksimryndin

Hi @s0me0ne-unkn0wn, I can take this issue. What is the best way to handle the divergence: start a new branch and bring all the changes or try to resolve conflicts in this one?

@s0me0ne-unkn0wn
Contributor

@maksimryndin great, very much appreciated! I'll assign it to you then. Don't forget to publish your Kusama address!

@s0me0ne-unkn0wn
Contributor

> What is the best way to handle the divergence: start a new branch and bring all the changes or try to resolve conflicts in this one?

It's up to you, I don't have a strong opinion here, it would depend on how much effort is needed to resolve those conflicts. Try it, maybe you'll find out that a new branch is a good idea :)

@maksimryndin
Contributor

> @maksimryndin great, very much appreciated! I'll assign it to you then. Don't forget to publish your Kusama address!

@s0me0ne-unkn0wn I think it is ready for review: #4071. One thing I wonder about: since we change the encoding of messages between the host and workers, it shouldn't be a problem because the upgrade is applied to both the main binary and the workers, right?

github-merge-queue bot pushed a commit that referenced this pull request Apr 19, 2024
follow up of #2604
closes #2604

- [x] take relevant changes from Marcin's PR 
- [x] extract common duplicate code for workers (low-hanging fruits)

~~Some failing CI problems are more general and should be fixed in master (see #4074).~~

Proposed labels: **T0-node**, **R0-silent**, **I4-refactor**

-----

kusama address: FZXVQLqLbFV2otNXs6BMnNch54CFJ1idpWwjMb3Z8fTLQC6

---------

Co-authored-by: s0me0ne-unkn0wn <[email protected]>
@jpserrat
Contributor

Hey @s0me0ne-unkn0wn, I was on vacation this last month 😞 . Let me know if you need help with another issue!
