Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Overseer: subsystems communicate directly #2227

Merged
23 commits merged into from
Mar 28, 2021
Merged

Overseer: subsystems communicate directly #2227

23 commits merged into from
Mar 28, 2021

Conversation

rphmeier
Copy link
Contributor

@rphmeier rphmeier commented Jan 8, 2021

The previous behavior was that messages would go first to the overseer and then be relayed onwards to the destination subsystem.

Now, subsystems communicate via channels directly to other subsystems. Additionally, they have the option of either sending a message through a bounded or unbounded channel. The unbounded channels should be used sparingly in situations where:

  1. There is a communication cycle between subsystems
  2. One of the parts of the cycle has a clear bound on the number of messages produced.

A few examples of places we might use send_unbounded_message are when we send locally generated backing statements to statement distribution or when we send assignments and approvals from approval voting to the approval distribution subsystem. This avoids deadlock when a network subsystem is .awaiting a send to the core subsystem, while the core subsystem sends to the networking subsystem at the same time. If buffers are full, that will deadlock. Unbounded channels avoids this situation.

SubsystemContext and the overseer itself have an invariant that when a subsystem sends a message after receiving a signal from the overseer, that the recipient of the message will not receive it until having observed the same signal. This allows subsystems to assume that other subsystems are aware of the same view of blocks as they are.

My change-set here upholds the same invariant by having each subsystem tag each message with the number of signals that it had observed before sending the message. If the recipient has not received at least that many signals, it pockets the message until it has received enough signals. This rests on the fact that the overseer sends signals to all subsystems equally. If one subsystem receives a signal, others can expect to receive it shortly.

I have also given SubsystemContext a SubsystemSender assoc. type which will be used to split sending and receiving. Follow-up PRs will make use of this feature.

@rphmeier rphmeier added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. A3-in_progress Pull request is in progress. No review needed at this stage. and removed A0-please_review Pull request needs code review. A3-in_progress Pull request is in progress. No review needed at this stage. labels Jan 8, 2021
@eskimor
Copy link
Member

eskimor commented Jan 8, 2021

Our use of channels is still likely to lead to deadlocks, because we .await on both sides unconditionally. Future refactoring should alter subsystems to time-out messages to other subsystems so that progress can be made even under heavy load.

I think we could do better than this, but I need to check a few things first and think a bit more about it.

Copy link
Contributor

@coriolinus coriolinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Core logic in OverseerSubsystemContext::recv LGTM. I agree that this maintains the message invariant.

I'm pleasantly surprised that making this change didn't require changing any of the individual subsystems.

node/overseer/src/lib.rs Outdated Show resolved Hide resolved
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
signals_received: usize,
message: AllMessages,
) {
fn make_packet<T>(timer: MaybeTimer, signals_received: usize, message: T) -> MessagePacket<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's too bad we don't have HKTs yet; if we did, you could simplify this into

let make_packet = for<T> |message: T| { ... }

and simply capture the values of timer and signals_received, which would simplify usage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well we can do without, if we are sure this is the way to go, we can just expose the subsystem senders directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought when I was writing the code, but unfortunately no HKT yet

if res.is_err() {
tracing::debug!(
target: LOG_TARGET,
"Failed to send a message to another subsystem",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that at some point we're going to want to debug which particular subsystem was the intended recipient of the failed message. Maybe an api like

impl AllMessages {
  pub fn recipient_name(&self) -> &'static str { ... }
}

would make that simple to add to this debug message.

Copy link
Member

@eskimor eskimor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible to me.

Copy link
Contributor

@drahnr drahnr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation wise, it looks good.

As a side effect here we lose the overseer as our probe point for messages though.

Co-authored-by: Peter Goodspeed-Niklaus <[email protected]>
@rphmeier rphmeier closed this Jan 8, 2021
@rphmeier rphmeier reopened this Mar 27, 2021
node/overseer/src/lib.rs Outdated Show resolved Hide resolved
prometheus::Opts::new(
"parachain_to_overseer_sent",
"Number of elements sent by subsystems to overseer",
"parachain_overseer_signals_sent",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming nit: what's the difference between parachain_subsystem and parachain_overseer prefix? They both are parameterized by subsystem_name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subsystem_signals_sent sounds like the subsystems are sending the signals but they are actually receiving them. overseer_unbounded_sent has the same issue in the opposite direction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are OverseerSignals so overseer_signals_sent makes sense I think. But I could rename subsystem_bounded_sent etc.. not sure to what

node/overseer/src/lib.rs Outdated Show resolved Hide resolved
node/subsystem/src/lib.rs Outdated Show resolved Hide resolved
@rphmeier
Copy link
Contributor Author

bot merge

@ghost
Copy link

ghost commented Mar 28, 2021

Waiting for commit status.

@ghost ghost merged commit 322ccd0 into master Mar 28, 2021
@ghost ghost deleted the rh-messaging-refactor branch March 28, 2021 15:55
ordian added a commit that referenced this pull request Mar 28, 2021
* master:
  Call NetworkService::add_known_address before sending a request (#2726)
  Overseer: subsystems communicate directly (#2227)
  Request based PoV distribution (#2640)
rphmeier added a commit that referenced this pull request Mar 28, 2021
* overseer: pass messages directly between subsystems

* test that message is held on to

* Update node/overseer/src/lib.rs

Co-authored-by: Peter Goodspeed-Niklaus <[email protected]>

* give every subsystem an unbounded sender too

* remove metered_channel::name

1. we don't provide good names
2. these names are never used anywhere

* unused mut

* remove unnecessary &mut

* subsystem unbounded_send

* remove unused MaybeTimer

We have channel size metrics that serve the same purpose better now and the implementation of message timing was pretty ugly.

* remove comment

* split up senders and receivers

* update metrics

* fix tests

* fix test subsystem context

* fix flaky test

* fix docs

* doc

* use select_biased to favor signals

* Update node/subsystem/src/lib.rs

Co-authored-by: Andronik Ordian <[email protected]>

Co-authored-by: Peter Goodspeed-Niklaus <[email protected]>
Co-authored-by: Andronik Ordian <[email protected]>
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants