
[otbn] Add an error bit for incoming lifecycle escalations #7625

Closed
imphil opened this issue Aug 5, 2021 · 13 comments · Fixed by #7897
Comments

@imphil
Contributor

imphil commented Aug 5, 2021

Currently, we have two registers to indicate what error happened: ERR_BITS, and FATAL_ALERT_CAUSE.

  • ERR_BITS is only set if OTBN is executing software.
  • FATAL_ALERT_CAUSE is additionally set for all fatal errors which then result in a fatal alert.

We have one interesting case: The reaction to an incoming lifecycle escalation signal is specified as "An escalation request signaled through the lc_escalate_en_i signal results in the same action as a fatal error but does not raise a fatal alert."

This wording also implies that we don't set the FATAL_ALERT_CAUSE register (which is only specified as part of the fatal alert, not the fatal error).

I'm suggesting that we also set FATAL_ALERT_CAUSE for incoming lifecycle escalations, and make the lifecycle escalation just another fatal error.

So here's my proposal:

  • Add a new fatal_err_lc_escalation error bit, or something like that (see the sketch after this list).
  • (Optional) Rename FATAL_ALERT_CAUSE to FATAL_ERR_CAUSE to indicate it's about the error, not the alert.
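
As an illustration only, here is a minimal sketch of how such a bit could slot into the existing fatal-error group; all names and bit positions below are hypothetical and do not reflect the actual OTBN register map:

```systemverilog
// Hypothetical sketch only: names and bit positions are invented for
// illustration and are not the real OTBN ERR_BITS layout.
package otbn_err_bits_sketch_pkg;
  // Software/recoverable errors (illustrative subset).
  parameter int unsigned ErrBitBadDataAddr = 0;
  parameter int unsigned ErrBitBadInsnAddr = 1;
  // Fatal errors (illustrative subset).
  parameter int unsigned ErrBitFatalImem   = 16;
  parameter int unsigned ErrBitFatalDmem   = 17;
  parameter int unsigned ErrBitFatalReg    = 18;
  // Proposed addition: incoming life cycle escalation as just another fatal error.
  parameter int unsigned ErrBitFatalLcEscalation = 19;
endpackage
```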

Motivation:

  • Consistency: Treat an incoming lifecycle escalation in the same way as any other fatal error, instead of "it's the same except for xyz".
  • Testability: I'm sure we'll have a way to trigger lifecycle escalation signals for test purposes. We then want to check if OTBN got the signal by reading a status register.

The estimate covers the spec change and RTL implementation.

estimate 4

@imphil imphil added the Component:Doc and Component:OTBN labels Aug 5, 2021
@rswarbrick
Contributor

rswarbrick commented Aug 6, 2021

Sounds like a good plan to me.

Does the system require that we don't send an alert when we get an incoming lifecycle escalation? If not, maybe we should do that as well, which would keep everything looking a bit more consistent.

@imphil
Contributor Author

imphil commented Aug 6, 2021

Does the system require that we don't send an alert when we get an incoming lifecycle escalation?

@msfschaffner Is there a requirement one way or the other? I.e. do you get into a strange loop if we send a fatal alert in reaction to a lifecycle escalation? (Probably not?)

@msfschaffner
Contributor

Thanks for bringing this up!

From a response point of view I think it does not matter much, since the system will be rendered totally non-functional upon assertion of lc_escalate_en (the life cycle state is virtually scrapped at the same time, which disables the CPU and several other things). We also have other places in the system where lc_escalate_en implicitly triggers fatal alerts due to e.g. FSMs that are moved into terminal error states.

@tjaychen @cdgori One thing we may want to think about, however, is the alert handler's cause-register "crashdump". In a scenario where the system goes through all escalation phases, many alerts are going to trigger as a result of the lc_escalate_en assertion, and these alerts will still be registered in the alert handler cause registers. This could possibly mask the "real" cause that led to escalation. So, should we take the crashdump snapshot as soon as we enter the escalation protocol, to avoid such masking?

Note that this should not be of concern when the firmware reads out the alert cause registers as a result of an NMI escalation, since at that point, lc_escalate_en has not yet been triggered.
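
As a rough illustration of the snapshot-latching idea above (not the actual alert handler implementation; all signal names are invented), the crashdump could be captured once, when the escalation protocol is entered, and then held:

```systemverilog
// Minimal sketch of the idea discussed above (signal names invented):
// latch the alert-cause snapshot once, when the escalation protocol starts,
// so later alerts caused by lc_escalate_en fan-out do not overwrite it.
module crashdump_latch_sketch #(
  parameter int unsigned NAlerts = 64
) (
  input  logic               clk_i,
  input  logic               rst_ni,
  input  logic               esc_protocol_start_i, // escalation protocol entered
  input  logic [NAlerts-1:0] alert_cause_i,        // live alert cause bits
  output logic [NAlerts-1:0] crashdump_snapshot_o
);
  logic latched_q;
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      latched_q            <= 1'b0;
      crashdump_snapshot_o <= '0;
    end else if (esc_protocol_start_i && !latched_q) begin
      // Capture the causes exactly once, at the start of escalation.
      latched_q            <= 1'b1;
      crashdump_snapshot_o <= alert_cause_i;
    end
  end
endmodule
```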

@tjaychen

tjaychen commented Aug 6, 2021

Oh, that's a good point. Yes, I think you're right: we want to capture the snapshot as we start the escalation process, and not capture the fallout from it.

@msfschaffner
Contributor

CC @cindychip

@msfschaffner
Contributor

Ok I made a separate issue for this.

@imphil imphil added this to the OTBN-S28 milestone Aug 9, 2021
@cdgori

cdgori commented Aug 9, 2021

OK, I see the issue and agree with @tjaychen - we should capture the snapshot early on and not retake it.

I also think I agree with @imphil's original proposal to unify the handling of fatal alerts from an OTBN perspective, unless there is something magical about the LC escalation that I don't understand. (As @msfschaffner noted, the system will be very quickly rendered nonfunctional in the event of this kind of escalation.)

@GregAC
Contributor

GregAC commented Aug 20, 2021

One thing to note is that this proposal would have us setting the fatal_err_lc_escalation bit in ERR_BITS even when OTBN isn't running, if we see a lifecycle escalation. This is perfectly doable; however, it clashes a bit with how we deal with bus integrity errors when OTBN isn't running (we leave ERR_BITS alone). The spec has this to say:

Note that OTBN can detect some errors even when it isn’t running. One example of this is an error caused by an integrity error when reading or writing OTBN’s memories over the bus. In this case, the ERR_BITS register will not change. This avoids race conditions with the host processor’s error handling software.

This indicates that we want to avoid setting ERR_BITS if OTBN isn't running.

So do we want to revise that restriction, or not set fatal_err_lc_escalation in ERR_BITS if OTBN isn't active?

I guess the race condition is around the software thinking ERR_BITS only changes when OTBN finishes an operation?

@rswarbrick, you wrote that line 9 months ago; can you remember a specific scenario we were worried about? @imphil, any recollections on your side?

@rswarbrick
Contributor

Yep, the point is that you don't want weird race conditions for software where ERR_BITS changes underfoot. The idea was that ERR_BITS is updated synchronously, at the end of an OTBN run, while FATAL_ALERT_CAUSE is updated asynchronously, whenever something bad happens.
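
A minimal sketch of that split update policy, with invented signal names and an arbitrary register width, just to make the two behaviors concrete (not the actual OTBN RTL):

```systemverilog
// Sketch only: ERR_BITS latches the accumulated error bits when the run
// finishes, while FATAL_ALERT_CAUSE is sticky and records fatal causes
// whenever they occur. All names and widths are invented for illustration.
module err_reg_update_sketch (
  input  logic        clk_i,
  input  logic        rst_ni,
  input  logic        op_done_i,            // the current OTBN run finishes
  input  logic [31:0] err_bits_collected_i, // error bits gathered during the run
  input  logic [31:0] fatal_cause_pulse_i,  // fatal causes, may pulse at any time
  output logic [31:0] err_bits_q_o,         // software-visible ERR_BITS
  output logic [31:0] fatal_alert_cause_q_o // software-visible FATAL_ALERT_CAUSE
);
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      err_bits_q_o          <= '0;
      fatal_alert_cause_q_o <= '0;
    end else begin
      // ERR_BITS only changes when a run completes, so software never sees it
      // change underfoot.
      if (op_done_i) begin
        err_bits_q_o <= err_bits_collected_i;
      end
      // FATAL_ALERT_CAUSE is sticky and updates regardless of the run state.
      fatal_alert_cause_q_o <= fatal_alert_cause_q_o | fatal_cause_pulse_i;
    end
  end
endmodule
```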

@tjaychen

btw @imphil @rswarbrick @GregAC
do we want to make what you're discussing here the canonical route for OT? Because blocks like keymgr and aes will eventually see something very similar.

Those blocks do not run software, but similarly they can detect errors when there is an operation running vs when it is not.

The scheme in keymgr is very similar to what you guys are describing here, i.e. ERR_CODE (my version of ERR_BITS) is set when there's an operation running, and FAULT_STATUS (my version of FATAL_ERR_CAUSE) is set asynchronously.

I feel this is worth standardizing across the project (if not naming, then at least the behavior), since this question will undoubtedly come up again and again.

Let me know what you think. We don't have to go back and change every block now, but it would be good to set up guidance for future blocks.

@tjaychen

Adding @weicaiyang since we are discussing a very similar thing in #7772.

@weicaiyang
Contributor

I feel this is worth standardizing across the project (if not naming, then at least the behavior), since this question will undoubtedly come up again and again.

Agree. If we also unify the naming (both the alert name and the CSR status name), it will really help DV.
We have automatic tests to verify these common integrity errors and shadowed register errors. If the names are different, we need to add a bunch of arrays for each DV owner to update the names. If the behaviors are different, we need to add callbacks. We have done this for integrity errors, but @cindychip is still looking for the best approach to handle the shadowed registers.

@imphil imphil modified the milestones: OTBN-S28, OTBN-S29 Aug 23, 2021
imphil added a commit to imphil/opentitan that referenced this issue Aug 24, 2021
Before, the incoming life cycle escalation handling was specified as
"like a fatal alert, except for XZY." This commit simplifies the
specification to handle an incoming life cycle escalation in the same
way as a fatal error that was detected within OTBN.

Fixes lowRISC#7625

Signed-off-by: Philipp Wagner <[email protected]>
@imphil
Contributor Author

imphil commented Aug 24, 2021

I've now opened PR #7897 to make the incoming life cycle escalation signal a "normal" fatal error/alert.

The normal fatal error behavior should already address the point @GregAC made: if a fatal error/alert (including an incoming life cycle escalation) is observed during an operation, ERR_BITS is set in addition to FATAL_ALERT_CAUSE and the operation is terminated in the usual way (which includes sending a "done" interrupt to unblock software if it is waiting for the operation to finish). If a fatal error is observed outside of an operation, only FATAL_ALERT_CAUSE is set.
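
For illustration, a minimal sketch of this resolved behavior for the life cycle escalation bit, assuming an invented busy_i signal that indicates an operation is in progress (not the actual OTBN implementation):

```systemverilog
// Sketch of the resolved behavior described above (names invented): a fatal
// cause, here lc escalation, always sets its FATAL_ALERT_CAUSE bit; it is
// only folded into ERR_BITS, and the run terminated with a "done"
// notification, when an operation is in progress.
module fatal_err_behaviour_sketch (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic busy_i,                 // an OTBN operation is in progress
  input  logic fatal_lc_esc_i,         // incoming life cycle escalation (synchronised)
  output logic err_bit_lc_esc_q_o,     // lc escalation bit of ERR_BITS
  output logic alert_cause_lc_esc_q_o, // lc escalation bit of FATAL_ALERT_CAUSE
  output logic done_o                  // terminate the run and raise the done interrupt
);
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      err_bit_lc_esc_q_o     <= 1'b0;
      alert_cause_lc_esc_q_o <= 1'b0;
    end else if (fatal_lc_esc_i) begin
      alert_cause_lc_esc_q_o <= 1'b1;          // always recorded
      if (busy_i) err_bit_lc_esc_q_o <= 1'b1;  // only when a run is active
    end
  end
  // Pulse "done" so software waiting on the run is unblocked.
  assign done_o = busy_i & fatal_lc_esc_i;
endmodule
```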

On the two other things which were discussed as part of this issue:

imphil added a commit that referenced this issue Aug 25, 2021
Before, the incoming life cycle escalation handling was specified as
"like a fatal alert, except for XZY." This commit simplifies the
specification to handle an incoming life cycle escalation in the same
way as a fatal error that was detected within OTBN.

Fixes #7625

Signed-off-by: Philipp Wagner <[email protected]>