
[otbn] Add an error bit for incoming lifecycle escalations #7625

Closed
imphil opened this issue Aug 5, 2021 · 13 comments · Fixed by #7897
Comments

@imphil
Contributor

imphil commented Aug 5, 2021

Currently, we have two registers to indicate what error happened: ERR_BITS, and FATAL_ALERT_CAUSE.

  • ERR_BITS is only set if OTBN is executing software.
  • FATAL_ALERT_CAUSE is additionally set for all fatal errors which then result in a fatal alert.

We have one interesting case: The reaction to an incoming lifecycle escalation signal is specified as "An escalation request signaled through the lc_escalate_en_i signal results in the same action as a fatal error but does not raise a fatal alert."

This wording also implies that we don't set the FATAL_ALERT_CAUSE register (which is only specified as part of the fatal alert, not the fatal error).

I'm suggesting that we also set FATAL_ALERT_CAUSE for incoming lifecycle escalations, and make the lifecycle escalation just another fatal error.

So here's my proposal:

  • Add a new fatal_err_lc_escalation error bit, or something like that (see the sketch after this list).
  • (Optional) Rename FATAL_ALERT_CAUSE to FATAL_ERR_CAUSE to indicate it's about the error, not the alert.
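
As an illustration only, here is a minimal sketch of how such a bit could slot into the existing fatal-error group; all names and bit positions below are hypothetical and do not reflect the actual OTBN register map:

```systemverilog
// Hypothetical sketch only: names and bit positions are invented for
// illustration and are not the real OTBN ERR_BITS layout.
package otbn_err_bits_sketch_pkg;
  // Software/recoverable errors (illustrative subset).
  parameter int unsigned ErrBitBadDataAddr = 0;
  parameter int unsigned ErrBitBadInsnAddr = 1;
  // Fatal errors (illustrative subset).
  parameter int unsigned ErrBitFatalImem   = 16;
  parameter int unsigned ErrBitFatalDmem   = 17;
  parameter int unsigned ErrBitFatalReg    = 18;
  // Proposed addition: incoming life cycle escalation as just another fatal error.
  parameter int unsigned ErrBitFatalLcEscalation = 19;
endpackage
```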

Motivation:

  • Consistency: Treat an incoming lifecycle escalation in the same way as any other fatal error, instead of "it's the same except for xyz".
  • Testability: I'm sure we'll have a way to trigger lifecycle escalation signals for test purposes. We then want to check if OTBN got the signal by reading a status register.

The estimate covers the spec change and RTL implementation.

estimate 4

@imphil imphil added the Component:Doc and Component:OTBN labels Aug 5, 2021
@rswarbrick
Contributor

rswarbrick commented Aug 6, 2021

Sounds like a good plan to me.

Does the system require that we don't send an alert when we get an incoming lifecycle escalation? If not, maybe we should do that as well, which would keep everything looking a bit more consistent.

@imphil
Contributor Author

imphil commented Aug 6, 2021

Does the system require that we don't send an alert when we get an incoming lifecycle escalation?

@msfschaffner Is there a requirement one way or the other? I.e. do you get into a strange loop if we send a fatal alert in reaction to a lifecycle escalation? (Probably not?)

@msfschaffner
Contributor

Thanks for bringing this up!

From a response point of view I think it does not matter much, since the system will be rendered totally non-functional upon assertion of lc_escalate_en (the life cycle state is virtually scrapped at the same time, which disables the CPU and several other things). We also have other places in the system where lc_escalate_en implicitly triggers fatal alerts due to e.g. FSMs that are moved into terminal error states.

@tjaychen @cdgori One thing we may want to think about, however, is the alert handler's cause-register "crashdump". In a scenario where the system goes through all escalation phases, many alerts are going to trigger as a result of the lc_escalate_en assertion, and these alerts will still be registered in the alert handler cause registers. This could possibly mask the "real" cause that led to escalation. So, should we take the crashdump snapshot as soon as we enter the escalation protocol, to avoid such masking?

Note that this should not be of concern when the firmware reads out the alert cause registers as a result of an NMI escalation, since at that point, lc_escalate_en has not yet been triggered.
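
As a rough illustration of the snapshot-latching idea above (not the actual alert handler implementation; all signal names are invented), the crashdump could be captured once, when the escalation protocol is entered, and then held:

```systemverilog
// Minimal sketch of the idea discussed above (signal names invented):
// latch the alert-cause snapshot once, when the escalation protocol starts,
// so later alerts caused by lc_escalate_en fan-out do not overwrite it.
module crashdump_latch_sketch #(
  parameter int unsigned NAlerts = 64
) (
  input  logic               clk_i,
  input  logic               rst_ni,
  input  logic               esc_protocol_start_i, // escalation protocol entered
  input  logic [NAlerts-1:0] alert_cause_i,        // live alert cause bits
  output logic [NAlerts-1:0] crashdump_snapshot_o
);
  logic latched_q;
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      latched_q            <= 1'b0;
      crashdump_snapshot_o <= '0;
    end else if (esc_protocol_start_i && !latched_q) begin
      // Capture the causes exactly once, at the start of escalation.
      latched_q            <= 1'b1;
      crashdump_snapshot_o <= alert_cause_i;
    end
  end
endmodule
```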

@tjaychen

tjaychen commented Aug 6, 2021

Oh, that's a good point. Yes, I think you're right: we want to capture the snapshot as we start the escalation process, and not capture the fallout from it.

@msfschaffner
Contributor

CC @cindychip

@msfschaffner
Contributor

Ok I made a separate issue for this.

@imphil imphil added this to the OTBN-S28 milestone Aug 9, 2021
@cdgori

cdgori commented Aug 9, 2021

OK, I see the issue and agree with @tjaychen - we should capture the snapshot early on and not retake it.

I also think I agree with @imphil's original proposal to unify the handling of fatal alerts from an OTBN perspective, unless there is something magical about the LC escalation that I don't understand. (As @msfschaffner noted, the system will be very quickly rendered nonfunctional in the event of this kind of escalation.)

@GregAC
Contributor

GregAC commented Aug 20, 2021

One thing to note is that this proposal would have us setting the fatal_err_lc_escalation bit in ERR_BITS even when OTBN isn't running, if we see a lifecycle escalation. This is perfectly doable; however, it clashes a bit with how we deal with bus integrity errors when OTBN isn't running (we leave ERR_BITS alone). The spec has this to say:

Note that OTBN can detect some errors even when it isn’t running. One example of this is an error caused by an integrity error when reading or writing OTBN’s memories over the bus. In this case, the ERR_BITS register will not change. This avoids race conditions with the host processor’s error handling software.

This indicates that we want to avoid setting ERR_BITS if OTBN isn't running.

So do we want to revise that restriction, or not set fatal_err_lc_escalation in ERR_BITS if OTBN isn't active?

I guess the race condition is around the software thinking ERR_BITS only changes when OTBN finishes an operation?

@rswarbrick, you wrote that line 9 months ago; can you remember a specific scenario we were worried about? @imphil, any recollections on your side?

@rswarbrick
Contributor

Yep, the point is that you don't want weird race conditions for software where ERR_BITS changes underfoot. The idea was that ERR_BITS is updated synchronously, at the end of an OTBN run, while FATAL_ALERT_CAUSE is updated asynchronously, whenever something bad happens.
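
A minimal sketch of that split update policy, with invented signal names and an arbitrary register width, just to make the two behaviors concrete (not the actual OTBN RTL):

```systemverilog
// Sketch only: ERR_BITS latches the accumulated error bits when the run
// finishes, while FATAL_ALERT_CAUSE is sticky and records fatal causes
// whenever they occur. All names and widths are invented for illustration.
module err_reg_update_sketch (
  input  logic        clk_i,
  input  logic        rst_ni,
  input  logic        op_done_i,            // the current OTBN run finishes
  input  logic [31:0] err_bits_collected_i, // error bits gathered during the run
  input  logic [31:0] fatal_cause_pulse_i,  // fatal causes, may pulse at any time
  output logic [31:0] err_bits_q_o,         // software-visible ERR_BITS
  output logic [31:0] fatal_alert_cause_q_o // software-visible FATAL_ALERT_CAUSE
);
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      err_bits_q_o          <= '0;
      fatal_alert_cause_q_o <= '0;
    end else begin
      // ERR_BITS only changes when a run completes, so software never sees it
      // change underfoot.
      if (op_done_i) begin
        err_bits_q_o <= err_bits_collected_i;
      end
      // FATAL_ALERT_CAUSE is sticky and updates regardless of the run state.
      fatal_alert_cause_q_o <= fatal_alert_cause_q_o | fatal_cause_pulse_i;
    end
  end
endmodule
```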

@tjaychen

btw @imphil @rswarbrick @GregAC
do we want to make what you're discussing here the canonical route for OT? Because blocks like keymgr and aes will eventually see something very similar.

Those blocks do not run software, but similarly they can detect errors when there is an operation running vs when it is not.

The scheme in keymgr is very similar to what you guys are describing here, i.e. ERR_CODE (my version of ERR_BITS) is set when there's an operation running, and FAULT_STATUS (my version of FATAL_ERR_CAUSE) is set asynchronously.

I feel this is worth standardizing across the project (if not naming, then at least the behavior), since this question will undoubtedly come up again and again.

Let me know what you think. We don't have to go back and change every block now, but it would be good to set up guidance for future blocks.

@tjaychen

Adding @weicaiyang since we are discussing a very similar thing in #7772.

@weicaiyang
Contributor

I feel this is worth standardizing across the project (if not naming, then at least the behavior), since this question will undoubtedly come up again and again.

Agree. If we also unify the naming (both the alert name and the CSR status name), it will really help DV.
We have automatic tests to verify these common integrity errors and shadowed register errors. If the names are different, we need to add a bunch of arrays for each DV owner to update the names. If the behaviors are different, we need to add callbacks. We have done this for integrity errors, but @cindychip is still looking for the best approach to handle the shadowed registers.

@imphil imphil modified the milestones: OTBN-S28, OTBN-S29 Aug 23, 2021
imphil added a commit to imphil/opentitan that referenced this issue Aug 24, 2021
Before, the incoming life cycle escalation handling was specified as
"like a fatal alert, except for XZY." This commit simplifies the
specification to handle an incoming life cycle escalation in the same
way as a fatal error that was detected within OTBN.

Fixes lowRISC#7625

Signed-off-by: Philipp Wagner <[email protected]>
@imphil
Contributor Author

imphil commented Aug 24, 2021

I've now opened PR #7897 to make the incoming life cycle escalation signal a "normal" fatal error/alert.

The normal fatal error behavior should already address the point @GregAC made: if a fatal error/alert (including an incoming life cycle escalation) is observed during an operation, ERR_BITS is set in addition to FATAL_ALERT_CAUSE and the operation is terminated in the usual way (which includes sending a "done" interrupt to unblock software if it is waiting for the operation to finish). If a fatal error is observed outside of an operation, only FATAL_ALERT_CAUSE is set.
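
For illustration, a minimal sketch of this resolved behavior for the life cycle escalation bit, assuming an invented busy_i signal that indicates an operation is in progress (not the actual OTBN implementation):

```systemverilog
// Sketch of the resolved behavior described above (names invented): a fatal
// cause, here lc escalation, always sets its FATAL_ALERT_CAUSE bit; it is
// only folded into ERR_BITS, and the run terminated with a "done"
// notification, when an operation is in progress.
module fatal_err_behaviour_sketch (
  input  logic clk_i,
  input  logic rst_ni,
  input  logic busy_i,                 // an OTBN operation is in progress
  input  logic fatal_lc_esc_i,         // incoming life cycle escalation (synchronised)
  output logic err_bit_lc_esc_q_o,     // lc escalation bit of ERR_BITS
  output logic alert_cause_lc_esc_q_o, // lc escalation bit of FATAL_ALERT_CAUSE
  output logic done_o                  // terminate the run and raise the done interrupt
);
  always_ff @(posedge clk_i or negedge rst_ni) begin
    if (!rst_ni) begin
      err_bit_lc_esc_q_o     <= 1'b0;
      alert_cause_lc_esc_q_o <= 1'b0;
    end else if (fatal_lc_esc_i) begin
      alert_cause_lc_esc_q_o <= 1'b1;          // always recorded
      if (busy_i) err_bit_lc_esc_q_o <= 1'b1;  // only when a run is active
    end
  end
  // Pulse "done" so software waiting on the run is unblocked.
  assign done_o = busy_i & fatal_lc_esc_i;
endmodule
```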

On the two other things which were discussed as part of this issue:

imphil added a commit that referenced this issue Aug 25, 2021
Before, the incoming life cycle escalation handling was specified as
"like a fatal alert, except for XZY." This commit simplifies the
specification to handle an incoming life cycle escalation in the same
way as a fatal error that was detected within OTBN.

Fixes #7625

Signed-off-by: Philipp Wagner <[email protected]>