Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add backtracking stack imbalance detection to regex source generator #98472

Merged
merged 5 commits into from
Feb 24, 2024

Conversation

stephentoub
Copy link
Member

Backtracking regex constructs use a stack to track state. Information about the current state of a construct (e.g. loop iteration number, alternation branch, etc.) is pushed onto the stack when the construct matches, and then if backtracking occurs and execution unwinds back to that construct, it pops state off of the stack in order to restore its knowledge of what it had done and what needs to happen next. We've had a couple of bugs over the years where this process goes awry and something incorrectly leaves extra state on the stack, which in turn makes it so that when the state is unwound, a previous construct ends up popping the wrong state off the stack, which it then misinterprets. This can lead to subtle bugs (in some cases it doesn't end up impacting matching against a particular input, in others it results in exceptions, etc.).

To help reduce the chances that such issues remain, this adds some additional debug-only code to the source generator (I didn't modify the compiler just because it's a bit more challenging there, and the source generator is what's typically used when debugging these issues). Whenever pushing state onto the stack, it also pushes a cookie. When it pops state off, it also pops off that cookie, and validates that it's the expected value.

This actually found one more lurking case of this in the code, which this PR also fixes. Conditional expressions (a rarely used feature) are supposed to treat the condition as a positive lookaround, which means it should be atomic, but if the expression contained a backtracking construct, that state was erroneously being left on the stack rather than being removed as part of wiping away the atomic expression.

Backtracking regex constructs use a stack to track state. Information about the current state of a construct (e.g. loop iteration number, alternation branch, etc.) is pushed onto the stack when the construct matches, and then if backtracking occurs and execution unwinds back to that construct, it pops state off of the stack in order to restore its knowledge of what it had done and what needs to happen next. We've had a couple of bugs over the years where this process goes awry and something incorrectly leaves extra state on the stack, which in turn makes it so that when the state is unwound, a previous construct ends up popping the wrong state off the stack, which it then misinterprets.  This can lead to subtle bugs (in some cases it doesn't end up impacting matching against a particular input, in others it results in exceptions, etc.).

To help reduce the chances that such issues remain, this adds some additional debug-only code to the source generator (I didn't modify the compiler just because it's a bit more challenging there, and the source generator is what's typically used when debugging these issues). Whenever pushing state onto the stack, it also pushes a cookie. When it pops state off, it also pops off that cookie, and validates that it's the expected value.

This actually found one more lurking case of this in the code, which this PR also fixes. Conditional expressions (a rarely used feature) are supposed to treat the condition as a positive lookaround, which means it should be atomic, but if the expression contained a backtracking construct, that state was erroneously being left on the stack rather than being removed as part of wiping away the atomic expression.
@ghost
Copy link

ghost commented Feb 15, 2024

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Backtracking regex constructs use a stack to track state. Information about the current state of a construct (e.g. loop iteration number, alternation branch, etc.) is pushed onto the stack when the construct matches, and then if backtracking occurs and execution unwinds back to that construct, it pops state off of the stack in order to restore its knowledge of what it had done and what needs to happen next. We've had a couple of bugs over the years where this process goes awry and something incorrectly leaves extra state on the stack, which in turn makes it so that when the state is unwound, a previous construct ends up popping the wrong state off the stack, which it then misinterprets. This can lead to subtle bugs (in some cases it doesn't end up impacting matching against a particular input, in others it results in exceptions, etc.).

To help reduce the chances that such issues remain, this adds some additional debug-only code to the source generator (I didn't modify the compiler just because it's a bit more challenging there, and the source generator is what's typically used when debugging these issues). Whenever pushing state onto the stack, it also pushes a cookie. When it pops state off, it also pops off that cookie, and validates that it's the expected value.

This actually found one more lurking case of this in the code, which this PR also fixes. Conditional expressions (a rarely used feature) are supposed to treat the condition as a positive lookaround, which means it should be atomic, but if the expression contained a backtracking construct, that state was erroneously being left on the stack rather than being removed as part of wiping away the atomic expression.

Author: stephentoub
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@stephentoub
Copy link
Member Author

@dotnet/area-system-text-regularexpressions, could you help review this?

Copy link
Member

@joperezr joperezr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, thanks Steve!

@stephentoub stephentoub merged commit f32c428 into dotnet:main Feb 24, 2024
109 of 111 checks passed
@stephentoub stephentoub deleted the regexstackimbalance branch February 24, 2024 10:27
@github-actions github-actions bot locked and limited conversation to collaborators Mar 26, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants