Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic AV while using ApplyUpdate #50445

Closed
stephentoub opened this issue Mar 30, 2021 · 13 comments
Closed

Sporadic AV while using ApplyUpdate #50445

stephentoub opened this issue Mar 30, 2021 · 13 comments
Assignees
Milestone

Comments

@stephentoub
Copy link
Member

Repro:
RayTracerRepro.zip

  1. Ensure you have a recent nightly build installed (https:/dotnet/installer/blob/main/README.md#installers-and-binaries)... I have 6.0.100-preview.4.21179.9.
  2. Unzip this project (an old parallel loops demo app).
  3. dotnet watch
  4. Click the parallel check box (it repros without it, but not as fast or consistently)
  5. Click the Start button. The ball should start bouncing.
  6. notepad Raytracer.cs
  7. Edit L139 (return new Color(0, 0, 0);) to change one or more of the arguments to be a double value between 0 and 1 (e.g. return new Color(0, 0, .5);. Ctrl-S to save.
  8. Repeat step 7, changing the values and saving, until this is printed out by dotnet watch:
watch : Hot reload of changes succeeded.
watch : Exited with error code -1073741819
watch : Waiting for a file to change before restarting dotnet...

The AV ends up coming from the patched method:

OS Thread Id: 0x5e54
Child SP IP Call Site
000000306157D1F8 00007FFE81620D04 Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.RayTracer.TraceRay(Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.Ray, Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.Scene, Int32)
000000306157FC00 00007ffe814a2d7d [DebuggerU2MCatchHandlerFrame: 000000306157fc00]

cc: @mikem8361, @tommcdon

@stephentoub stephentoub added this to the 6.0.0 milestone Mar 30, 2021
@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 30, 2021
@ghost
Copy link

ghost commented Mar 30, 2021

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

Issue Details

Repro:
RayTracerRepro.zip

  1. Ensure you have a recent nightly build installed (https:/dotnet/installer/blob/main/README.md#installers-and-binaries)... I have 6.0.100-preview.4.21179.9.
  2. Unzip this project (an old parallel loops demo app).
  3. dotnet watch
  4. Click the parallel check box (it repros without it, but not as fast or consistently)
  5. Click the Start button. The ball should start bouncing.
  6. notepad Raytracer.cs
  7. Edit L139 (return new Color(0, 0, 0);) to change one or more of the arguments to be a double value between 0 and 1 (e.g. return new Color(0, 0, .5);. Ctrl-S to save.
  8. Repeat step 7, changing the values and saving, until this is printed out by dotnet watch:
watch : Hot reload of changes succeeded.
watch : Exited with error code -1073741819
watch : Waiting for a file to change before restarting dotnet...

The AV ends up coming from the patched method:

OS Thread Id: 0x5e54
Child SP IP Call Site
000000306157D1F8 00007FFE81620D04 Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.RayTracer.TraceRay(Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.Ray, Microsoft.ParallelComputingPlatform.ParallelExtensions.Samples.Scene, Int32)
000000306157FC00 00007ffe814a2d7d [DebuggerU2MCatchHandlerFrame: 000000306157fc00]

cc: @mikem8361, @tommcdon

Author: stephentoub
Assignees: -
Labels:

area-Diagnostics-coreclr, untriaged

Milestone: 6.0.0

@mikem8361 mikem8361 self-assigned this Mar 30, 2021
@jkotas
Copy link
Member

jkotas commented Mar 30, 2021

I guess pMethod->Reset() in encee.cpp needs to be pMethod->ResetCodeEntryPoint() so that the reset is done atomically.

Also, delete MethodDesc::Reset if it is not used anywhere else.

@mikem8361
Copy link
Member

There is one other use: dynamicmethod.cpp line 267:

    // Reset the method desc into pristine state

    // Note: Reset has THROWS contract since it may allocate jump stub. It will never throw here
    // since it will always reuse the existing jump stub.
    pNewMD->Reset();

@mikem8361
Copy link
Member

With ResetCodeEntryPoint(), the new method with the changes doesn't seem to be used. In Stephen's demo, the background colors don't change.

@mikem8361
Copy link
Member

Adding ThreadSuspend::SuspendEE around the ApplyEditAndContinue call in ApplyUpdate seems to fix the issue or at least makes it rare. After 35 changes, it hasn't crashed with this change. Before the change, it crashed with 3 or 4 updates.

mikem8361 added a commit to mikem8361/runtime that referenced this issue Mar 30, 2021
Add thread suspend around the update API.

Issue: dotnet#50445
@jkotas
Copy link
Member

jkotas commented Mar 30, 2021

ResetCodeEntryPoint

Have you tried to step through this method to see why it does not work?

ThreadSuspend::SuspendEE around the ApplyEditAndContinue

Yes, that will make it harder to reproduce, but I do not think it would be the proper fix.

@tommcdon tommcdon removed the untriaged New issue has not been triaged by the area owner label Mar 31, 2021
@mikem8361
Copy link
Member

I'm working on stepping through ResetCodeEntryPoint but my repro using dotnet-watch make that difficult (it keeps kill off the target process I'm trying to debug). I have a very rough sketch of the calls being made in this method:

ResetCodeEntryPoint
     GetPrecode()->ResetTargetInterlocked();
           case PRECODE_FIXUP:
                 AsFixupPrecode()->ResetTargetInterlocked();

I don't know this part of the runtime (obviously) very well so I not sure what is suppose to happen. The other uses of MethodDesc::ResetCodeEntryPoint seem to do more than just call this method.

@jkotas
Copy link
Member

jkotas commented Mar 31, 2021

ResetTargetInterlocked should do the same thing as Reset, except that it should do it atomically. Reset updates the precode via multiple writes. ResetTargetInterlocked updates the precode using a single InterlockedCompareExchange. The final outcome should be identical in both cases.

@mikem8361
Copy link
Member

I stepped though both pMethod->Reset (works but crashes) and pMethod->ResetCodeEntryPoint (doesn't work) one of the differences is that in Reset this code executed:

    if (pMethod->HasNativeCodeSlot())
    {
        RelativePointer<TADDR> *pRelPtr = (RelativePointer<TADDR> *)pMethod->GetAddrOfNativeCodeSlot();
        pRelPtr->SetValueMaybeNull(NULL);
    }

Adding that after the call to ResetCodeEntryPoint causes the updated methods to be executed. It probably isn't correct either because it isn't interlocked.

UpdateMethod()
...
    pMethod->ResetCodeEntryPoint();
    if (pMethod->HasNativeCodeSlot())
    {
        RelativePointer<TADDR> *pRelPtr = (RelativePointer<TADDR> *)pMethod->GetAddrOfNativeCodeSlot();
        pRelPtr->SetValueMaybeNull(NULL);
    }

I don't understand all the details here, but could it be that ApplyUpdate is changing the method via ENC (which I think doesn't use the code version manager) and tier compilation (which does)? And yes stepping through FixupPrecode::ResetTargetInterlocked() and FixupPrecode::Init() (called from Reset()) confirm they do the same thing but it doesn't seem to be enough.

/cc @noahfalk because of his work on tier compilation.

@jkotas
Copy link
Member

jkotas commented Mar 31, 2021

Adding that after the call to ResetCodeEntryPoint causes the updated methods to be executed.

I would be ok with it as interim fix.

changing the method via ENC (which I think doesn't use the code version manager)

I guess that it is something we need to fix to make this robust and avoid intermittent crashes similar to this one.

@noahfalk
Copy link
Member

noahfalk commented Apr 1, 2021

I don't understand all the details here, but could it be that ApplyUpdate is changing the method via ENC (which I think doesn't use the code version manager) and tier compilation (which does)?

Yes, I expect that is exactly what you are seeing. Calling ResetEntryPoint() is directing the next call through the entrypoint to go into the PreStub. Once in there if the method has IsVersionable() == true then the PreStub would identify which version is now active and backpatch the precode with a pointer to that code. EnC won't take that path. I can't recall if there is a shortcut it might take by observing the NativeCodeSlot already has a valid code pointer in it, but even if it does go through all the work of jitting the updated IL, that new code will never be used because SetNativeCodeInterlocked isn't going to replace a non-null pre-existing code pointer.

jkotas pushed a commit that referenced this issue Apr 1, 2021
* Fix A/V in ApplyUpdate API

Add thread suspend around the update API.

Issue: #50445

* Remove thread suspension; switch to ResetCodeEntryPoint()
@mikem8361
Copy link
Member

I'm going to leave this issue open to address the checked build assert at

_ASSERTE(IsVersionable());

and maybe the theoretical race condition Noah mentioned.

mikem8361 added a commit to mikem8361/runtime that referenced this issue Apr 21, 2021
Added MethodDesc::ResetCodeEntryPointForEnC() that reset the entry point without
the asserts that caused the problem.

Introduced in the fix for dotnet#50445
mikem8361 added a commit that referenced this issue Apr 22, 2021
Fix Type.GetFields() after EnC/Hot reload update

Switch to the EncApproxFieldDescIterator in the GetFields() FCall.

Issue: #51517

Fix debug build assert in ApplyUpdate

Added MethodDesc::ResetCodeEntryPointForEnC() that reset the entry point without
the asserts that caused the problem.

Introduced in the fix for #50445
@ghost ghost locked as resolved and limited conversation to collaborators May 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants