solver: fix issue with double merged edges #4285

Merged
merged 1 commit into moby:master on Oct 23, 2023

Conversation

tonistiigi
Member

@tonistiigi tonistiigi commented Sep 28, 2023

fix #4278

"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain.

In that case, one edge is released and is set to point to another. The reference count for the second one is increased.

An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again.

Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed.
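A minimal sketch of the chaining described above, using simplified, hypothetical types (the real solver edge carries much more state, and the real takeOwnership also releases the old edge's result):

package main

import "fmt"

type edge struct {
	name          string
	owner         *edge // set once this edge has been merged into another
	releaserCount int
}

// takeOwnership merges old into e: instead of recording the merge in a
// second map, old simply points at e, so repeated merges form a list.
func (e *edge) takeOwnership(old *edge) {
	e.releaserCount += old.releaserCount + 1
	old.owner = e
}

func main() {
	va, vb, vc := &edge{name: "va"}, &edge{name: "vb"}, &edge{name: "vc"}
	vb.takeOwnership(vc) // vc is merged into vb
	va.takeOwnership(vb) // vb, which already owns vc, is itself merged into va
	fmt.Println(vc.owner.name, vc.owner.owner.name) // vb va
	fmt.Println(va.releaserCount)                   // 2: va now releases on behalf of vb and vc
}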

"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain.

In that case, one edge is released and is set to point to another. The reference count for the second one is increased.

An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again.

Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed.

Signed-off-by: Tonis Tiigi <[email protected]>
@tonistiigi tonistiigi marked this pull request as ready for review September 28, 2023 01:03
Collaborator

@sipsma sipsma left a comment

I get the problem and the general idea of the solution, but I'd be lying if I said I was confident in my understanding of the overall code that is connected to the changes here 🙂

Along those lines, this feels like it would greatly benefit from unit tests if at all possible.

func (e *edge) takeOwnership(old *edge) {
	e.releaserCount += old.releaserCount + 1
	old.owner = e
	old.releaseResult()

Collaborator

Question for my own understanding: in the scheduler, there's a loop which calls dispatch on edges. Once we've done a merge from the old edge -> the target edge, will that mean that dispatch is only ever called on the target edge (or, if the target itself has owners, the "ultimate" owner edge after following the linked list) and never called on the old edge anymore?

Just wondering because, if true, that would massively simplify trying to understand the overall logic. I was initially concerned that we make this old.releaseResult() call but then there's still a ton of methods on edge that access e.result without checking whether e.owner != nil, which felt like an avenue to accidentally using an already-released result.

But, if we only ever call dispatch on edges with e.owner == nil, I think most of those concerns go away since all those other methods should never be called once e.owner != nil.

Member Author

the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?

Yes, the scheduler only accesses edges via the getEdge() method, and in this PR I have added an owner check to that method so that once the owner is set, it is returned instead of the old definition.
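A rough sketch of what "return the owner instead of the old definition" looks like, with hypothetical names and a plain map standing in for the scheduler's real lookup state:

package main

import "fmt"

type activeEdge struct {
	name  string
	owner *activeEdge // non-nil once this edge has been merged into another
}

type edgeIndex struct {
	edges map[string]*activeEdge
}

// getEdge returns the currently active edge for a key. Once an owner is
// set, the owner (following the whole chain) is returned instead of the
// old definition, so the scheduler never dispatches a released edge.
func (ix *edgeIndex) getEdge(key string) *activeEdge {
	e, ok := ix.edges[key]
	if !ok {
		return nil
	}
	for e.owner != nil {
		e = e.owner
	}
	return e
}

func main() {
	va := &activeEdge{name: "va"}
	vb := &activeEdge{name: "vb", owner: va}
	vc := &activeEdge{name: "vc", owner: vb}
	ix := &edgeIndex{edges: map[string]*activeEdge{"va": va, "vb": vb, "vc": vc}}
	fmt.Println(ix.getEdge("vc").name) // va: the owner chain is followed to the end
}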

Collaborator

Cool, thanks, that helps.

I really think we should add one or more unit tests for this case if at all possible, partly for regression coverage but also because unit tests are pretty helpful for understanding the expected behavior.

Other than that LGTM

Member Author

I'm not sure how tricky it is to repro this with a unit test. You need to synchronize the requests not only to cause the triple edge merge, but also to trigger the evaluation of the second one only after the triple merge has already happened.

@alexcb Maybe your repro can be turned into an integration test?

Collaborator

I can definitely give it a try -- I'll use

func TestMergedEdgesLookup(t *testing.T) {
as inspiration.

Collaborator

still no luck triggering this via a test; I've been hacking at this function in a branch based off master (i.e. without this fix)

here's what I have so far: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/solver/scheduler_test.go#L3081-L3163 (with the intention of moving it to a new function if I succeed in reproducing it)

I've added in printfs to the scheduler to trace the order needed to reproduce the problem. Here's an example of the error: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/prints-from-reproduction-program

but unfortunately my test isn't triggering the same code path -- one noticeable difference is that in the test, I only see newRequestWithFunc called for ... once per edge, when I believe I need it to occur twice in order to trigger the bad merge.

I'll keep hacking at the test, but wanted to share an update in case there are any obvious parts my test is skipping.

Collaborator

I was able to spend some more time with the scheduler_test.go code -- it's pretty neat.

I created some functions to make it easier to wait on cacheKey() returning, and am able to create some edge merges more easily, but I still haven't been able to reproduce the multi edge merge.

here's the updated attempt: https://github.com/alexcb/buildkit/blob/e6da21940e81a000a56c06a59664bd9ab28ea480/solver/scheduler_test.go#L3123

I'm running out of steam on it though, and I suspect I'm missing something timing-related.

Member Author

I think the problem is that it's not enough to create the merge case: there also needs to be a vertex that depends on the edge that gets merged, and that edge needs to reach the completed state only after the merge has happened. Merging itself does not trigger the bug. What triggers it is a new request coming into the vertex that already had the 3-level merge and should never unpark again.

Member

Ok, while digging into #4347 (comment), I found something that seems to contradict something here.

the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?

Yes, the scheduler only accesses edges via getEdge() method, and I have added owner check into that method in this PR so that once the owner is set, it is returned instead of the old definition.

My understanding of this is that dispatch should never be called with an *edge that has an owner. However, I do seem to see this in the logs (using my little investigation branch from my linked comment, master...jedevc:buildkit:wip-repro-the-deadlock):

>>> sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383
    sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383 owned by sha256:ccb9d6e0f552230dba1da5f989d813984ddb3e5a44cf674e5df1c0e4524f4463

Is my understanding wrong here? Is an edge with an owner a valid input to dispatch? And if it isn't, we should panic when we detect it, since some internal assumption is being violated.
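If that invariant is meant to hold, a guard along these lines would surface the violation immediately instead of silently scheduling a released edge. This is only a sketch with simplified, hypothetical types, not an actual patch:

package main

import "fmt"

// Hypothetical stand-ins for the solver's edge and scheduler types.
type edge struct{ owner *edge }

type scheduler struct{}

func (s *scheduler) dispatch(e *edge) {
	// Dispatching an edge that has been merged away would violate the
	// assumption discussed above, so fail loudly instead of continuing.
	if e.owner != nil {
		panic("scheduler: dispatch called on an edge that has an owner")
	}
	fmt.Println("dispatching active edge")
}

func main() {
	s := &scheduler{}
	target := &edge{}
	merged := &edge{owner: target}
	s.dispatch(target) // fine
	s.dispatch(merged) // panics: the invariant has been violated
}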

Collaborator

@sipsma sipsma left a comment

Seems like a test may be pretty tough in practice, so I don't want to block this on that.

alexcb added a commit to earthly/earthly that referenced this pull request Oct 19, 2023
cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from
moby/buildkit#4285

fixes #2957

Signed-off-by: Alex Couture-Beil <[email protected]>
alexcb added a commit to earthly/earthly that referenced this pull request Oct 20, 2023
cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from
moby/buildkit#4285

fixes #2957

Signed-off-by: Alex Couture-Beil <[email protected]>
@tonistiigi
Member Author

Going to bring this in. If we can get a test, we can post it as a follow-up.

@tonistiigi tonistiigi merged commit 20847b7 into moby:master Oct 23, 2023
56 checks passed
@jedevc
Member

jedevc commented Jan 19, 2024

I looked at writing a test that can repro this and got some of the way there:

func TestMergedEdgesFoo(t *testing.T) {
	t.Parallel()

	for i := 0; i < 1; i++ {
		ctx := context.TODO()

		cacheManager := newTrackingCacheManager(NewInMemoryCacheManager())

		l := NewSolver(SolverOpt{
			ResolveOpFunc: testOpResolver,
			DefaultCache:  cacheManager,
		})
		defer l.Close()

		j0, err := l.NewJob("j0")
		require.NoError(t, err)

		defer func() {
			if j0 != nil {
				j0.Discard()
			}
		}()

		va := vtxAdd(2, vtxOpt{name: "va", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vb := vtxAdd(2, vtxOpt{name: "vb", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vc := vtxAdd(2, vtxOpt{name: "vc", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})

		g0 := Edge{
			Vertex: vtxSum(1, vtxOpt{inputs: []Edge{
				{Vertex: va, Index: 0},
				{Vertex: vb, Index: 0},

				{Vertex: vb, Index: 1},
				{Vertex: vc, Index: 1},
			}}),
		}
		g0.Vertex.(*vertexSum).setupCallCounters()

		res, err := j0.Build(ctx, g0)
		require.NoError(t, err)
		_ = res

		require.NoError(t, j0.Discard())
		j0 = nil
	}
}

I can get the 3-level merge:

BUILDKIT_SCHEDULER_DEBUG=1 go test ./solver --run TestMergedEdgesFoo -v | grep merging
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-3-s62ijeynyh0qamyxctsggk0p9 to const-3-hj0wj2wvpr9pxss71hox0xogf\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-4-3ujh8mx4nc9f56wq6pkgx7hd8 to const-4-9xp237wkcxg5hidh1a3ruvsq9\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-3-vnynvuegh5g4j8hpe47yk03k6 to const-3-hj0wj2wvpr9pxss71hox0xogf\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-4-pwmpxjt0ts5henrsm4hafcge4 to const-4-9xp237wkcxg5hidh1a3ruvsq9\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge vb to va\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge vc to vb\n"

After this, vc->vb->va, but the test still succeeds. I'm not quite sure if there's something else I need to do; is this triggered by releasing/discarding results in a specific order?
