solver: fix issue with double merged edges #4285

Merged
merged 1 commit into moby:master on Oct 23, 2023

Conversation

tonistiigi
Member

@tonistiigi tonistiigi commented Sep 28, 2023

fix #4278

"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain.

In that case, one edge is released and is set to point to another. The reference count for the second one is increased.

An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again.

Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed.
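A minimal sketch of the chaining described above, using simplified, hypothetical types (the real solver edge carries much more state, and the real takeOwnership also releases the old edge's result):

package main

import "fmt"

type edge struct {
	name          string
	owner         *edge // set once this edge has been merged into another
	releaserCount int
}

// takeOwnership merges old into e: instead of recording the merge in a
// second map, old simply points at e, so repeated merges form a list.
func (e *edge) takeOwnership(old *edge) {
	e.releaserCount += old.releaserCount + 1
	old.owner = e
}

func main() {
	va, vb, vc := &edge{name: "va"}, &edge{name: "vb"}, &edge{name: "vc"}
	vb.takeOwnership(vc) // vc is merged into vb
	va.takeOwnership(vb) // vb, which already owns vc, is itself merged into va
	fmt.Println(vc.owner.name, vc.owner.owner.name) // vb va
	fmt.Println(va.releaserCount)                   // 2: va now releases on behalf of vb and vc
}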

"Merged edges" is an optimization in the solve graph where two different active LLB edges can be combined into one after cache key computation finds that they generated equivalent cache chain.

In that case, one edge is released and is set to point to another. The reference count for the second one is increased.

An issue was discovered where an edge that was already pointing to another edge became a target to the third one. The current implementation did not handle the case where an edge that already had a different target itself became a target edge as well. This resulted in an inconsistent graph state where edges that were thought to be released could get scheduled again.

Instead of setting the same edge value to two different maps, the new logic is to chain the edges into a linked list that should handle multiple levels of targets. This slightly increases the number of lookups, but "merged edges" case is very rare anyway, and a couple of extra pointer dereferences do not affect build speed.

Signed-off-by: Tonis Tiigi <[email protected]>
@tonistiigi tonistiigi marked this pull request as ready for review September 28, 2023 01:03
Collaborator

@sipsma sipsma left a comment

I get the problem and the general idea of the solution, but I'd be lying if I said I was confident in my understanding of the overall code that is connected to the changes here 🙂

Along those lines, this feels like it would greatly benefit from unit tests if at all possible.

func (e *edge) takeOwnership(old *edge) {
	e.releaserCount += old.releaserCount + 1
	old.owner = e
	old.releaseResult()

Collaborator

Question for my own understanding: in the scheduler, there's a loop which calls dispatch on edges. Once we've done a merge from the old edge -> the target edge, will that mean that dispatch is only ever called on the target edge (or, if the target itself has owners, the "ultimate" owner edge after following the linked list) and never called on the old edge anymore?

Just wondering because, if true, that would massively simplify trying to understand the overall logic. I was initially concerned that we make this old.releaseResult() call but then there's still a ton of methods on edge that access e.result without checking whether e.owner != nil, which felt like an avenue to accidentally using an already-released result.

But, if we only ever call dispatch on edges with e.owner == nil, I think most of those concerns go away since all those other methods should never be called once e.owner != nil.

Member Author

the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?

Yes, the scheduler only accesses edges via the getEdge() method, and in this PR I have added an owner check to that method so that once the owner is set, it is returned instead of the old definition.
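A rough sketch of what "return the owner instead of the old definition" looks like, with hypothetical names and a plain map standing in for the scheduler's real lookup state:

package main

import "fmt"

type activeEdge struct {
	name  string
	owner *activeEdge // non-nil once this edge has been merged into another
}

type edgeIndex struct {
	edges map[string]*activeEdge
}

// getEdge returns the currently active edge for a key. Once an owner is
// set, the owner (following the whole chain) is returned instead of the
// old definition, so the scheduler never dispatches a released edge.
func (ix *edgeIndex) getEdge(key string) *activeEdge {
	e, ok := ix.edges[key]
	if !ok {
		return nil
	}
	for e.owner != nil {
		e = e.owner
	}
	return e
}

func main() {
	va := &activeEdge{name: "va"}
	vb := &activeEdge{name: "vb", owner: va}
	vc := &activeEdge{name: "vc", owner: vb}
	ix := &edgeIndex{edges: map[string]*activeEdge{"va": va, "vb": vb, "vc": vc}}
	fmt.Println(ix.getEdge("vc").name) // va: the owner chain is followed to the end
}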

Collaborator

Cool, thanks, that helps.

I really think we should add one or more unit tests for this case if at all possible, partly for regression coverage but also because unit tests are pretty helpful for understanding the expected behavior.

Other than that LGTM

Member Author

I'm not sure how tricky it is to repro this with a unit test. You need to synchronize the requests not only to cause the triple edge merge, but also to trigger the evaluation of the second one only after the triple merge has already happened.

@alexcb Maybe your repro can be turned into an integration test?

Collaborator

I can definitely give it a try -- I'll use

func TestMergedEdgesLookup(t *testing.T) {
as inspiration.

Collaborator

still no luck triggering this via a test; I've been hacking at this function in a branch based off master (i.e. without this fix)

here's what I have so far: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/solver/scheduler_test.go#L3081-L3163 (with the intention of moving it to a new function if I succeed in reproducing it)

I've added in printfs to the scheduler to trace the order needed to reproduce the problem. Here's an example of the error: https://github.com/alexcb/buildkit/blob/507cc41e4aa901d229e1fa97d65de998120142ff/prints-from-reproduction-program

but unfortunately my test isn't triggering the same code path -- one noticeable difference is that in the test, I only see newRequestWithFunc called for ... once per edge, when I believe I need it to occur twice in order to trigger the bad merge.

I'll keep hacking at the test, but wanted to share an update in case there are any obvious parts my test is skipping.

Collaborator

I was able to spend some more time with the scheduler_test.go code -- it's pretty neat.

I created some functions to make it easier to wait on cacheKey() returning, and am able to create some edge merges more easily, but I still haven't been able to reproduce the multi edge merge.

here's the updated attempt: https://github.com/alexcb/buildkit/blob/e6da21940e81a000a56c06a59664bd9ab28ea480/solver/scheduler_test.go#L3123

I'm running out of steam on it though, and I suspect I'm missing something timing-related.

Member Author

I think the problem is that it's not enough to create the merge case: there also needs to be a vertex that depends on the edge that gets merged, and that edge needs to reach the completed state only after the merge has happened. Merging itself does not trigger the bug. What triggers it is a new request coming into the vertex that already had the 3-level merge and should never unpark again.

Member

Ok, while digging into #4347 (comment), I found something that seems to contradict something here.

the "ultimate" owner edge after following the linked list) and never be called on old edge anymore?

Yes, the scheduler only accesses edges via getEdge() method, and I have added owner check into that method in this PR so that once the owner is set, it is returned instead of the old definition.

My understanding of this is that dispatch should never be called with an *edge that has an owner. However, I do seem to see this in the logs (using my little investigation branch from my linked comment, master...jedevc:buildkit:wip-repro-the-deadlock):

>>> sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383
    sha256:614be6d44883c15260f9ce29e2feb9fc4f83bf8467744434f2e6152a747ab383 owned by sha256:ccb9d6e0f552230dba1da5f989d813984ddb3e5a44cf674e5df1c0e4524f4463

Is my understanding wrong here? Is an edge with an owner a valid input to dispatch? And if it isn't, we should panic when we detect it, since some internal assumption is being violated.
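If that invariant is meant to hold, a guard along these lines would surface the violation immediately instead of silently scheduling a released edge. This is only a sketch with simplified, hypothetical types, not an actual patch:

package main

import "fmt"

// Hypothetical stand-ins for the solver's edge and scheduler types.
type edge struct{ owner *edge }

type scheduler struct{}

func (s *scheduler) dispatch(e *edge) {
	// Dispatching an edge that has been merged away would violate the
	// assumption discussed above, so fail loudly instead of continuing.
	if e.owner != nil {
		panic("scheduler: dispatch called on an edge that has an owner")
	}
	fmt.Println("dispatching active edge")
}

func main() {
	s := &scheduler{}
	target := &edge{}
	merged := &edge{owner: target}
	s.dispatch(target) // fine
	s.dispatch(merged) // panics: the invariant has been violated
}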

Collaborator

@sipsma sipsma left a comment

Seems like a test may be pretty tough in practice, so I don't want to block this on that.

alexcb added a commit to earthly/earthly that referenced this pull request Oct 19, 2023
cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from
moby/buildkit#4285

fixes #2957

Signed-off-by: Alex Couture-Beil <[email protected]>
alexcb added a commit to earthly/earthly that referenced this pull request Oct 20, 2023
cherry-picked 100d3cb6b6903be50f7a3e5dba193515aa9530fa from
moby/buildkit#4285

fixes #2957

Signed-off-by: Alex Couture-Beil <[email protected]>
@tonistiigi
Member Author

Going to bring this in. If we can get a test, we can post it as a follow-up.

@tonistiigi tonistiigi merged commit 20847b7 into moby:master Oct 23, 2023
56 checks passed
@jedevc
Member

jedevc commented Jan 19, 2024

I looked at writing a test that can repro this and got some of the way there:

func TestMergedEdgesFoo(t *testing.T) {
	t.Parallel()

	for i := 0; i < 1; i++ {
		ctx := context.TODO()

		cacheManager := newTrackingCacheManager(NewInMemoryCacheManager())

		l := NewSolver(SolverOpt{
			ResolveOpFunc: testOpResolver,
			DefaultCache:  cacheManager,
		})
		defer l.Close()

		j0, err := l.NewJob("j0")
		require.NoError(t, err)

		defer func() {
			if j0 != nil {
				j0.Discard()
			}
		}()

		va := vtxAdd(2, vtxOpt{name: "va", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vb := vtxAdd(2, vtxOpt{name: "vb", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})
		vc := vtxAdd(2, vtxOpt{name: "vc", inputs: []Edge{
			{Vertex: vtxConst(3, vtxOpt{})},
			{Vertex: vtxConst(4, vtxOpt{})},
		}})

		g0 := Edge{
			Vertex: vtxSum(1, vtxOpt{inputs: []Edge{
				{Vertex: va, Index: 0},
				{Vertex: vb, Index: 0},

				{Vertex: vb, Index: 1},
				{Vertex: vc, Index: 1},
			}}),
		}
		g0.Vertex.(*vertexSum).setupCallCounters()

		res, err := j0.Build(ctx, g0)
		require.NoError(t, err)
		_ = res

		require.NoError(t, j0.Discard())
		j0 = nil
	}
}

I can get the 3-level merge:

BUILDKIT_SCHEDULER_DEBUG=1 go test ./solver --run TestMergedEdgesFoo -v | grep merging
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-3-s62ijeynyh0qamyxctsggk0p9 to const-3-hj0wj2wvpr9pxss71hox0xogf\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-4-3ujh8mx4nc9f56wq6pkgx7hd8 to const-4-9xp237wkcxg5hidh1a3ruvsq9\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-3-vnynvuegh5g4j8hpe47yk03k6 to const-3-hj0wj2wvpr9pxss71hox0xogf\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge const-4-pwmpxjt0ts5henrsm4hafcge4 to const-4-9xp237wkcxg5hidh1a3ruvsq9\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge vb to va\n"
time="2024-01-19T11:05:59Z" level=debug msg="merging edge vc to vb\n"

After this, vc->vb->va, but the test still succeeds. I'm not quite sure if there's something else I need to do; is this triggered by releasing/discarding results in a specific order?
