Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dockerfile2llb: filter unused paths for named contexts #4161

Merged
merged 1 commit into from
Dec 11, 2023

Conversation

jsternberg
Copy link
Collaborator

@jsternberg jsternberg commented Aug 17, 2023

This changes the llb generation in dockerfile2llb to add the
llb.FollowPaths option when named contexts are used in the same way
that happens with the default context.

This means that dockerfiles that look like this:

FROM scratch
COPY --from=mynamedctx ./a.txt /a.txt

Will only load the file a.txt. This behavior is consistent with the
default context.

This also removes the unused []llb.LocalOption from ContextOpt and
replaces it with an async counterpart. This is needed because we
initialize the named context before we know which paths are needed for
the filter.

Fixes docker/buildx#1765.

@jsternberg
Copy link
Collaborator Author

Want some early feedback on this, but leaving it as a draft as it presently only works when mynamedctx is an already defined stage that's being overwritten and doesn't work when it's defined only in the command line. I think I already know where the issue is, but I didn't originally modify that section because I wasn't quite sure what it did.

Comment on lines 835 to 808
if err == nil {
if len(cmd.sources) == 0 {
for _, src := range c.SourcePaths {
d.ctxPaths[path.Join("/", filepath.ToSlash(src))] = struct{}{}
}
} else if source := cmd.sources[0]; source.namedCtxState != nil {
for _, src := range c.SourcePaths {
source.namedCtxState.includePath(src)
}
Copy link
Member

@jedevc jedevc Aug 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this logic would need to be replicated to the instructions.AddCommand branch as well.

Also, ideally we could somehow just have a single combined struct that wraps both the client main context, and a named context.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be replicated to ADD. ADD doesn't seem to support the --from option. The CopyCommand has a From option while AddCommand doesn't seem to have that.

@@ -135,26 +135,86 @@ func ListTargets(ctx context.Context, dt []byte) (*targets.List, error) {
return l, nil
}

type namedContextState struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something like filteredLocalState or similar?

I'm not sure about including an Image field in here, I get the logic, but it's not really connected with what paths get pulled from the client, it shouldn't affect the contents of the image data. If it gets split out, it feels like a separate refactor IMO, so if there's really a good reason to do this, it makes sense as a separate commit.

copt := nc.copt
if len(nc.includePaths) > 0 {
if includePatterns := normalizeContextPaths(nc.includePaths); includePatterns != nil {
copt.LocalOpts = append(copt.LocalOpts, llb.FollowPaths(includePatterns))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copt.LocalOpts will get modified on the original here, which I imagine isn't the intent (given the above comment)?

copt.LocalOpts = append(copt.LocalOpts, llb.FollowPaths(includePatterns))
}
}
st, img, err := client.NamedContext(ctx, nc.name, copt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to avoid multiple calls to NamedContext. Some branches in NamedContext do config resolving etc.

What if dockerui/client.NamedContext() returns a struct that has State/Image properties and SetLocalOpts([]llb.LocalOption) method. CurrentContextOpt.LocalOpts seems to be unused and can be replaced with the new method.

Calling SetLocalOpts will change the definition of State, either because the internal llb.Local is under mutableOutput or inside llb.Async(). This is only for the llb.Local definition though and doesn't rerun any other branches of NamedContext().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that could work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do that, would it make sense to also return that same struct from the MainContext method as well? That way we can use the same way of updating the opts later on.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One difference is that MainContext doesn't ever return an Image (at least atm). Otherwise, I think if it is easy and unlikely to break something, we can do it now. Otherwise, we can do the namedcontext for now and look at the case for maincontext in a follow-up.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I was thinking about was what if we just returned it as part of the output similar to mutableOutput? Something like:

type localOutput struct {
    name string
    opts []llb.LocalOpts
}

Then have it create the vertex in an async manner. It should then be possible to inspect the output for the mounted output and update the local options.

I think this could be coded similar to Async or possibly coded the way async is coded.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand where/how that localOutput would be used.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Borrowing heavily from async:

type LocalContext struct {
	name string
	opts []llb.LocalOption
}

func (l *LocalContext) ToInput(ctx context.Context, c *llb.Constraints) (*pb.Input, error) {
	out := l.Do(ctx, c)
	return out.ToInput(ctx, c)
}

func (l *LocalContext) Vertex(ctx context.Context, c *llb.Constraints) llb.Vertex {
	out := l.Do(ctx, c)
	return out.Vertex(ctx, c)
}

func (l *LocalContext) Do(ctx context.Context, c *llb.Constraints) llb.Output {
	st := llb.Local(l.name, l.opts...)
	return st.Output()
}

func main() {
	st1 := llb.NewState(&LocalContext{
		name: "context",
	})

	st2 := st1.Dir("/src")
        // We can modify the local options by inspecting the state output.
	if lc, ok := st2.Output().(*LocalContext); ok {
		lc.opts = append(lc.opts, llb.WithCustomName("internal"))
	}

	def, err := st2.Marshal(context.TODO())
	if err != nil {
		panic(err)
	}
	_ = llb.WriteTo(def, os.Stdout)
}

The intention is that the areas where we have to modify the local context, such as this one:

if err == nil && len(cmd.sources) == 0 {
	for _, src := range c.SourcePaths {
		d.ctxPaths[path.Join("/", filepath.ToSlash(src))] = struct{}{}
	}
}

This could be modified to:

if err == nil {
	if len(cmd.sources) == 0 {
		for _, src := range c.SourcePaths {
			d.ctxPaths[path.Join("/", filepath.ToSlash(src))] = struct{}{}
		}
	} else if lc, ok := cmd.sources[0].state.Output().(*LocalContext); ok {
		// see above
	}
}

The benefit of this is we could also end up consolidating how the local options are handled so the default context and named contexts aren't different from each other.

I'll play around with the code though. I do have concerns about this being a bit too complicated to follow. Modifying the client to return a struct may end up just being easier.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decently certain it's possible to do this using Async as long as I can modify the values the client returns.

At the moment, the only way to use async for a source is to do llb.Scratch().Async(...) and ignore the input. Are we potentially interested in adding an llb.Async top-level function? I could use it and can add it as part of this change if that's helpful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

llb.Scratch().Async(...) and llb.Async should be equivalent. The previous state (scratch in your case) is just what gets sent to the callback as the previous state value. The callback can just ignore it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea they're equivalent. I was wondering if we wanted to formally add an llb.Async that is a shortcut for the above.

@jsternberg
Copy link
Collaborator Author

I've updated this PR with the feedback. This now uses Async to produce the local source. Other paths aren't modified now and the named context isn't created multiple times.

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(follow-up) Maybe after tonistiigi/fsutil#167 we can write an integration test for this as well.

)
return &st, nil, nil

// This is a weird invocation and requires an explanation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Skip this first sentence.

The comment should more describe the behavior. Eg. "It is possible to modify the filter applied to local files later via SetLocalOpts method so the Local is wrapped in Async that only gets called during marshaling".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this entire area can be updated to something else, I don't think this comment would be helpful. The sentence was meant to explain why new([]llb.LocalOption) but I'm going to remove that anyway.

@@ -778,6 +807,7 @@ type dispatchState struct {
buildArgs []instructions.KeyValuePairOptional
commands []command
ctxPaths map[string]struct{}
namedCtxState *namedContextState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid the namedContextState type. I would define includePaths map[string]struct{} in here and explicitly call SetLocalOpts like you had in the previous version to make it clear what is the order of gathering paths and marshaling. If you need a helper function then that is fine but I think normalizeContextPaths already does the job.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea this makes sense.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember why I needed to add this. The reason is I needed a place to store the returned NamedContext so I could call SetLocalOpts. If I only have includePaths in the dispatch state, it isn't able to use that information to invoke SetLocalOpts on the original NamedContext.

I could make this part of NamedContext instead of creating a new type. But, I'd have to add a method like:

func (nc *NamedContext) FollowPath(src string)

SetLocalOpts would end up being too low level. I'd probably need to then update the API if/when tonistiigi/fsutil#167 gets merged.

Since I'd need to store NamedContext in the dispatch state somewhere, I'd need to add two fields instead of one and it would be non-obvious that the purpose of includePaths was only for feeding into NamedContext and has no other purpose.

// read from the anonymous function passed to async. In order to ensure we're
// referencing the same slice, we allocate a pointer to a slice rather than just
// the slice which would be passed by value.
customOpts := new([]llb.LocalOption)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this needs to be a pointer. You can just do:

var nc NamedContext
st = llb.Async(func() { if nc.localOpts {} })
nc.State = st
return &nc

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally this was going to be a slice that was modified. I think that doesn't really make sense anymore so I'll switch this up. It may be good to make it a bit more explicit though that the pointer is the one being captured. I think go should figure it out automatically, but just in case it would probably be better to do nc := &NamedContext{}.

@jsternberg
Copy link
Collaborator Author

Need to do some testing but responded to the existing comments.

@jsternberg
Copy link
Collaborator Author

Did some testing and I think this is likely good to go. There's on case that I'm concerned about doing incorrectly, which is when the named context is the subject of FROM rather than referenced as a mount point, where it shouldn't do any filtering.

Before merging this, it might be beneficial to write an integration test for this that ensures the data transfer is the same between the build context and a named context. I think #4165 could be beneficial for this since it will be easier to find the number of bytes transferred.

Copy link
Member

@tonistiigi tonistiigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a Dockerfile like

from scratch
copy foo .
copy --from=base bar .
buildctl build --frontend dockerfile.v0 --local dockerfile=. --local context=. --local base=. --opt context:base=local:base

this currently does not work.

/tmp/a # ls -lh
total 102M
-rw-r--r--    1 root     root          47 Aug 25 05:22 Dockerfile
-rw-r--r--    1 root     root           4 Aug 25 05:21 bar
-rw-r--r--    1 root     root           3 Aug 25 05:21 foo
-rw-r--r--    1 root     root      102.0M Aug 25 05:21 rand

 => [context base] load from client  
 => => transferring base: 106.95MB

If I add FROM alpine AS base as first line that seems to correct it.

Make sure to take this out of draft as well if it is ready.

For the integration test I think tonistiigi/fsutil#167 would be better than trying to read the progress. That would allow defining context as virtual fs where the test can then see what exact files were read from it. Even if we read progress then in the test we can just access the client progress types directly and don't need to convert to json and back.

continue
}

if includePatterns := normalizeContextPaths(d.usedPaths); includePatterns != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not directly for this PR but looks like the includePatterns != nil check in 571 might not be completely correct and should be len(includePatterns) != 0 . In here it doesn't matter because there is a separate len check but it could be avoided if we just do the same len check in here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed that. I'll go through and fix that. The way that Go handles empty slice and nil slices as different things isn't something I'm the biggest fan of so I usually default to using len() as the way of checking slices in general. I wanted to copy the existing code though for this situation. I'll change both to use length checking.

@@ -217,6 +213,10 @@ func toDispatchState(ctx context.Context, dt []byte, opt ConvertOpt) (*dispatchS

allDispatchStates := newDispatchStates()

// Store the mapping between a dispatchState and named contexts so we can set the local
// options later.
namedContexts := make(map[*dispatchState]*dockerui.NamedContext)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could just define namedContext *dockerui.NamedContext in *dispatchState and avoid this extra map.

d.platform = platform
// Do not store the named context when it is being used as a base image.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if d.base is a namedContext already then the filter should also be disabled for it.

One way to do it if you also remove the namedContexts map would be to initialize usedPaths map on setting dispatchState.namedContext and then in here if you detect that it is used as base image then set usedPaths = nil what would mean that filtering is disabled.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely certain if this comment is even correct or relevant. I need to check on that today. When I was attempting to test it, I ran into a panic in the scenario that I was concerned about.

Another way of doing this might be to add "/" as a used path when I detect it's being used as a base image. Setting it to nil works, but I feel like it's too much state to keep in your head and too many null pointers that could result in a panic. If we add "/" as a used path, then normalizeContextPaths will return nil and the local option won't be set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A thing I found to work is disabling the context filtering when the associated dispatch state has commands associated with it. I figure this works well because even if an external stage is using only path /a.txt, the stage itself is using the entire base image so I just have commands disabling the filter for named context in general.

When COPY --from is used, the implicit stages that are created don't have any commands in them so it's safe to filter them.

@jsternberg jsternberg force-pushed the filtered-named-context branch 2 times, most recently from b9c1126 to c51b76e Compare September 12, 2023 19:41
@jsternberg jsternberg marked this pull request as ready for review September 12, 2023 19:41
@jsternberg jsternberg force-pushed the filtered-named-context branch 2 times, most recently from ec3a120 to 84e1591 Compare September 14, 2023 20:39
@tonistiigi
Copy link
Member

PTAL CI errors

// Propagate the set of paths from this dispatchState to the base
// dispatchState to enable filtering local contexts based on the set
// of used paths.
ds.paths = mergePaths(ds.paths, d.paths)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't it a problem that this runs before dispatchRun.

I'm not sure how a path added in the dispatchRun of a child stage would affect the dispatchState of the parent that has namedContext set.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I should probably change the name of this. I didn't think of a better name yesterday.

The intention of this method is to ensure that ds.paths and d.paths refer to the same location in memory. At this point, ds.paths is probably empty but it's just making sure that everything is present in case it wasn't. This happens before dispatchRun because it ensures that all of the maps relate to the same map.

I think unifyPaths is probably a better name now that I had a night to sleep on it.

@jsternberg jsternberg force-pushed the filtered-named-context branch 3 times, most recently from 78d6031 to 286f5b9 Compare September 19, 2023 19:19
@jsternberg jsternberg marked this pull request as draft September 19, 2023 19:25
@jsternberg jsternberg force-pushed the filtered-named-context branch 6 times, most recently from 6b08069 to 250f926 Compare September 20, 2023 17:51
@jsternberg jsternberg marked this pull request as ready for review September 20, 2023 18:07
@jsternberg
Copy link
Collaborator Author

Added a test. I think this is good for review now. I would agree that the test for this would be easier if one of two things was true.

  1. I could test the underlying LLB output.
  2. I could mock the filesystem.

I saw @jedevc had a PR to get that wired up so I can return to this once that's merged. I might also try to experiment with testing underlying LLB structures.

@@ -809,12 +841,24 @@ func (dss *dispatchStates) addState(ds *dispatchState) {
if d, ok := dss.statesByName[ds.stage.BaseName]; ok {
ds.base = d
ds.outline = d.outline.clone()
// Unify the set of paths so both this state and the parent state
// refer to the same map when recording path usage.
unifyPaths(&ds.paths, d.paths)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed that just set a variable in here if the extra hacks are not really needed.

@@ -876,6 +920,9 @@ func dispatchRun(d *dispatchState, c *instructions.RunCommand, proxy *llb.ProxyE

customname := c.String()

// Run command can potentially access any file. Mark the full filesystem as used.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any command that changes LLB should need this, not just RUN. Also, if the stage is part of the target then it needs all the files.

FROM foo AS image_source
COPY --from=alpine / /
RUN cat /foo > /bar

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a case that also checks the alias case so that the unifyPaths gets called.

This changes the llb generation in `dockerfile2llb` to add the
`llb.FollowPaths` option when named contexts are used in the same way
that happens with the default context.

This means that dockerfiles that look like this:

```
FROM scratch
COPY --from=mynamedctx ./a.txt /a.txt
```

Will only load the file `a.txt`. This behavior is consistent with the
default context.

This also removes the unused []llb.LocalOption from ContextOpt and
replaces it with an async counterpart. This is needed because we
initialize the named context before we know which paths are needed for
the filter.

Signed-off-by: Jonathan A. Sternberg <[email protected]>
@jsternberg
Copy link
Collaborator Author

Updated this PR and it should now pass the tests.

Some changes from the original version:

  • unifyPaths was removed and the propagation of the paths was also moved. It's now in the same area where the other properties from the base image are propagated in the main loop.
  • I was unable to use llb.Async because llb.Async was triggered when a later section used AddEnv. Internally, AddEnv calls getEnv which forces the state to be initialized. When this happened, it was before the necessary section of code was run to allow Async to work properly in this context.
  • NamedContext was reverted back to returning (*llb.State, *image.Image, error). I found that returning *NamedContext created too tight of a binding between the dockerfile2llb and dockerui packages in a way that was really only relevant to one of the two packages. It ended up being easier to change LocalOpts in ContextOpt to AsyncLocalOpts and build the async logic into dockerui.
  • The target now always has / included as a necessary path.

@tonistiigi tonistiigi merged commit 4ab32be into moby:master Dec 11, 2023
60 checks passed
@jsternberg jsternberg deleted the filtered-named-context branch December 11, 2023 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Load just file(s) needed in build context rather than full directory tree
3 participants