Refactor cache export interface #5005

Open
tonistiigi opened this issue Jun 8, 2024 · 2 comments

@tonistiigi
Member

While looking into #4942 and #4917 (comment), I think it would be better to refactor the cache export interface.

Currently, the cache export target has the following interface:

// CacheExporterTarget defines object capable of receiving exports
type CacheExporterTarget interface {
	// Add creates a new object record that we can then add results to and
	// connect to other records.
	Add(dgst digest.Digest) CacheExporterRecord

	// Visit marks a target as having been visited.
	Visit(target any)
	// Visited returns true if a target has previously been marked as visited.
	Visited(target any) bool
}

// CacheExporterRecord is a single object being exported
type CacheExporterRecord interface {
	AddResult(vtx digest.Digest, index int, createdAt time.Time, result *Remote)
	LinkFrom(src CacheExporterRecord, index int, selector string)
}

This has the advantage of being very generic and of separating the scope of different components: one can call Add at any time with some cache checksum and create new chains of cache records.
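
For reference, this is roughly how a caller drives the current interface. The helper name and its arguments below are illustrative, and the sketch assumes it lives next to the definitions above (e.g. in the solver package) so the digest and time imports are already in scope:

// exportKey is a hypothetical helper: it can be called at any point in the
// walk with any cache checksum, which is what makes the interface so generic.
func exportKey(t CacheExporterTarget, dgst digest.Digest, createdAt time.Time, res *Remote) CacheExporterRecord {
	rec := t.Add(dgst)                     // create (or extend) a record for this checksum
	rec.AddResult(dgst, 0, createdAt, res) // attach a result for output index 0
	return rec
}

// ...and later, from anywhere else in the walk:
//   dep := exportKey(t, depDgst, depCreatedAt, depRemote)
//   rec.LinkFrom(dep, 0, "") // backlink from dependency index 0, no selector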

The limitations of this interface are:

  • It can't be easily cached and deduplicated. There are some attempts at this in https://github.com/moby/buildkit/blob/v0.14.0-rc2/solver/exporter.go#L62-L80, but it is hard to be confident that it even works correctly. There is also no reuse of the cache-chain walk between different build requests that share part of the cache. This is probably especially important for backlinks, which have a higher performance impact because they create new cache records in boltDB.
  • Cache chains that are not exportable cannot be detected early and removed from the graph, so related nodes are still walked and unnecessary backlinks are still created for them. Such keys have the prefix random: and are taken out before exporting, but because the dependencies of a key are walked after the key itself, they can only be removed in the normalization step of the cache manifest. This means we can walk a big cache branch, and only once we reach the root do we find that it had a random: prefix, so the whole branch is later removed because of it. Unclear how it affects #4468 (cache: ensure random prefixes are in the exported cache).
  • The normalization step https://github.com/moby/buildkit/blob/v0.14.0-rc2/cache/remotecache/v1/chains.go#L49 is complicated and hard to follow. Again, this is because new additions to the graph can happen in many places, and this step needs to do all the deduplication and removal of unexportable keys.

Instead, I propose a more specific interface:

type CacheExporterTarget interface {
	Add(dgst digest.Digest, deps [][]CacheLink, results []CacheResult) (CacheExporterRecord, bool, error)
}

// opaque interface
type CacheExporterRecord interface{}

type CacheLink struct {
	Src      CacheExporterRecord
	Index    int
	Selector string
}

type CacheResult struct {
	Index     int
	CreatedAt time.Time
	Result    *Remote
}

The main difference is that the ExportTo() function needs to call Add for all deps first and, based on that, put together a [][]CacheLink array before it can call Add for the record itself. Every CacheLink needs to contain the CacheExporterRecord from a previous call, which is otherwise an opaque object known only to the target implementation. This way, if the caller does not have a complete CacheLink array, it can skip the whole export for that record: Add would only be called if the caller has at least one valid CacheLink for each dependency. With the bool return value, the target implementation can signal that it does not wish to cache such a key, in which case no CacheExporterRecord is made for it.
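
A minimal sketch of what that bottom-up walk could look like on the caller side. Everything around the Add call is an assumption: k.Deps(), dep.CacheKey, dep.Selector, k.Digest(), resultsFor, and the memo map are stand-ins for whatever the solver actually exposes; only the Add signature comes from the proposal above.

// exportRecord walks the dependencies first and only calls Add for the
// record itself once it has at least one valid CacheLink per dependency
// slot. Successfully added records are memoized so shared subgraphs are
// not walked twice.
func exportRecord(t CacheExporterTarget, k *CacheKey, memo map[*CacheKey]CacheExporterRecord) (CacheExporterRecord, bool, error) {
	if rec, ok := memo[k]; ok {
		return rec, true, nil
	}

	deps := make([][]CacheLink, len(k.Deps()))
	for i, slot := range k.Deps() {
		for _, dep := range slot {
			depRec, ok, err := exportRecord(t, dep.CacheKey, memo)
			if err != nil {
				return nil, false, err
			}
			if !ok {
				continue // this particular dependency key is not exportable
			}
			deps[i] = append(deps[i], CacheLink{Src: depRec, Index: i, Selector: dep.Selector})
		}
		if len(deps[i]) == 0 {
			// No valid link for this dependency slot: skip the whole record
			// without ever calling Add for it.
			return nil, false, nil
		}
	}

	rec, ok, err := t.Add(k.Digest(), deps, resultsFor(k))
	if err != nil || !ok {
		return nil, false, err
	}
	memo[k] = rec
	return rec, true, nil
}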

The resulting CacheExporterRecord can now also be cached with the CacheKey and reused without walking the full graph again. There does need to be a way to detect whether new keys were added anywhere in the subgraph by another concurrent build request, but at least if nothing changed, there is no need to check backlinks again.

The implementation can simply return the previously created record if Add is called with the same dgst for a CacheLink.Src it already knows about. There shouldn't be a need to normalize, deduplicate, or check for loops in the implementation like there is now.
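
On the target side, deduplication then becomes a local lookup inside Add instead of a global normalization pass. A minimal sketch of such a target; the record type and the exportable helper are made up:

// exportTarget is a hypothetical implementation of the proposed interface
// that dedups purely by digest.
type exportTarget struct {
	records map[digest.Digest]*record
}

type record struct {
	dgst    digest.Digest
	deps    [][]CacheLink
	results []CacheResult
}

func (t *exportTarget) Add(dgst digest.Digest, deps [][]CacheLink, results []CacheResult) (CacheExporterRecord, bool, error) {
	if t.records == nil {
		t.records = map[digest.Digest]*record{}
	}
	// Same digest added again (possibly reached through a different
	// CacheLink.Src): return the record that already exists instead of
	// creating a duplicate. No follow-up normalization pass is needed.
	if r, ok := t.records[dgst]; ok {
		return r, true, nil
	}
	// exportable is a made-up stand-in for whatever policy the backend
	// uses to decide that a key should not be cached.
	if !exportable(dgst) {
		return nil, false, nil
	}
	r := &record{dgst: dgst, deps: deps, results: results}
	t.records[dgst] = r
	return r, true, nil
}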


Because this is quite a big refactor, it might make sense to implement some debug tooling for comparing cache chains first, so that once the implementation is done we can compare the results and make sure there are no unexpected regressions.

@jedevc @sipsma

@jedevc
Member

jedevc commented Jun 10, 2024

I think this aligns with some refactoring I previously considered - the issue is that the internals of CacheExporterTarget at the moment are almost entirely non-apparent to the caller (which, while "neat", makes the implementation very difficult to understand).

Removing the complex normalization is a big improvement IMO - we've previously encountered horrible issues where, with enough links, the complexity of the loop removal grows exponentially.

type CacheExporterTarget interface {
	Add(dgst digest.Digest, deps [][]CacheLink, results []CacheResult) (CacheExporterRecord, bool, error)
}

I like the proposed interface, but I think having the 2d slice here is confusing. I assume the idea is that the 0th index here is for the vertex input index? If so, could we potentially group these into a wrapper struct, something like type Input struct {Links []CacheLink; Result CacheResult}, then have Add(dgst digest.Digest, inputs []Input)?
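
Spelled out, the grouped variant would look something like this (purely illustrative):

// one entry per vertex input, instead of a 2D slice
type Input struct {
	Links  []CacheLink
	Result CacheResult
}

type CacheExporterTarget interface {
	Add(dgst digest.Digest, inputs []Input) (CacheExporterRecord, bool, error)
}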

There shouldn't be a need to normalize, deduplicate or check for loops in the implementation like there is now.

So normalization essentially did this recombining process before, right? Is the reason that it isn't needed anymore that we don't decompose links in the same way, so we can preserve them and are guaranteed to get a valid graph in the first place?

If this functionality is just used to recover the missing data, I think that works. Loops can't appear in a reasonable DAG export, there shouldn't be duplicates, and so the deduplication shouldn't be needed either.

Unclear how it affects #4468

I think this would have the clear effect of never including dependent records of random: prefixes - which is really what we want (and somehow not what we seem to be getting). The main problem there is that it's totally non-apparent why we see the issue - the code is dense, and it is difficult to understand why this isn't getting normalized away.

This shouldn't affect us in Dagger; we worked around the base issue by changing certain digests to not have the random: prefix.


implement some debug tooling for comparing cache chains first

Agreed, do you have an idea of what that would look like?


One potential suggestion/improvement to your proposal if we're in this kind of area - I'd like to see the entire removal of the magical session:/random: prefixes. I think these properties should be part of the CacheMap, probably the CacheOpts map, so they can sit next to no-cache, etc.

Maybe this already exists today with llb.WithoutExportCache? And just needs to be fixed/tested/etc.

I'd be potentially interested in taking this bit? I think it's related, but should be mostly orthogonal to the main proposal.

@tonistiigi
Member Author

something like type Input struct {Links []CacheLink; Result CacheResult}

Should the result be per input? Couldn't it be for the vertex itself, so that the index of CacheResult is the output index, not the input index?
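
i.e. something like this (again just illustrative), where CacheResult.Index is the output index of the vertex:

// links stay grouped per input; results belong to the vertex itself
type Input struct {
	Links []CacheLink
}

type CacheExporterTarget interface {
	Add(dgst digest.Digest, inputs []Input, results []CacheResult) (CacheExporterRecord, bool, error)
}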

Is the reason that this isn't needed anymore that we don't decompose links in the same way, so we can preserve them, so we're guaranteed to get a valid graph in the first place?

Yes, there is still some minimal amount of "normalization", but it should be possible to do it directly in the implementation of Add, and there shouldn't be a need to rescan and analyze all of the added vertexes in a separate follow-up step.

One potential suggestion/improvement to your proposal if we're in this kind of area - I'd like to see the entire removal of the magical session:/random: prefixes. I think these properties should be part of the CacheMap, probably the CacheOpts map, so they can sit next to no-cache, etc.

The logic behind not putting this in CacheMap was that decisions about which cache chains should be exported should be made by the cache export backend implementation, not the LLB op implementation.

Agreed, do you have an idea of what that would look like?

I think we need to:

  1. save the plaintext for the cache checksum in some debugging area so it can be looked up by digest (see the sketch after this list)
  2. add the ability to print the cache graph with these plaintexts available for debugging. Maybe this should be a simple web interface where you get the graph and can click on nodes to get the full plaintext, the list of results, and buttons to move to the parents or children of the step.
  3. optionally, it would be helpful to compare two such graphs (or maybe the whole local boltdb) to see where they diverge and what the reason was behind a different cache checksum.
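
For 1., something like this could be enough to start with (the names are made up, it assumes sync and github.com/opencontainers/go-digest are imported, and in practice it could just as well be a bucket in the local boltdb):

// plaintextStore is a hypothetical debugging store: it maps a cache
// checksum to the plaintext it was computed from, so any digest seen in
// an exported cache graph can be explained later.
type plaintextStore struct {
	mu sync.Mutex
	m  map[digest.Digest][]byte
}

func (s *plaintextStore) Save(dgst digest.Digest, plaintext []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.m == nil {
		s.m = map[digest.Digest][]byte{}
	}
	s.m[dgst] = append([]byte(nil), plaintext...) // copy so the caller can reuse the buffer
}

func (s *plaintextStore) Lookup(dgst digest.Digest) ([]byte, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	p, ok := s.m[dgst]
	return p, ok
}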
