Refactor cache export interface #5005
Comments
I think this aligns with some refactoring I previously considered. Removing the complex normalization is a big improvement IMO - we've previously encountered horrible issues where, with enough links, the complexity of the loop removal grows exponentially.
I like the proposed interface, but I think having the 2d slice here is confusing. I assume the idea is that the 0th index here is for the vertex input index? If so, could we potentially group these into a wrapper struct, something like
So if this functionality is just used to recover the missing data, I think that works. Loops can't appear in a reasonable DAG export, and there shouldn't be duplicates, so deduplication shouldn't happen either.
I think this would have the clear effect of never including dependent records of This shouldn't affect us in dagger; we worked around the base issue by changing certain digests to just not have the
Agreed - do you have an idea of what that would look like? One potential suggestion/improvement to your proposal, if we're in this kind of area: I'd like to see the entire removal of the magical Maybe this already exists today with I'd be potentially interested in taking this bit? I think it's related, but should be mostly orthogonal to the main proposal.
Should the result be for input? Can't it be for the vertex itself, so the index for
Yes, there is still some minimal amount of "normalization" but it should be possible to do it directly in the implementation of
The logic behind not putting this in cachemap was that decisions about what cache chain should be exported should be done by the cache export backend implementation, not the LLB op implementation.
I think we need to
When looking into #4942 and #4917 (comment), I think it would be better to refactor the cache export interface.
Currently, the cache export target has the following interface:
This has the advantage of being very generic and separating the scope of different components. One can call `Add` any time they have some cache checksum and create new chains of cache records.

The limitations of this interface are:
Keys with a `random:` prefix are taken out before exporting, but because the dependencies of the key are walked after the main key, they can only be removed in the normalization step of the cache manifest. This means we can walk a big cache branch and, once we reach the root, find that it had a `random:` prefix, so the whole branch is later removed because of it. Unclear how this affects #4468 (cache: ensure random prefixes are in the exported cache).

Instead I propose a more specific interface:
The main difference is that the `ExportTo()` function needs to call `Add` for all deps first and, based on that, put together a `[][]CacheLink` array before it can call `Add` for the record itself. Every `CacheLink` needs to contain a `CacheExporterRecord` from a previous call, which is otherwise an opaque object known only to the target implementation. This way it can see that if it doesn't have a complete `CacheLink` array, it can skip the whole export for that record. `Add` would only be called if the caller has at least one valid `CacheLink` for each dependency. With the bool return value, the target implementation can signify that it does not wish to cache such a key, and no `CacheExporterRecord` is made for it.

The resulting `CacheExporterRecord` can now also be cached with the `CacheKey` and reused without walking the full graph again. There does need to be a way to detect that no new keys were added anywhere in the subgraph by another concurrent build request, but at least if nothing changed, there is no need to check backlinks again.

The implementation can just return the previous record it had already created if `Add` is called with the same `dgst` for a `CacheLink.Src` that it already knows about. There shouldn't be a need to normalize, deduplicate, or check for loops in the implementation like there is now.

Because this is quite a big refactor, it might make sense to implement some debug tooling for comparing cache chains first, so that once the implementation is done we can compare the results and make sure there are no unexpected regressions.
@jedevc @sipsma