[backend] the cache server does not check whether the cached artifacts have been deleted #7939
Comments
I'm not entirely sure about v1 behavior, but in v2 we don't make any assumption on whether an artifact has content in its path. For instance, you could disable caching when submitting a pipeline run; the SDK client has a parameter for that.
This is a fix for v1 behavior only; I do not know about v2 behavior. The pipeline just breaks if the S3 content is deleted, so somehow that must be fixed. Do you have another proposal?
No, I think it's by design. If users delete artifacts, shouldn't that be considered user error?
I believe that @juliusvonkohout raised a very important issue here. In many scenarios users are legally required to clean up old records, and keeping the cachedb in sync is definitely a pain point. In the past, the kfp project has also not considered this a user error, and it was even recommended as best practice for production-grade deployments by @Bobgy (#6204 (comment)).
Yes, @chensun please reopen, it is clearly a bug.
@chensun as a follow-up to the meeting, I do not see a global setting here: https://www.kubeflow.org/docs/components/pipelines/overview/caching/
Discussed this in today's KFP community meeting, and I want to summarize it here:
First of all, what you quoted from @Bobgy (#6204 (comment)) doesn't conflict with what I said above. I didn't mean that the act of deleting artifacts from cloud storage is itself a user error. My reply is within the specific context of this issue: deleting the artifacts while relying on caching to continue working is a user error. It's like in C: you have a pointer that points to some allocated memory; freeing the memory is absolutely normal, but continuing to access the pointer would be a user error.

If users delete artifacts periodically, they should probably configure pipeline runs with a cache staleness setting (understood it's different from cache lifecycle, but I think it still solves this use case). Or they can disable caching for new runs after artifact deletion. From this point of view, this is not a bug.

I'm against the proposed fix because, as I mentioned, an artifact has a remote storage URI and metadata, both of which are optional. We can't assume there's content at the URI, and I'm not sure we can assume the URI always exists even if users never write to it--it's rather an implementation detail that may vary for different platforms. We could check and verify this, but I don't think the cache server should care about an external system at all.

I personally would prefer to improve the caching configuration story--if the existing caching setting is not that obvious or easy to use, maybe we can add a global config. Or, as @connor-mccarthy mentioned, maybe we provide a way for users to delete cache entries in the database--it could be as simple as a script to begin with, as long as it's well documented.
I think the difference is that the user is not actively "relying on caching". For the user it just happens automatically when using the same component definition and arguments. Having this trigger automatically but then fail is confusing for users. If I understand you correctly, you are suggesting that a user should manually disable caching when the corresponding artifact was deleted. But how would that work in practice? Does the user have to figure out the component definition and arguments that produced the artifact and note them down, so they can disable caching whenever accidentally using the same definitions again in the future? It would be impossible to keep track of all items that the cache server still considers cached but that are already deleted from storage (the cachedb is always growing and never cleaned, after all). I agree that there are other approaches to solve this problem. For example, you could have an option somewhere in the UI to delete an artifact, which would clean the artifact from S3, mlmd and the cachedb (but then again, that's not supported by mlmd as far as I know). But recommending to delete artifacts from S3 is incompatible with the current caching behaviour in my eyes.
We cannot expect the user to set it for every pipeline themselves. That is too error-prone, as @MatthiasCarnein explains. We need a global expiration setting (maximum cache lifetime), for example an environment variable in the cache server that is respected by V1 and V2 pipelines. This setting could also enforce the deletion of old database entries according to the expiration setting, so we could reuse the parts of my PR that clean the database entries (cachedb and maybe mlmd for v2). Then we can use a lifecycle policy on the object storage with the same timeframe. The user can use a cache staleness up to the cache expiration duration then; it would be a user error if they manually delete artifacts before the lifecycle policy does. Does this sound reasonable to you? A sketch of what that global setting could look like is given below.
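A minimal sketch of what such a global setting could look like inside the cache server (the MAX_CACHE_STALENESS variable name and the helper wrapping it are assumptions for illustration; getMaxCacheStaleness refers to the existing ISO-8601 duration parser discussed later in this thread):

```go
// Sketch only: read a cluster-wide maximum cache staleness (ISO-8601
// duration, e.g. "P30D") from the cache server's pod environment and fall
// back to -1, i.e. "never expires", which is the current default behavior.
func globalMaxCacheStalenessInSeconds() int64 {
	if v, ok := os.LookupEnv("MAX_CACHE_STALENESS"); ok && v != "" {
		// getMaxCacheStaleness is assumed to be the existing helper that
		// converts an ISO-8601 duration string into seconds.
		return getMaxCacheStaleness(v)
	}
	return -1
}
```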
That's a fair argument. On the other hand, though, I don't think it's too much to ask of users to understand that KFP has caching on by default. After all, I perceive KFP as an advanced developer tool. Users probably should understand the possible impacts of deleting artifacts.
In practice, how do users delete artifacts? I guess they don't cherry-pick specific artifacts to delete but rather delete artifacts in bulk under some root path. Then they could disable caching for all their pipelines or a set of them. If they do delete artifacts in a precise, cherry-picking manner, they probably already track which pipelines and inputs produced those artifacts.
Indeed, if KFP provides a feature for users to delete artifacts, which is what Bobgy suggested in the other thread, then it should be responsible for cleaning up all the related items that should and can be deleted--cache entries definitely fall into this category. In summary, I'm supportive of the following options:
Shall I implement the last item or do you want to do it? It would just be an environment variable MAX_CACHE_STALENESS in the cache-server, derived from the configmap pipeline-install-config, that uses the same syntax as the per-pipeline cache settings (e.g. P30D). Then in the cache server we modify pipelines/backend/src/cache/server/mutation.go (lines 120 to 129 in 555447f)
and pipelines/backend/src/cache/server/watcher.go (lines 129 to 136 in 555447f)
to read the environment variable instead of -1. For example:

```go
if exists {
	maxCacheStalenessInSeconds = getMaxCacheStaleness(maxCacheStaleness)
} else {
	// fall back to the global default from the MAX_CACHE_STALENESS pod environment variable
	maxCacheStalenessInSeconds = getMaxCacheStaleness(os.Getenv("MAX_CACHE_STALENESS"))
}
```

Then in pipelines/backend/src/cache/storage/execution_cache_store.go (lines 38 to 47 in 555447f)
and pipelines/backend/src/cache/storage/execution_cache_store.go (lines 79 to 93 in 555447f)
it will just be used properly. So we add three lines and an environment variable via the global configmap to satisfy most of the use cases. If desired, we could also add a CACHE_EXPIRATION variable and delete the database entry with pipelines/backend/src/cache/storage/execution_cache_store.go (lines 137 to 143 in 555447f),
for example after pipelines/backend/src/cache/storage/execution_cache_store.go (lines 61 to 89 in 555447f).
Maybe it would even be better if we just periodically executed an SQL command that cleans all entries older than CACHE_EXPIRATION from the cachedb table, as sketched below.
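A rough sketch of such a periodic sweep, assuming the cachedb table is called execution_caches and has a started_at_in_sec column (both names are guesses for illustration, not verified against the actual schema):

```go
// Delete every cache entry older than the configured CACHE_EXPIRATION,
// once per hour, using plain database/sql.
func startCacheExpirationWorker(db *sql.DB, expiration time.Duration) {
	ticker := time.NewTicker(time.Hour)
	go func() {
		for range ticker.C {
			cutoff := time.Now().Add(-expiration).Unix()
			_, err := db.Exec(
				"DELETE FROM execution_caches WHERE started_at_in_sec < ?", cutoff)
			if err != nil {
				log.Printf("cache expiration sweep failed: %v", err)
			}
		}
	}()
}
```

(requires the database/sql, log and time imports)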
And is there documentation for V2 caching? See pipelines/backend/src/cache/server/mutation.go (lines 97 to 101 in 555447f).
Kubeflow definitely lacks the ability to manage artifacts, and I'm glad I am not the only one thinking so (also see #5783). I'd love to see invalidating the cache and deleting artifacts in the UI become a feature. However, MLMD doesn't seem to support deleting artifacts anytime soon, and it wouldn't help us manage Kubeflow's cache anyway. Thus, I've written a short PoC suggesting a way this feature could be implemented. Its integration should be self-explanatory. Code and concept are still at a very early stage. For more info on how the backend manages the cache, see here. I'm looking for feedback and discussion on this proposal and will gladly contribute to implementing it.
@juliusvonkohout Thank you for the write-up. Given this is a nontrivial user interface change, I would suggest you write a design doc so that we can conduct a proper design review with our team and the community.
Thanks @TobiasGoerke, and same suggestion: please write a design doc so that we can review it properly.
I've created a draft. What's the workflow for proposals, and where should I post this one? Thanks.
@TobiasGoerke just write to me or Diana Atanasova on the Kubeflow Slack. We have some experience with those drafts.
Is a job-level caching config an option? The global caching config might be inflexible if different pipelines' artifacts have different TTLs, especially if multiple users share the same environment.
You can still set a per-pipeline and per-step cache staleness. Please check #8270 (comment) and the design document linked there.
Sadly, we have to be agnostic of the backend until the new unified storage architecture from #7725 (comment) is there. So I got #8270 in instead of #7938. V2 pipelines are not yet covered.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
@juliusvonkohout could we have this reopened?
/reopen
@juliusvonkohout: Reopened this issue. In response to this: "/reopen"
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/lifecycle frozen
Environment
Kubeflow 1.5.1
Steps to reproduce
The cache server wrongly states that a step is cached even if the artifacts have been deleted from S3. We propose #7938, which checks whether the folder actually still exists on S3. If the folder does not exist, we do not wrongfully pretend that the step is cached; instead we delete the cachedb entry and let the pipeline step run again. A sketch of the kind of check we have in mind follows below.
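A sketch of the proposed check, assuming the minio-go v7 client is available in the cache server (the actual implementation in #7938 may differ): before declaring a step cached, verify that at least one object still exists under the artifact prefix.

```go
import (
	"context"

	"github.com/minio/minio-go/v7"
)

// artifactPrefixExists returns true if at least one object still exists
// under the given prefix; it returns false (so the step reruns) when the
// prefix is empty or the listing fails.
func artifactPrefixExists(ctx context.Context, client *minio.Client, bucket, prefix string) bool {
	opts := minio.ListObjectsOptions{Prefix: prefix, Recursive: true}
	for obj := range client.ListObjects(ctx, bucket, opts) {
		if obj.Err != nil {
			// Treat listing errors conservatively: assume the artifact is
			// gone rather than pretending the step is still cached.
			return false
		}
		return true // found at least one object under the prefix
	}
	return false // prefix is empty: the cached artifacts were deleted
}
```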