Deleting/Cleanup older TFX runs #69
Comments
Currently there are no supported APIs for deleting provenance information, because deleting some of the nodes can easily break the lineage. A best practice is to use different MySQL db instances to partition pipeline runs, if they are independent. Quick question: how many pipelines are currently stored in your db, and which APIs are most affected in your case? |
Thanks for the reply. Would it be a good idea to add this as a feature request? At some point there will be a need to delete runs from the database, and since the data for a single run is stored across multiple tables, a cleanup API would be useful. Currently, I'm using a single database to store all the pipelines (one pipeline per customer). Each pipeline has a unique name. There are more than 15,000 pipelines, each with hundreds of runs of its own. My requirement is to be able to delete older runs in each pipeline based on some filter criteria. |
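To make "filter criteria" concrete, here is a minimal sketch of one possible retention rule: keep the newest N runs of every pipeline, and flag the rest once they pass an age cutoff. It only selects candidates and deletes nothing. The `Run` record and `select_deletion_candidates` are hypothetical names for illustration, not part of MLMD or TFX, and the run list is assumed to have been read out of the store by other means.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Run:
    """Hypothetical, simplified view of one pipeline run (not an MLMD type)."""
    pipeline: str
    run_id: str
    created: datetime


def select_deletion_candidates(runs, now, max_age, keep_latest):
    """Return runs older than max_age, always sparing the newest keep_latest per pipeline."""
    by_pipeline = defaultdict(list)
    for run in runs:
        by_pipeline[run.pipeline].append(run)

    candidates = []
    for pipeline_runs in by_pipeline.values():
        # Newest first, so the slice below skips the runs we always keep.
        ordered = sorted(pipeline_runs, key=lambda r: r.created, reverse=True)
        for run in ordered[keep_latest:]:
            if now - run.created > max_age:
                candidates.append(run)
    return candidates
```

For example, `select_deletion_candidates(runs, datetime.now(), timedelta(days=90), keep_latest=5)` would flag runs older than 90 days while leaving the five most recent runs of every pipeline untouched.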
I think it is a good idea to explore the alternatives. Note that pipelines and runs, and how they are used by other runs, are defined at the application level, i.e., TFX in this case. The scope of the subgraph to be deleted needs to be defined carefully. Let's add a FR in TFX and discuss the caveats, alternatives, and potential tooling (e.g., CLI, APIs) for this deployment mode, where all pipelines are kept in a single db.
For example, abandoning a run of a pipeline may be tricky in TFX: the run may have used an artifact that was generated in a previous run, so we probably need to at least keep artifacts generated by other runs. /cc some tfx folks: @ruoyu90 , @1025KB
Note that keeping the runs helps to reason about provenance, e.g., which jobs used a particular dataset. Apart from using separate dbs to isolate the runs, another alternative is to improve the performance of the APIs that TFX uses. In which phases of your TFX runs have you noticed the performance degradation? |
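The caveat about artifacts shared between runs can be sketched as a small set computation. This is only an illustration on a simplified model: `Event` and `deletable_artifacts` are hypothetical names, not MLMD or TFX APIs, and each event is assumed to record that a run either consumed (`"INPUT"`) or produced (`"OUTPUT"`) an artifact. Under that assumption, an artifact is a deletion candidate only if the run being abandoned produced it and no other run consumed it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical, simplified lineage edge between a run and an artifact."""
    run_id: str
    artifact_id: int
    kind: str  # "INPUT" (run consumed the artifact) or "OUTPUT" (run produced it)


def deletable_artifacts(events, run_to_delete):
    """Artifacts produced by run_to_delete that no other run consumed."""
    produced = {
        e.artifact_id
        for e in events
        if e.run_id == run_to_delete and e.kind == "OUTPUT"
    }
    used_elsewhere = {
        e.artifact_id
        for e in events
        if e.run_id != run_to_delete and e.kind == "INPUT"
    }
    # Artifacts the run merely consumed are never in `produced`, so they are kept.
    return produced - used_elsewhere
```

Artifacts that the abandoned run only read from earlier runs never appear in the result, which matches the requirement to keep artifacts generated by other runs. A real tool would also have to decide what to do with the run's own executions, contexts, and events.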
I've encountered another issue here #74 regarding hitting some maximum size limit. I believe isolating the dbs would not help in this case since it's for a single pipeline so some sort of cleanup would be necessary. |
Can we repurpose this issue to be generic "Deleting/Cleanup old MLMD entries"? |
/cc KFP folks on the pipeline deletion tools too @neuromage |
I know that the original issue was created almost 2 years ago but any luck with this functionality? |
I'm using a MySQL db to store the artifacts generated by TFX runs.
TFX has been in production for a while. Since many pipelines have run, the MLMD database is filling up. Because the tables are large, the performance of the database has degraded as well.
Is there a way to programmatically and gracefully delete older runs to free up storage and improve DB performance?