Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add --allow-missing for dvc commit #10524

Open
ermolaev94 opened this issue Aug 14, 2024 · 13 comments · May be fixed by #10555
Open

Add --allow-missing for dvc commit #10524

ermolaev94 opened this issue Aug 14, 2024 · 13 comments · May be fixed by #10555
Labels
A: data-management Related to dvc add/checkout/commit/move/remove feature request Requesting a new feature

Comments

@ermolaev94
Copy link

Sometimes it's necessary to update pipeline hashes and params without downloading large datasets. Currently I can't run dvc commit on machine without data even if I'd like to update my source code machine.

It would be great to have similar to dvc repro flag providing a way to ignore files that are not inside cache.

@shcheklein shcheklein added the feature request Requesting a new feature label Aug 14, 2024
@shcheklein
Copy link
Member

Makes sense, should be quite straightforward to add. I hope someone from the community can pick it up.

@shcheklein shcheklein added help wanted A: data-management Related to dvc add/checkout/commit/move/remove labels Aug 14, 2024
@anunayasri
Copy link

I can give this a shot.

@ermolaev94 I am new to the community and need some help here. I understand that you need a flag --allow-missing is dvc commit command. The flag will skip the downloading step mentioned in the pipeline definition in dvc.yaml. Am I correct? Do you want to add something to this? I see that allow-missing flag is in other commands also. Do you think you would need this feature in other cli commands also?

@shcheklein
Copy link
Member

@anunayasri thanks! the scope for this is to add --allow-missing to the dvc commit command only for now (we can see if need it somewhere else later, I would wait for some demand).

The idea is that dvc commit would ignore updating .dvc and dvc.lock file (take previous values) for DVC tracked outputs / data that doesn't exist (not dvc pull-ed) in the workspace.

I think the use case comes from a practical issue like this. Let's say we have a pipeline that has a lot of data (input), output (models), etc. It also depends on some python files (source code tracked by Git).

Let's say we added a comment in that source file and we don't want to run the pipeline again (since it doesn't change the result but takes a lot of time to download the data and run it), but we want to update dvc.lock, .dvc hashes (so that dvc status shows that everything is up to date). That's where dvc commit comes to play usually. And --allow-missing would allow people to avoid running the dvc pull before that (that can be time consuming).

Let me know if that makes sense.

@anunayasri
Copy link

Thanks for the detailed response @shcheklein . This makes sense. I will try it out in some time and revert back.

@anunayasri
Copy link

@shcheklein Can you guide me how to reproduce this issue of downloading data on dvc commit. Or point me to a doc that describes this scenario.

I have tried the following -

  1. I am following the docs for Get Started > Pipelines.
  2. I add a comment in src/featurize.py
  3. dvc status shows the file as edited.
  4. I deleted data/data.xml. I believe this is the scenario we are trying to replicate.
  5. dvc commit fails saying that the dep doesn't exist.

@skshetry
Copy link
Member

skshetry commented Sep 12, 2024

I don't think this is a good issue for contributions, as there are lot of open product questions.

@ermolaev94, if you are not aware, dvc add can already update datasets without completely downloading datasets.
See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.

We could not implement this for commit last time, because it was not clear what partial commit means for pipelines.

This PR implements virtual operation for commit: #9440.

The problem is that dvc commit is primarily meant to be used with dvc repro --no-commit to reduce cache transfers when you are quickly experimenting and don't want to save to the cache yet.

In that case, you do a lot of dvc repro --no-commit while experimenting and then do a dvc commit at the final stage, when you want to transfer all files to the cache. IIRC supporting virtual operations in commit would break this basic scenario.

@shcheklein
Copy link
Member

@skshetry just to a bit more color (and @ermolaev94 can correct me):

if you are not aware, dvc add can already update datasets without completely downloading datasets.
See https://dvc.org/doc/user-guide/data-management/modifying-large-datasets.

I think this is about pipelines primarily, not dvc add

The problem is that dvc commit is primarily meant to be used with dvc repro --no-commit to reduce cache transfers when you are quickly experimenting and don't want to save to the cache yet.

I think we advertise a few use case actually. Including the one when you change quickly a dependency and don't want to run the whole pipeline. Here is the description: https://dvc.org/doc/command-reference/commit#description

We could not implement this for commit last time, because it was not clear what partial commit means for pipelines.

could you please clarify / do you remember where / when that discussion was happening?

@skshetry
Copy link
Member

skshetry commented Sep 13, 2024

I think this is about pipelines primarily, not dvc add.

Yup, I understand that. If you read in #9440, the suggestion from Ruslan is to add support for updating pipelines through dvc add. But I prefer to keep it separate.

I think we advertise a few use case actually. Including the one when you change quickly a dependency and don't want to run the whole pipeline. Here is the description: dvc.org/doc/command-reference/commit#description

Documentation is not quite right. Please read this comment: #9389 (comment).
Ruslan explained me (quite patiently) what commit is supposed to do. It took me quite some time to understand.

I had always thought of commit to be a way to sync workspace to the cache, but it was not quite right. dvc repro --no-commit creates a "staging", and commit commits the staging to the cache. That's more of a correct explanation for commit.

could you please clarify / do you remember where / when that discussion was happening?

I hope the discussion in #9389 and #9440 captures most of the thing.
(cc @dberenbaum, maybe you remember some discussion with Ruslan on this).

@shcheklein
Copy link
Member

Documentation is not quite right. Please read this comment: #9389 (comment).
Ruslan explained me (quite patiently) what commit is supposed to do. It took me quite some time to understand.

Hmm 🤔 I'm pretty sure (I can find the original implementation discussion probably) that dvc commit was introduced (after some push from the community) to provide a faster way to update .dvc files within pipelines (there was no dvc.lock at that time. Probably Ruslan's though process is some kind of evolution after --no-commit flags were introduced - hard to tell. He was trying to reconcile names, semantics, etc in some nice way?

Anyways,I think dvc commit does atm exactly what it is saying in the docs and is useful in the those scenarios (like the one that is described in this ticket).

I had always thought of commit to be a way to sync workspace to the cache, but it was not quite right. dvc repro --no-commit creates a "staging", and commit commits the staging to the cache. That's more of a correct explanation for commit.

I would go not from Git analogs but from practical DVC-specific scenarios if possible.

E.g. pin down existing state to .dvc and dvc.lock if you don't want to run dvc repro (too expensive).

It's way easier for me to think in these terms tbh. And then if needed map to Git analogs. It might well be that the name is not perfect and we might well have discrepancy in semantics for --no-commit and dvc commit - I would need to check this tbh. It's been a while :)

@skshetry
Copy link
Member

skshetry commented Sep 13, 2024

Sometimes it's necessary to update pipeline hashes and params without downloading large datasets. Currently I can't run dvc commit on machine without data even if I'd like to update my source code machine.

Sorry, I don't think I read it very carefully. I guess the source code itself is a dependency. Currently, dvc commit supports specifying outputs or stage names. But it does not support specifying dependency, so virtual operation is not needed here.

It looks like we already support allow_missing on Repo.commit(), so it is only a matter of exposing it to through the CLI.

dvc/dvc/repo/commit.py

Lines 45 to 54 in f56343d

def commit(
self,
target=None,
with_deps=False,
recursive=False,
force=False,
allow_missing=False,
data_only=False,
relink=True,
):

Although virtual operation would help for large datasets here.

@skshetry
Copy link
Member

skshetry commented Sep 13, 2024

Anyways,I think dvc commit does atm exactly what it is saying in the docs and is useful in the those scenarios (like the one that is described in this ticket).

But it does not, not by default.

By default, dvc commit commits the files specified in .dvc and dvc.lock file.
If the workspace matches with the hash in the .dvc/dvc.lock file, it silently commits it to the cache. If it does not, it asks you what to do. Or, in case of --force, it force commits to reflect the changes from your workspace.

So, what dvc commit does by default, without prompt and without --force flag, is not mentioned in the docs or is not explicit.
These days, most of the commit usage is likely to commit changes from your workspace. But that's not how dvc works by default.

I'm pretty sure (I can find the original implementation discussion probably) that dvc commit was introduced (after some push from the community) to provide a faster way to update .dvc files within pipelines (there was no dvc.lock at that time.

Reading #919 (and, #1601), Ruslan is right here. dvc commit and repro --no-commit were introduced together.

@anunayasri
Copy link

Looks like this issue needs more discussion. I think I should stop working on it for now.

I don't think this is a good issue for contributions, as there are lot of open product questions.

@skshetry I am looking to contribute to the repo. Could you please point me to beginner friendly product questions that I can work on.

@skshetry
Copy link
Member

@skshetry I am looking to contribute to the repo. Could you please point me to beginner friendly product questions that I can work on.

Hi, I think it's okay to implement --allow-missing, which is what is being asked here.
As I mentioned above, Repo.commit() already implements this, we only need to expose this in the CLI.

To give you more information, internally, every command in dvc mirrors an API of same name in Repo class (with a few exceptions). Eg: dvc add calls Repo.add(), and dvc commit calls Repo.commit().

class CmdCommit(CmdBase):

def commit(

Repo.commit() already seem to implement allow_missing, but we also need to make sure it works.

You can take an example of dvc checkout on how it implements --allow-missing as an example.

checkout_parser.add_argument(
"--allow-missing",
action="store_true",
default=False,
help="Ignore errors if some of the files or directories are missing.",
)

anunayasri added a commit to anunayasri/dvc that referenced this issue Sep 13, 2024
@anunayasri anunayasri linked a pull request Sep 13, 2024 that will close this issue
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: data-management Related to dvc add/checkout/commit/move/remove feature request Requesting a new feature
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants