fs: introduce fs.find() #5879
Conversation
See the comments from the old PR: #5878
🔥
```python
# directories since they are represented as files. This condition
# checks whether we should yield an empty list (if it is an empty
# directory) or just yield the file itself.
if len(files) == 1 and files[0] == path and self.isdir(path_info):
```
Would calling a full isdir be wasteful? Or does fsspec caching compensate for that?
No, not really. We only call isdir() when find() returns a single file, to check whether it is actually a file or just a directory. So unless we start calling walk_files() on normal files, this costs nothing. The last check simply ensures that this is indeed a directory, and would only add cost when running import-url on empty directories and the like, which I guess shouldn't be much of a big deal.
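To illustrate the ambiguity this check resolves, here is a small, hypothetical sketch (using fsspec's in-memory filesystem, not DVC's actual code): find() on a plain file returns a one-element list containing the path itself, which on some object stores is exactly what an empty "directory file" looks like, so only isdir() can tell the two apart.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.makedirs("/data", exist_ok=True)
fs.pipe_file("/data/a.txt", b"hello")

path = "/data/a.txt"
files = fs.find(path)
print(files)  # find() on a plain file returns the file itself

# The check from the patch: the single-entry result only means
# "empty directory" when the entry is the path itself AND isdir()
# confirms that the path is a directory.
is_empty_dir = len(files) == 1 and files[0] == path and fs.isdir(path)
print(is_empty_dir)  # here the path is a real file, so False
```

Here the check evaluates to False because the path really is a file; on a store where an empty directory is represented by a file-like entry, isdir() flips the answer.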
```diff
 def walk_files(self, path_info, **kwargs):
-    for file in self.ls(path_info, recursive=True):
+    for file in self.find(path_info):
         yield path_info.replace(path=file)
```
Looks like getting rid of walk_files will now be quite easy ;)
Yeah, find() is basically walk_files without path_infos.
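As a minimal sketch of that equivalence (fsspec's in-memory filesystem, made-up paths): find() returns directly the flat list of files that a walk-based walk_files would assemble by hand.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.makedirs("/proj/sub", exist_ok=True)
fs.pipe_file("/proj/a.txt", b"a")
fs.pipe_file("/proj/sub/b.txt", b"b")

# find() gives the recursive file listing directly...
found = sorted(fs.find("/proj"))

# ...which is equivalent to flattening walk() by hand, as a
# hand-rolled walk_files effectively did.
walked = sorted(
    f"{root.rstrip('/')}/{name}"
    for root, _dirs, files in fs.walk("/proj")
    for name in files
)
print(found == walked)  # True
```

The remaining difference in DVC is only that walk_files wraps each returned string back into a path_info object.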
Similar to #5683 (comment), would it be realistic to summarize the performance differences here?
To sum it up, partial status used to take 54 seconds for azure and 32 seconds for google cloud, and now it takes 4 seconds for azure and 3 seconds for google cloud. Pushing 1024 new files used to take 90/75 seconds respectively and now takes 40/51. So for partial status there is an improvement of over 15x, and for push the improvement is about 1.5x-2x for fsspec-based providers.
before:
after:
Just for reference (this is the normal s3, not the fsspec one, so nothing changed on it):
Will this show up in any dvc-bench tests?
No, dvc-bench tests will use s3 as the remote and they are very limited. These improvements are present in azure and google storage.
Would be great if we could add those at some point 😁 . @efiop Do you think this makes sense as a separate issue?
@dberenbaum Sure, it is on our TODO list; @isidentical will likely repurpose his scripts into dvc-bench benchmarks in the near future (there are some PRs already). iterative/dvc-bench#252
This is a really great start @isidentical. To my mind, though, we need somewhat more advanced (real-life) benchmarks. I would love to see cases like:
- 300K files on the remote + I'm pushing 1 file, 10 files, 1024 files (where is the threshold at which it starts listing all the files?)
- 1, 10, 1024 files on the remote and I'm pushing 300K (not specifically related to this change, but who knows? that's why we have tests)
- I have a directory with 300K files, add 1 file and push it again (+ different sizes on the remote end, up to 1M files)

I would also talk to @pmrowla - he has a lot of insights on different thresholds and optimizations that we made before.
From what I can see, it might take multiple hours per run (which needs to be doubled for also running on the baseline branch and then compared).
Okay, I got 300K off the top of my head, primarily to emphasize the scale. My point is that at 1024 files we might have some shortcuts in place, or the imbalance (remote vs. local cache) might not be enough to cover certain optimizations, etc. That's what I'm worried about.
(Also, we might not even need to run all these benchmarks every day, but it's good to have them to run when we do major updates, or before a major release.)
Ah, I see. I'll try to use it for the s3fs PR (this one was a net gain in any scenario, so I didn't bother much). Perhaps something like 16K/32K.
How are you generating data for these benchmark tests?
Random bytes, not a real dataset.
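For reference, such a dataset can be generated in a few lines; a sketch along those lines (the file count and sizes here are arbitrary choices, not the exact parameters behind the numbers quoted above):

```python
import os
import tempfile
from pathlib import Path

def generate_dataset(root, num_files=1024, size=1024):
    """Create num_files files of `size` random bytes each under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        # zero-padded names keep listings in a stable order
        (root / f"file_{i:05d}.bin").write_bytes(os.urandom(size))

root = tempfile.mkdtemp()
generate_dataset(root, num_files=16, size=64)
```

Random bytes also defeat any transfer-side compression, so the measured times reflect raw throughput rather than a compressible real dataset.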
As was mentioned in https://discord.com/channels/485586884165107732/485596304961962003/831500146772148244, s5cmd is a speedy option that I'm finding much faster than the default
What do you mean as an upper bound?
I mean that s5cmd performance would be a goal that dvc could try to get as close as possible to, since it's a library focused solely on the speed of s3 transfers. We could include it as a benchmark to compare our performance against, potentially with documented reasons why it's not realistic to reach that level of performance, and track whether dvc is getting closer to or further from that benchmark over time.
Reminds me of this one - iterative/dvc-s3#23. @tyommik has done good research there, and indeed s5cmd was 6-7x faster than the regular aws cli.
I guess so, though it might be better if we continue in a separate issue. Perhaps iterative/dvc-s3#23, since it is a bit irrelevant for
Resolves #5877. This patch also drops the ls(recursive=True) option and replaces all its usages with the new fs.find() (an fsspec-compliant API).