fs: introduce fs.find() #5879
Conversation
See the comments from the old PR: #5878
🔥
```python
# directories since they are represented as files. This condition
# checks whether we should yield an empty list (if it is an empty
# directory) or just yield the file itself.
if len(files) == 1 and files[0] == path and self.isdir(path_info):
```
Would calling a full isdir be wasteful? Or does fsspec caching compensate for that?
No, not really. We only call isdir() when find() returns a single file, to check whether it is actually a file or just a directory. So unless we start calling walk_files() on normal files, this costs nothing. The last check simply ensures that this is indeed a directory, and would only add cost when running import-url on empty directories and the like, which I guess shouldn't be much of a big deal.
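To illustrate the ambiguity this check resolves, here is a small, hypothetical sketch (using fsspec's in-memory filesystem, not DVC's actual code): find() on a plain file returns a one-element list containing the path itself, which on some object stores is exactly what an empty "directory file" looks like, so only isdir() can tell the two apart.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.makedirs("/data", exist_ok=True)
fs.pipe_file("/data/a.txt", b"hello")

path = "/data/a.txt"
files = fs.find(path)
print(files)  # find() on a plain file returns the file itself

# The check from the patch: the single-entry result only means
# "empty directory" when the entry is the path itself AND isdir()
# confirms that the path is a directory.
is_empty_dir = len(files) == 1 and files[0] == path and fs.isdir(path)
print(is_empty_dir)  # here the path is a real file, so False
```

Here the check evaluates to False because the path really is a file; on a store where an empty directory is represented by a file-like entry, isdir() flips the answer.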
```diff
 def walk_files(self, path_info, **kwargs):
-    for file in self.ls(path_info, recursive=True):
+    for file in self.find(path_info):
         yield path_info.replace(path=file)
```
Looks like getting rid of walk_files will now be quite easy ;)
Yeah, find() is basically walk_files without path_infos.
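As a minimal sketch of that equivalence (fsspec's in-memory filesystem, made-up paths): find() returns directly the flat list of files that a walk-based walk_files would assemble by hand.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.makedirs("/proj/sub", exist_ok=True)
fs.pipe_file("/proj/a.txt", b"a")
fs.pipe_file("/proj/sub/b.txt", b"b")

# find() gives the recursive file listing directly...
found = sorted(fs.find("/proj"))

# ...which is equivalent to flattening walk() by hand, as a
# hand-rolled walk_files effectively did.
walked = sorted(
    f"{root.rstrip('/')}/{name}"
    for root, _dirs, files in fs.walk("/proj")
    for name in files
)
print(found == walked)  # True
```

The remaining difference in DVC is only that walk_files wraps each returned string back into a path_info object.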
Similar to #5683 (comment), would it be realistic to summarize the performance differences here?
To sum it up, partial status used to take 54 seconds for azure and 32 seconds for google cloud, and now it takes 4 seconds for azure and 3 seconds for google cloud. Pushing 1024 new files used to take 90/75 seconds respectively and now takes 40/51. So for partial status there is an improvement of over 15x, and for push the improvement is about 1.5x-2x for fsspec-based providers.
before:
after:
Just for reference (this is the normal s3, not the fsspec one, so nothing changed on it):
Will this show up in any dvc-bench tests?
No, dvc-bench tests will use s3 as the remote and they are very limited. These improvements are present in azure and google storage.
Would be great if we could add those at some point 😁 . @efiop Do you think this makes sense as a separate issue?
@dberenbaum Sure, it is on our TODO list; @isidentical will likely repurpose his scripts into dvc-bench benchmarks in the near future (there are some PRs already). iterative/dvc-bench#252
This is a really great start @isidentical. To my mind, though, we need somewhat more advanced (real-life) benchmarks. I would love to see cases like:
- 300K files on the remote + I'm pushing 1 file, 10 files, 1024 files (where is the threshold at which it starts listing all the files?)
- 1, 10, 1024 files on the remote and I'm pushing 300K (not specifically related to this change, but who knows? that's why we have tests)
- I have a directory with 300K files, add 1 file and push it again (+ different sizes on the remote end, up to 1M files)

I would also talk to @pmrowla - he has a lot of insights on different thresholds and optimizations that we made before.
From what I can see, it might take multiple hours per run (which needs to be doubled for also running on the baseline branch and then compared).
Okay, I got 300K off the top of my head, primarily to emphasize the scale. My point is that at 1024 files we might have some shortcuts in place, or the imbalance (remote vs. local cache) might not be enough to cover certain optimizations, etc. That's what I'm worried about.
(Also, we might not even need to run all these benchmarks every day, but it's good to have them to run when we do major updates, or before a major release.)
Ah, I see. I'll try to use it for the s3fs PR (this one was a net gain in any scenario, so I didn't bother much). Perhaps something like 16K/32K.
How are you generating data for these benchmark tests?
Random bytes, not a real dataset.
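For reference, such a dataset can be generated in a few lines; a sketch along those lines (the file count and sizes here are arbitrary choices, not the exact parameters behind the numbers quoted above):

```python
import os
import tempfile
from pathlib import Path

def generate_dataset(root, num_files=1024, size=1024):
    """Create num_files files of `size` random bytes each under root."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        # zero-padded names keep listings in a stable order
        (root / f"file_{i:05d}.bin").write_bytes(os.urandom(size))

root = tempfile.mkdtemp()
generate_dataset(root, num_files=16, size=64)
```

Random bytes also defeat any transfer-side compression, so the measured times reflect raw throughput rather than a compressible real dataset.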
As was mentioned in https://discord.com/channels/485586884165107732/485596304961962003/831500146772148244, s5cmd is a speedy option that I'm finding much faster than the default
What do you mean as an upper bound?
I mean that s5cmd performance would be a goal that dvc could try to get as close as possible to, since it's a library focused solely on the speed of s3 transfers. We could include it as a benchmark to compare our performance against, potentially with documented reasons why it's not realistic to reach that level of performance, and track whether dvc is getting closer to or further from that benchmark over time.
Reminds me of this one - iterative/dvc-s3#23. @tyommik has done good research there, and indeed s5cmd was 6-7x faster than the regular aws cli.
I guess so, though it might be better if we continue in a separate issue. Perhaps iterative/dvc-s3#23, since it is a bit irrelevant for
Resolves #5877. This patch also drops the ls(recursive=True) option and replaces all its usages with the new fs.find() (an fsspec-compliant API).