
fs: introduce fs.find() #5879

Merged
merged 1 commit into master on Apr 28, 2021
Conversation

@isidentical (Contributor) commented Apr 26, 2021

Resolves #5877. This patch also drops the ls(recursive=True) option and replaces all its usages with the new fs.find() (an fsspec-compliant API).
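For context, fsspec's find() recursively lists all files under a prefix, unlike the single-level ls(). A minimal usage sketch (the "memory" backend and paths are purely illustrative):

```python
import fsspec

# Any fsspec-compliant filesystem exposes the same API; "memory" needs no credentials.
fs = fsspec.filesystem("memory")
fs.pipe("/data/a.txt", b"hello")
fs.pipe("/data/sub/b.txt", b"world")

# find() recursively returns file paths (no directories) under the prefix,
# whereas ls() only lists a single level by default.
print(fs.find("/data"))  # ['/data/a.txt', '/data/sub/b.txt']
```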

@isidentical (Author)

See the comments from the old PR: #5878

@efiop changed the title from "[WIP] fs: introduce tree.find()" to "[WIP] fs: introduce fs.find()" on Apr 26, 2021
@efiop (Contributor) left a comment:

🔥

# directories since they are represented as files. This condition
# checks whether we should yield an empty list (if it is an empty
# directory) or just yield the file itself.
if len(files) == 1 and files[0] == path and self.isdir(path_info):
@efiop (Contributor):

Would calling the full isdir() be wasteful? Or does fsspec caching compensate for that?

@isidentical (Author):

No, not really. We only call isdir() when find() returns the path itself as a file, to check whether it is actually a file or a directory represented as one. So unless we start calling walk_files() on regular files, this costs nothing.

The last check simply ensures that this is indeed a directory, and would only cost anything when import-url-ing empty directories and the like, which I guess shouldn't be much of a big deal.
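For readers following along, a simplified reconstruction of the logic under discussion (not DVC's actual implementation; `fs` stands in for any fsspec filesystem, and the function name is made up):

```python
def find_files(fs, path):
    """Sketch: recursively list files under `path`, special-casing the
    empty-directory markers that object stores report as files."""
    files = fs.find(path)
    # On object stores, an empty directory can surface as a single
    # zero-byte "file" whose name equals the queried path itself;
    # only in that case is the extra isdir() call made, to disambiguate.
    if len(files) == 1 and files[0] == path and fs.isdir(path):
        return []  # empty directory: nothing to yield
    return files
```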

Comment on lines 115 to 117

```diff
 def walk_files(self, path_info, **kwargs):
-    for file in self.ls(path_info, recursive=True):
+    for file in self.find(path_info):
         yield path_info.replace(path=file)
```
@efiop (Contributor):

Looks like getting rid of walk_files will now be quite easy ;)

@isidentical (Author):

Yeah, find() is basically walk_files() without the path_infos.

@dberenbaum (Collaborator)

Similar to #5683 (comment), would it be realistic to summarize the performance differences here?

@isidentical (Author)

> Similar to #5683 (comment), would it be realistic to summarize the performance differences here?

Yeah, we were talking about this with @efiop today. I'll probably end up finalizing a framework where I can automate running these kinds of benchmarks.

@isidentical (Author)

To sum it up: the partial status used to take 54 seconds on Azure and 32 seconds on Google Cloud, and now takes 4 and 3 seconds respectively. Pushing 1024 new files used to take 89/76 seconds respectively, and now takes 41/51. So for partial status the improvement is roughly 8x-12x, and for the partial push it is about 1.5x-2x, for the fsspec-based providers.

before:

=====================================azure======================================                                                    
    Story: basic data cloud                                                                                                         
        push (1024 small files) took 21.9323 seconds                                                                                
        pull (1024 small files) took 34.3765 seconds                                                                                
    Story: cloud status
        fresh status (nothing missing on the remote) took 1.9444 seconds                                                            
        status (1024 files missing on the remote) took 54.6923 seconds                                                              
        push only new files (1024 new small files / 1024 existing small files) took 89.0522 seconds                                 
=======================================gs=======================================                                                    
    Story: basic data cloud                                                                                                         
        push (1024 small files) took 29.1087 seconds
        pull (1024 small files) took 32.3727 seconds
    Story: cloud status
        fresh status (nothing missing on the remote) took 2.5915 seconds
        status (1024 files missing on the remote) took 32.7394 seconds
        push only new files (1024 new small files / 1024 existing small files) took 76.1696 seconds

after:

=====================================azure======================================                                                    
    Story: basic data cloud                                                                                                         
        push (1024 small files) took 19.1998 seconds                                                                                
        pull (1024 small files) took 33.443 seconds                                                                                 
    Story: cloud status                                                                                                             
        fresh status (nothing missing on the remote) took 1.8277 seconds                                                            
        status (1024 files missing on the remote) took 4.4486 seconds                                                               
        push only new files (1024 new small files / 1024 existing small files) took 40.8264 seconds                                 
=======================================gs=======================================
    Story: basic data cloud
        push (1024 small files) took 27.9613 seconds
        pull (1024 small files) took 30.029 seconds
    Story: cloud status
        fresh status (nothing missing on the remote) took 2.5847 seconds
        status (1024 files missing on the remote) took 3.8758 seconds
        push only new files (1024 new small files / 1024 existing small files) took 51.135 seconds

Just for reference (this is the regular S3 implementation, not an fsspec-based one, so nothing changed here):

=======================================s3=======================================                                                    
    Story: basic data cloud                                                                                                         
        push (1024 small files) took 40.4514 seconds                                                                                
        pull (1024 small files) took 61.8855 seconds                                                                                
    Story: cloud status
        fresh status (nothing missing on the remote) took 2.7203 seconds                                                            
        status (1024 files missing on the remote) took 5.2392 seconds                                                               
        push only new files (1024 new small files / 1024 existing small files) took 76.0406 seconds                                 

@isidentical changed the title from "[WIP] fs: introduce fs.find()" to "fs: introduce fs.find()" on Apr 28, 2021
@efiop merged commit 2172add into master on Apr 28, 2021
@efiop deleted the fsspec-use-find branch on Apr 28, 2021
@dberenbaum (Collaborator)

Will this show up in any dvc-bench tests?

@isidentical (Author)

> Will this show up in any dvc-bench tests?

No, the dvc-bench tests use S3 as the remote and are very limited. These improvements apply to Azure and Google Cloud Storage.

@dberenbaum (Collaborator)

Would be great if we could add those at some point 😁. @efiop, do you think this makes sense as a separate issue?

@efiop (Contributor) commented Apr 28, 2021

@dberenbaum Sure, it's on our TODO list; @isidentical will likely repurpose his scripts into dvc-bench benchmarks in the near future (there are some PRs already): iterative/dvc-bench#252

@shcheklein (Member)

This is a really great start, @isidentical.

We need a bit more advanced (real-life) benchmarks, to my mind, though. I would love to see cases like:

  • 300K files on the remote while I'm pushing 1, 10, or 1024 files (where is the threshold at which it starts listing all the files?);
  • 1, 10, or 1024 files on the remote while I'm pushing 300K (not specifically related to this change, but who knows? that's why we have tests);
  • a directory with 300K files to which I add 1 file and push again (+ different sizes on the remote end, up to 1M files);
  • compare this with some baseline - S3 tools optimized to maximum performance (basically, when the network becomes the limiting factor).

I would also talk to @pmrowla - he has a lot of insights on the different thresholds and optimizations we made before.

@isidentical (Author)

> We need a bit more advanced (real-life) benchmarks, to my mind, though. I would love to see cases like: [...]

From what I can see, it might take multiple hours per run (which would need to be doubled to also run on the baseline branch for comparison).

@shcheklein (Member)

Okay, I got 300K off the top of my head, primarily to emphasize the scale. My point is that at 1024 files we might have some shortcuts in place, or the imbalance (remote vs. local cache) might not be enough to exercise certain optimizations, etc. That's what I'm worried about.

@shcheklein (Member)

(Also, we might not even need to run all these benchmarks every day, but it's good to have them to run when we do major updates or before a major release.)

@isidentical (Author)

Ah, I see. I'll try to use that for the s3fs PR (this one was a net gain in every scenario, so I didn't bother much). Perhaps something like 16K/32K files.

@efiop added the "optimize" and "refactoring" labels on Apr 29, 2021
@dberenbaum (Collaborator)

How are you generating data for these benchmark tests?

@isidentical (Author)

Random bytes, not a real dataset.
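For reference, a sketch of how such synthetic data could be generated (the function name and sizes here are assumptions, not the actual benchmark code):

```python
import os
from pathlib import Path

def generate_dataset(directory, num_files=1024, file_size=1024):
    # Fill `directory` with `num_files` files of `file_size` random bytes
    # each, roughly mimicking the synthetic data used in these benchmarks.
    out = Path(directory)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(num_files):
        (out / f"file_{i:04d}.bin").write_bytes(os.urandom(file_size))

generate_dataset("data")  # then e.g.: dvc add data && dvc push
```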

@dberenbaum (Collaborator)

> • compare this with some baseline - S3 tools optimized to maximum performance (basically, when the network becomes the limiting factor).

As was mentioned in https://discord.com/channels/485586884165107732/485596304961962003/831500146772148244, s5cmd is a speedy option that I've found much faster than the default aws s3 cp when playing around. What about using it as an upper bound on performance for S3 transfers?

@isidentical (Author)

What do you mean by an upper bound?

@dberenbaum (Collaborator)

I mean that s5cmd performance would be a goal that DVC could try to get as close as possible to, since it's a tool focused solely on the speed of S3 transfers. We could include it as a benchmark to compare our performance against, document the specific reasons why it's not realistic to reach that level, and track whether DVC is getting closer to or further from that benchmark over time.
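A hypothetical harness for that kind of comparison (bucket names and paths are placeholders; this is only a sketch of the idea, not dvc-bench code):

```python
import subprocess
import time

def time_cmd(cmd):
    # Rough wall-clock timing of an external command; coarse, but enough
    # for comparing transfer tools at this scale.
    start = time.monotonic()
    subprocess.run(cmd, check=True)
    return time.monotonic() - start

# s5cmd as the "network-bound" baseline, dvc push as the measured tool.
baseline = time_cmd(["s5cmd", "cp", "data/*", "s3://my-bucket/data/"])
measured = time_cmd(["dvc", "push"])
print(f"s5cmd: {baseline:.1f}s, dvc push: {measured:.1f}s")
```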

@shcheklein (Member)

Reminds me of this one: iterative/dvc-s3#23. @tyommik did some good research there, and indeed s5cmd was 6-7x faster than the regular AWS CLI.

@isidentical (Author)

I guess so, though it might be better if we continue in a separate issue - perhaps iterative/dvc-s3#23, since this is a bit irrelevant to find() (it doesn't change anything S3-wise).
