parallel download support #268

Closed
lmeyerov opened this issue Aug 19, 2021 · 1 comment · Fixed by #420

lmeyerov commented Aug 19, 2021

In benchmarks, we're finding significant speedups by using parallel downloads for MB+ individual blobs:

for a 200MB file on an az server <> az store, warm:

  • fs.download: 4.9s
  • file.write(await (await bc.download_blob()).readall()): 4.3s
  • file.write(await (await bc.download_blob(max_concurrency=16)).readall()): 1.3s
  • trickier: I think the gap is even bigger in practice, b/c the numbers above are from warm reads. Cold-read baselines were more like 10-20s.

This is on a node with 8 cores and 2 NICs, and I think < 1 Gbps
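
For anyone wanting to reproduce the comparison above, here is a minimal timing sketch. It assumes `bc` behaves like an `azure.storage.blob.aio.BlobClient` (an awaitable `download_blob(max_concurrency=...)` that returns an object with an awaitable `readall()`); the helper name `download_to_file` is hypothetical and not part of any library.

```python
import time


async def download_to_file(bc, path, max_concurrency=1):
    """Download one blob to `path` and return the elapsed seconds.

    Assumption: `bc` is duck-typed like azure.storage.blob.aio.BlobClient.
    `download_to_file` is an illustrative helper, not a library function.
    """
    start = time.perf_counter()
    # max_concurrency controls how many chunk requests run in parallel.
    downloader = await bc.download_blob(max_concurrency=max_concurrency)
    data = await downloader.readall()
    with open(path, "wb") as f:
        f.write(data)
    return time.perf_counter() - start
```

Calling it once with `max_concurrency=1` and once with `max_concurrency=16` against the same blob should give numbers comparable to the second and third bullets, keeping in mind the warm vs. cold read caveat.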

Ideally:

  • smart default: concurrency defaults to # cores x2 or x4, something like that
  • unclear: handling of recursive case
  • allow manual overrides
  • works with dask, e.g., dd.read_parquet(...)
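
One possible shape for the "smart default plus manual override" idea is sketched below. The function name `pick_concurrency`, the `multiplier` of 2, and the `cap` of 32 are all assumptions for illustration; nothing here reflects what the library actually does.

```python
import os


def pick_concurrency(override=None, cores=None, multiplier=2, cap=32):
    """Choose a download concurrency level.

    Hypothetical helper: an explicit `override` always wins; otherwise
    the default is cores * multiplier, clamped to the range [1, cap].
    """
    if override is not None:
        if override < 1:
            raise ValueError("override must be >= 1")
        return override
    if cores is None:
        # os.cpu_count() can return None on some platforms.
        cores = os.cpu_count() or 1
    return max(1, min(cores * multiplier, cap))
```

On the 8-core node from the benchmark, `pick_concurrency(cores=8)` gives 16, which matches the `max_concurrency=16` setting timed above. The cap is there so that large machines do not open an excessive number of connections per blob.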

hayesgb (Collaborator) commented Aug 22, 2021

@lmeyerov -- Really appreciate this benchmarking! I've taken a crack at this. It's currently implemented in the concurrent_io branch for:

fs.get_file()
fs.put_file()
fs.open()
AzureBlobFile.write()

When I timed this on my local machine, put_file showed a reduction of over 50% in write times, so I'd appreciate any feedback you have.
