
dvc push to Azure Blob storage is very slow #54

Closed
mitjaalge opened this issue Jul 28, 2023 · 19 comments · Fixed by iterative/dvc-objects#218

@mitjaalge commented Jul 28, 2023

Bug Report

dvc push: very slow when pushing to an Azure Blob Storage remote

Description

When pushing larger files (~700 MB) to an Azure Blob Storage remote, I'm experiencing very slow speeds (3-4 min for a single file, i.e. ~4 MB/s). The same file takes around 10 s to upload (~70 MB/s) when using AzCopy (https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs-upload).
Is this to be expected, or am I doing something wrong?

Reproduce

  1. dvc init
  2. dvc remote add -d myremote azure://dvc/
  3. dvc remote modify --local myremote connection_string 'BlobEndpoint=h...'
  4. add new large file (large_file.wav) to folder (~700 MB)
  5. dvc add large_file.wav
  6. dvc push

Expected

Fast upload speed ~ 70 MB/s.

Environment information

DVC Version: 3.1.0
Python Version: 3.8.10
OS: tried on MacOS Ventura 13.1 and Ubuntu 20.04.6 LTS

Thank you very much for the help!

@efiop transferred this issue from iterative/dvc on Jul 28, 2023
@shcheklein (Member)

Is this to be expected, or am I doing something wrong?

I don't think it's expected.

Could you please share some additional info: the output of dvc doctor and pip freeze?

Could you, as an experiment, also try using https://github.com/fsspec/adlfs directly? (I think we use it in DVC as the main library that does the actual data transfer.) If it can be reproduced with adlfs directly, it'll be easier to identify the reason and address it. Thanks for your help in advance.

@mitjaalge (Author) commented Jul 31, 2023

Hi @shcheklein, thanks for your quick response!
Here is the information you requested; it seems the problem does not lie with DVC but with fsspec/adlfs:

dvc doctor

DVC version: 3.10.1 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-5.15.0-76-generic-x86_64-with-glibc2.29
Subprojects:
	dvc_data = 2.8.1
	dvc_objects = 0.24.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.1.0
Supports:
	azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
	Global: /home/mitja/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9652ec2be57d605a3c6e5af8060f83de

pip freeze

adlfs==2023.4.0
aiohttp==3.8.5
aiohttp-retry==2.8.3
aiosignal==1.3.1
amqp==5.1.1
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
argcomplete==3.1.1
async-timeout==4.0.2
asyncssh==2.13.2
atpublic==4.0
attrs==23.1.0
azure-core==1.28.0
azure-datalake-store==0.0.53
azure-identity==1.13.0
azure-storage-blob==12.17.0
backports.zoneinfo==0.2.1
billiard==4.1.0
celery==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
click==8.1.6
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
cloudpickle==2.2.1
colorama==0.4.6
configobj==5.0.8
cryptography==41.0.2
dask==2023.5.0
dictdiffer==0.9.0
diskcache==5.6.1
distro==1.8.0
dpath==2.1.6
dulwich==0.21.5
dvc==3.10.1
dvc-azure==2.22.0
dvc-data==2.8.1
dvc-http==2.30.2
dvc-objects==0.24.1
dvc-render==0.5.3
dvc-studio-client==0.11.0
dvc-task==0.3.0
filelock==3.12.2
flatten-dict==0.4.2
flufl.lock==7.1.1
frozenlist==1.4.0
fsspec==2023.6.0
funcy==2.0
gitdb==4.0.10
GitPython==3.1.32
grandalf==0.8
hydra-core==1.3.2
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.0.0
isodate==0.6.1
iterative-telemetry==0.0.8
jmespath==1.0.1
knack==0.11.0
kombu==5.3.1
locket==1.0.0
markdown-it-py==3.0.0
mdurl==0.1.2
msal==1.23.0
msal-extensions==1.0.0
multidict==6.0.4
nanotime==0.5.2
networkx==3.1
numpy==1.24.4
omegaconf==2.3.0
orjson==3.9.2
packaging==23.1
pandas==2.0.3
partd==1.4.0
pathspec==0.11.2
platformdirs==3.10.0
portalocker==2.7.0
prompt-toolkit==3.0.39
psutil==5.9.5
pycparser==2.21
pydot==1.4.2
pygit2==1.12.2
Pygments==2.15.1
pygtrie==2.5.0
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
requests==2.31.0
rich==13.5.0
ruamel.yaml==0.17.32
ruamel.yaml.clib==0.2.7
scmrepo==1.1.0
shortuuid==1.0.11
shtab==1.6.4
six==1.16.0
smmap==5.0.0
sqltrie==0.7.0
tabulate==0.9.0
tomlkit==0.12.1
toolz==0.12.0
tqdm==4.65.0
typing-extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
vine==5.0.0
voluptuous==0.13.1
wcwidth==0.2.6
yarl==1.9.2
zc.lockfile==3.0.post1
zipp==3.16.2

Experiment 1: az copy vs DVC vs fsspec/adlfs

A CSV file (size = 914.71 MiB) is uploaded to the same Azure Blob container in 3 different ways (AzCopy, DVC, and fsspec/adlfs), and the upload time is measured.

Library   Duration
AzCopy    18s
DVC       101s
adlfs     60.66s

Code to upload with AzCopy

./azcopy copy "/path/to/local/file" "<SAS token to azure blob container>"

Code to upload with DVC

dvc init
dvc remote add -d myremote azure://dvc/
dvc remote modify --local myremote connection_string 'BlobEndpoint=h...'
dvc add /path/to/file
dvc push

Code to upload with fsspec/adlfs

import time
import dask.dataframe as dd

storage_options={'connection_string': 'BlobEndpoint=https://...'}

start_time = time.time()
ddf = dd.read_csv('/path/to/large.csv')
ddf.to_csv('abfs://dvc/large.csv', storage_options=storage_options)
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))
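
For comparison, here is a minimal sketch of uploading the same file through adlfs directly (without going through dask); the connection string and container path are placeholders, not the exact values used:

import time
from adlfs import AzureBlobFileSystem

# adlfs is the fsspec implementation DVC uses under the hood for azure:// remotes
fs = AzureBlobFileSystem(connection_string='BlobEndpoint=https://...')

start_time = time.time()
fs.put('/path/to/large.csv', 'dvc/large.csv')
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))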

Experiment 2: az copy vs DVC vs azure-sdk-for-python

A WAV file (size = 659.18 MiB) is uploaded to the same Azure Blob container in 3 different ways (AzCopy, DVC, and azure-sdk-for-python), and the upload time is measured.

Library     Duration
AzCopy      14s
DVC         83.36s
azure-sdk   72.69s

Code to upload with azure-sdk-for-python

import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'

# Set up service client
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

start_time = time.time()
blob_client.upload_blob(open('file_1.wav', 'rb').read())
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))

@mitjaalge (Author)

I did an additional test with azure-sdk-for-python where I set the max_concurrency argument of blob_client.upload_blob() to various values. This significantly increases the upload speed and can result in speeds similar to or better than AzCopy:

azure-sdk-for-python upload speed:

File size = 659.18 MiB

max_concurrency   Duration [s]
1                 70.3
2                 29.9
3                 16.5
4                 13.5
5                 12.0
6                 12.2
7                 12.1
8                 12.7
9                 12.6

I have not found a similar argument/option for fsspec/adlfs yet.
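
For reference, here is a minimal sketch of how such a sweep could look with azure-storage-blob (the connection string, container, and file names are placeholders; this is not necessarily the exact script used):

import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

for max_concurrency in range(1, 10):
    with open('file_1.wav', 'rb') as f:
        start_time = time.time()
        # max_concurrency controls how many chunks the SDK uploads in parallel
        blob_client.upload_blob(f, overwrite=True, max_concurrency=max_concurrency)
        duration = time.time() - start_time
    print('max_concurrency={}: {:.1f}s'.format(max_concurrency, duration))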

@shcheklein (Member)

Thanks for the research, @mitjaalge.

I think this is where adlfs runs it:

https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L1617-L1623

I also don't see it using any additional configs. kwargs is passed through, though, which means we could potentially add a per-file-concurrency option. I think we have the same issue with S3 (?).

@pmrowla @efiop I would assume in this case we would pass it to all filesystems (just some of them could ignore it). Where would that live in DVC (which of the fs interfaces)? Also, I wonder if we should make it hardcoded / the default, and how it should co-exist with -j? Maybe use the same single concurrency setting but dynamically change the number of jobs per file that we pass if we see that fewer than -j files are left to upload; would that be possible?

@pmrowla (Contributor) commented Aug 1, 2023

It belongs in adlfs (like how s3fs uses multipart uploads).

DVC -j is equivalent to fsspec's batch_size now. If a filesystem implementation supports concurrent chunked uploads, it's up to the filesystem to decide how to divide the concurrent operations: whether that's batch_size entire-file uploads at once, or one file at a time with batch_size concurrent chunks for that file.

There's an old issue in adlfs where they started testing max_concurrency values, but the user who requested the feature never followed up on it (fsspec/adlfs#268). This probably just requires us to send an actual PR for the issue.

@shcheklein (Member)

That sounds about right (and no matter what, it will need to support max_concurrency anyway). Could you point me to the place in s3fs that does this? Does it try to balance those two parallelism models (intra- and inter-file)?

There's an old issue in adlfs where they started testing max_concurrency values but the user who requested the feature never followed up on it fsspec/adlfs#268. This probably just requires us to send an actual PR for the issue

👍

@pmrowla (Contributor) commented Aug 2, 2023

It looks like s3fs doesn't actually try to balance the two; it just schedules the entire multipart upload for each file (and only throttles batching at the whole-file level):

https://github.com/fsspec/s3fs/blob/214d3bb385e01715e75f8fd007cbf6599e655461/s3fs/core.py#L1651-L1668

Since the chunked uploading is handled in the Azure SDK (and we can't control how it schedules the individual chunk tasks), I think we can probably get away with doing something like batch_size // 8 for upload/download blob operations in adlfs (this gives 16 in the default case where users aren't setting the fsspec batch_size or dvc --jobs themselves).

So in adlfs code it would be:

from fsspec.asyn import _get_batch_size

await bc.upload_blob(max_concurrency=max(1, _get_batch_size() // 8))
await bc.download_blob(max_concurrency=max(1, _get_batch_size() // 8))

(everywhere they use upload/download_blob)

for reference, the azure chunking is done here for uploads (the chunked download behavior is the same):

https://github.com/Azure/azure-sdk-for-python/blob/e4f6a5d5fcd98af810435f618dcf9799530d7415/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads_async.py#L86-L94

@shcheklein (Member)

Thanks @pmrowla. One more question: do we override the batch size by default when --jobs is set? Also, where is the default batch_size set?

So in adlfs code it would be:

is it pretty much a complete fix for this? :) (except an optional way to control it, etc)

@pmrowla (Contributor) commented Aug 2, 2023

The default is set here: https://github.com/fsspec/filesystem_spec/blob/166f462180e876308eae5cc753a65c695138ae56/fsspec/asyn.py#L172

The fsspec functions basically work by doing

def func(batch_size=None):
    batch_size = batch_size or _get_batch_size()

where _get_batch_size() usually uses the default constants but in certain cases will check available system resources

In DVC we just call the fsspec methods with func(batch_size=dvc_jobs)
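
For illustration (a hedged sketch, not the actual DVC call site), passing an explicit batch_size to an async fsspec filesystem such as adlfs looks roughly like this; the connection string and paths are placeholders:

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(connection_string='BlobEndpoint=https://...')

# an explicit batch_size overrides fsspec's _get_batch_size() default;
# DVC forwards its --jobs value in the same way
fs.put('data/', 'dvc/data/', recursive=True, batch_size=8)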

Given that s3fs doesn't check batch_size at all at the multipart/chunked upload/download level (and just schedules the entire set of chunks with asyncio immediately), adlfs probably does not need to either. But since adlfs requires specifying some max_concurrency, the default of 128 is probably too aggressive (in terms of sending too many tasks to asyncio at once); 16 should be reasonable.

If we really wanted to just match s3fs behavior, we can dig more into the azure sdk to see what chunk size they use and then just set max_concurrency=number_of_chunks for a given file, but I don't think it's necessary here.

is it pretty much a complete fix for this? :) (except an optional way to control it, etc)

Yeah, it should be, unless maybe the adlfs maintainers are aware of some other Azure concurrency-related issue here that I missed.

@dberenbaum

@shcheklein Do you plan to try to find time for this? Sounds great, just want to make sure we actually have a plan to implement it.

@pmrowla (Contributor) commented Aug 3, 2023

for reference, the default azcopy concurrency settings are documented here: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-optimize#increase-the-number-of-concurrent-requests

You can increase throughput by setting the AZCOPY_CONCURRENCY_VALUE environment variable. This variable specifies the number of concurrent requests that can occur.

If your computer has fewer than 5 CPUs, then the value of this variable is set to 32. Otherwise, the default value is equal to 16 multiplied by the number of CPUs. The maximum default value of this variable is 300, but you can manually set this value higher or lower.

In our case, max_concurrency=16 is lower than their default but we are also adding additional concurrency since we have separate batching at the file level (so it would really end up being the file batch_size * 16)

Since the default batch size in DVC/fsspec is 128, we probably end up with similar performance to azcopy for dir transfers (since we do concurrent file transfers), but we are slower for single file transfers (since right now we end up with a concurrency of 1).
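
To put those numbers side by side, a rough back-of-the-envelope sketch using only the figures quoted above (the CPU count is an illustrative assumption):

# azcopy default: 16 * CPUs, capped at 300 (or 32 if fewer than 5 CPUs)
cpus = 8  # illustrative
azcopy_concurrency = 32 if cpus < 5 else min(300, 16 * cpus)  # 128

# DVC/fsspec default file-level batch size and the proposed adlfs per-file value
batch_size = 128
max_concurrency = 16  # batch_size // 8

dir_push = batch_size * max_concurrency  # many files: chunk uploads in flight overall
single_file = 1 * max_concurrency        # one large file, after the patch

print(azcopy_concurrency, dir_push, single_file)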

@pmrowla self-assigned this on Aug 3, 2023
@pmrowla (Contributor) commented Aug 3, 2023

upstream PR fsspec/adlfs#420

@mitjaalge can you try installing the adlfs patch in your DVC env and then see if you get improved performance?

pip install 'adlfs @ git+https://github.com/fsspec/adlfs.git@refs/pull/420/head'

After the patch, the concurrency behavior is still naive in that fsspec and DVC will either batch multi-file transfers at the file level or single file transfers at the chunk level, but not both in the same operation.

So for directory push/pull, when the number of remaining files is less than --jobs/batch_size, we could still improve performance in the future. But in the meantime, the patch should fix the current worst-case scenario (which was pushing or pulling a single large file).

@pmrowla (Contributor) commented Aug 3, 2023

On my machine (with a residential internet connection), for a 1.5 GB file I get comparable performance between azcopy and DVC with the patch:

azcopy:

az storage azcopy blob upload -c peter-test-unversioned -s  -d   17.19s user 22.60s system 32% cpu 2:00.97 total

dvc main:

dvc push -v  6.82s user 5.87s system 8% cpu 2:22.52 total

dvc with the patch:

dvc push -v  8.64s user 6.91s system 13% cpu 1:57.24 total

@mitjaalge (Author) commented Aug 3, 2023

@pmrowla:
Your patch works very well for me, thank you!
The behaviour with regard to the number of jobs (-j) is not exactly what I expected, but maybe you can make sense of it.

The following table shows the upload time for N files in a folder before and after the patch. Every file has a size of 691 MB. My internet speed is 482.51 Mbit/s down and 447.74 Mbit/s up.

N files   N jobs   Patch (420)   Time
1         1        No            1min 37s
1         3        No            1min 11s
3         1        No            3min 37s
3         3        No            2min 53s
3         9        No            3min 1s
1         1        Yes           13s
1         3        Yes           13s
3         1        Yes           35s
3         3        Yes           35s
3         9        Yes           34s

BTW, your pip command didn't work for me; I had to use the following (in case someone else wants to test it too):

pip install git+https://github.com/fsspec/adlfs.git@refs/pull/420/head

@pmrowla (Contributor) commented Aug 3, 2023

The jobs flag isn't applied for single-file transfers with that patch; that will require changes in DVC, not adlfs. So your results are what I would expect for now. (When the number of files is 1, it always uses the default adlfs max_concurrency regardless of --jobs.)

@mitjaalge (Author)

Should it not have given a bigger speedup for multiple files without the patch? 3 files with 1 job is nearly the same as 3 files with 3 jobs.

@pmrowla (Contributor) commented Aug 3, 2023

Ah sorry, with the patch, max_concurrency is still applied to file chunks even when you use --jobs. So what you get is basically --jobs * max_concurrency

@mitjaalge (Author)

Ok, makes sense.
Thank you for implementing the patch so quickly!

@pmrowla (Contributor) commented Aug 9, 2023

This fix is now released in adlfs and dvc-objects (so --jobs now works as expected for both single- and multi-file transfers).

The fix can be installed with

pip install -U adlfs dvc-objects

(it will also be available in the next DVC binary package release)
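
A quick way to confirm which versions ended up installed in the environment (any version check works; this is just one option):

from importlib.metadata import version

print(version('adlfs'), version('dvc-objects'))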
