
dvc push to Azure Blob storage is very slow #54

Closed
mitjaalge opened this issue Jul 28, 2023 · 19 comments · Fixed by iterative/dvc-objects#218

@mitjaalge commented Jul 28, 2023

Bug Report

dvc push: very slow when pushing to an Azure Blob Storage remote

Description

When pushing larger files (~700 MB) to an Azure Blob Storage remote, I'm experiencing very slow speeds (3-4 min for a single file, i.e. ~4 MB/s). The same file takes around 10 s to upload (~70 MB/s) when using AzCopy (https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-blobs-upload).
Is this to be expected, or am I doing something wrong?

Reproduce

  1. dvc init
  2. dvc remote add -d myremote azure://dvc/
  3. dvc remote modify --local myremote connection_string 'BlobEndpoint=h...'
  4. add new large file (large_file.wav) to folder (~700 MB)
  5. dvc add large_file.wav
  6. dvc push

Expected

Fast upload speed ~ 70 MB/s.

Environment information

DVC Version: 3.1.0
Python Version: 3.8.10
OS: tried on MacOS Ventura 13.1 and Ubuntu 20.04.6 LTS

Thank you very much for the help!

@efiop transferred this issue from iterative/dvc on Jul 28, 2023
@shcheklein (Member)

Is this to be expected, or am I doing something wrong?

I don't think it's expected.

Could you please share some additional info: the output of dvc doctor and pip freeze?

Could you, as an experiment, also try using https://github.com/fsspec/adlfs directly? (I think we use it in DVC as the main library that does the actual data transfer.) If it can be reproduced with adlfs directly, it'll be easier to identify the reason and address it. Thanks for your help in advance.

@mitjaalge (Author) commented Jul 31, 2023

Hi @shcheklein, thanks for your quick response!
Here is the information you requested; it seems the problem does not lie with DVC but with fsspec/adlfs:

dvc doctor

DVC version: 3.10.1 (pip)
-------------------------
Platform: Python 3.8.10 on Linux-5.15.0-76-generic-x86_64-with-glibc2.29
Subprojects:
	dvc_data = 2.8.1
	dvc_objects = 0.24.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.1.0
Supports:
	azure (adlfs = 2023.4.0, knack = 0.11.0, azure-identity = 1.13.0),
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3)
Config:
	Global: /home/mitja/.config/dvc
	System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/vgubuntu-root
Caches: local
Remotes: azure
Workspace directory: ext4 on /dev/mapper/vgubuntu-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/9652ec2be57d605a3c6e5af8060f83de

pip freeze

adlfs==2023.4.0
aiohttp==3.8.5
aiohttp-retry==2.8.3
aiosignal==1.3.1
amqp==5.1.1
antlr4-python3-runtime==4.9.3
appdirs==1.4.4
argcomplete==3.1.1
async-timeout==4.0.2
asyncssh==2.13.2
atpublic==4.0
attrs==23.1.0
azure-core==1.28.0
azure-datalake-store==0.0.53
azure-identity==1.13.0
azure-storage-blob==12.17.0
backports.zoneinfo==0.2.1
billiard==4.1.0
celery==5.3.1
certifi==2023.7.22
cffi==1.15.1
charset-normalizer==3.2.0
click==8.1.6
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
cloudpickle==2.2.1
colorama==0.4.6
configobj==5.0.8
cryptography==41.0.2
dask==2023.5.0
dictdiffer==0.9.0
diskcache==5.6.1
distro==1.8.0
dpath==2.1.6
dulwich==0.21.5
dvc==3.10.1
dvc-azure==2.22.0
dvc-data==2.8.1
dvc-http==2.30.2
dvc-objects==0.24.1
dvc-render==0.5.3
dvc-studio-client==0.11.0
dvc-task==0.3.0
filelock==3.12.2
flatten-dict==0.4.2
flufl.lock==7.1.1
frozenlist==1.4.0
fsspec==2023.6.0
funcy==2.0
gitdb==4.0.10
GitPython==3.1.32
grandalf==0.8
hydra-core==1.3.2
idna==3.4
importlib-metadata==6.8.0
importlib-resources==6.0.0
isodate==0.6.1
iterative-telemetry==0.0.8
jmespath==1.0.1
knack==0.11.0
kombu==5.3.1
locket==1.0.0
markdown-it-py==3.0.0
mdurl==0.1.2
msal==1.23.0
msal-extensions==1.0.0
multidict==6.0.4
nanotime==0.5.2
networkx==3.1
numpy==1.24.4
omegaconf==2.3.0
orjson==3.9.2
packaging==23.1
pandas==2.0.3
partd==1.4.0
pathspec==0.11.2
platformdirs==3.10.0
portalocker==2.7.0
prompt-toolkit==3.0.39
psutil==5.9.5
pycparser==2.21
pydot==1.4.2
pygit2==1.12.2
Pygments==2.15.1
pygtrie==2.5.0
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
pytz==2023.3
PyYAML==6.0.1
requests==2.31.0
rich==13.5.0
ruamel.yaml==0.17.32
ruamel.yaml.clib==0.2.7
scmrepo==1.1.0
shortuuid==1.0.11
shtab==1.6.4
six==1.16.0
smmap==5.0.0
sqltrie==0.7.0
tabulate==0.9.0
tomlkit==0.12.1
toolz==0.12.0
tqdm==4.65.0
typing-extensions==4.7.1
tzdata==2023.3
urllib3==2.0.4
vine==5.0.0
voluptuous==0.13.1
wcwidth==0.2.6
yarl==1.9.2
zc.lockfile==3.0.post1
zipp==3.16.2

Experiment 1: az copy vs DVC vs fsspec/adlfs

A CSV file (size = 914.71 MiB) is uploaded to the same Azure Blob container in 3 different ways (AzCopy, DVC, and fsspec/adlfs), and the upload time is measured.

Library   Duration
AzCopy    18s
DVC       101s
adlfs     60.66s

Code to upload with AzCopy

./azcopy copy "/path/to/local/file" "<SAS token to azure blob container>"

Code to upload with DVC

dvc init
dvc remote add -d myremote azure://dvc/
dvc remote modify --local myremote connection_string 'BlobEndpoint=h...'
dvc add /path/to/file
dvc push

Code to upload with fsspec/adlfs

import time
import dask.dataframe as dd

storage_options={'connection_string': 'BlobEndpoint=https://...'}

start_time = time.time()
ddf = dd.read_csv('/path/to/large.csv')
ddf.to_csv('abfs://dvc/large.csv', storage_options=storage_options)
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))
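
For comparison, here is a minimal sketch of uploading the same file through adlfs directly (without going through dask); the connection string and container path are placeholders, not the exact values used:

import time
from adlfs import AzureBlobFileSystem

# adlfs is the fsspec implementation DVC uses under the hood for azure:// remotes
fs = AzureBlobFileSystem(connection_string='BlobEndpoint=https://...')

start_time = time.time()
fs.put('/path/to/large.csv', 'dvc/large.csv')
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))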

Experiment 2: az copy vs DVC vs azure-sdk-for-python

A WAV file (size = 659.18 MiB) is uploaded to the same Azure Blob container in 3 different ways (AzCopy, DVC, and azure-sdk-for-python), and the upload time is measured.

Library     Duration
AzCopy      14s
DVC         83.36s
azure-sdk   72.69s

Code to upload with azure-sdk-for-python

import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'

# Set up service client
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

start_time = time.time()
blob_client.upload_blob(open('file_1.wav', 'rb').read())
end_time = time.time()

print('Upload time: {} seconds'.format(end_time - start_time))

@mitjaalge (Author)

I did an additional test with azure-sdk-for-python where I set the max_concurrency argument of blob_client.upload_blob() to various values. This significantly increases the upload speed and can result in speeds similar to or better than AzCopy:

azure-sdk-for-python upload speed:

File size = 659.18 MiB

max_concurrency   Duration [s]
1                 70.3
2                 29.9
3                 16.5
4                 13.5
5                 12.0
6                 12.2
7                 12.1
8                 12.7
9                 12.6

I have not found a similar argument/option for fsspec/adlfs yet.
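
For reference, here is a minimal sketch of how such a sweep could look with azure-storage-blob (the connection string, container, and file names are placeholders; this is not necessarily the exact script used):

import time
from azure.storage.blob import BlobServiceClient

connection_string = 'BlobEndpoint=https://...'
service_client = BlobServiceClient.from_connection_string(connection_string)
blob_client = service_client.get_blob_client(container='dvc', blob='file_1.wav')

for max_concurrency in range(1, 10):
    with open('file_1.wav', 'rb') as f:
        start_time = time.time()
        # max_concurrency controls how many chunks the SDK uploads in parallel
        blob_client.upload_blob(f, overwrite=True, max_concurrency=max_concurrency)
        duration = time.time() - start_time
    print('max_concurrency={}: {:.1f}s'.format(max_concurrency, duration))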

@shcheklein (Member)

Thanks for the research, @mitjaalge.

I think this is where adlfs runs it:

https://github.com/fsspec/adlfs/blob/main/adlfs/spec.py#L1617-L1623

I also don't see it using any additional configs. kwargs is passed through, though, which means we could potentially add a per-file-concurrency option. I think we have the same issue with S3 (?).

@pmrowla @efiop I would assume in this case we would pass it to all filesystems (just some of them could ignore it). Where would that live in DVC (which of the fs interfaces)? Also, I wonder if we should make it hardcoded / the default, and how it should co-exist with -j? Maybe use the same single concurrency setting but dynamically change the number of jobs per file that we pass if we see that fewer than -j files are left to upload; would that be possible?

@pmrowla (Contributor) commented Aug 1, 2023

It belongs in adlfs (like how s3fs uses multipart uploads).

DVC -j is equivalent to fsspec's batch_size now. If a filesystem implementation supports concurrent chunked uploads, it's up to the filesystem to decide how to divide the concurrent operations: whether that's batch_size entire-file uploads at once, or one file at a time with batch_size concurrent chunks for that file.

There's an old issue in adlfs where they started testing max_concurrency values, but the user who requested the feature never followed up on it (fsspec/adlfs#268). This probably just requires us to send an actual PR for the issue.

@shcheklein (Member)

That sounds about right (and no matter what, it will need to support max_concurrency anyway). Could you point me to the place in s3fs that does this? Does it try to balance those two parallelism models (intra- and inter-file)?

There's an old issue in adlfs where they started testing max_concurrency values but the user who requested the feature never followed up on it fsspec/adlfs#268. This probably just requires us to send an actual PR for the issue

👍

@pmrowla (Contributor) commented Aug 2, 2023

It looks like s3fs doesn't actually try to balance the two; it just schedules the entire multipart upload for each file (and only throttles batching at the whole-file level):

https://github.com/fsspec/s3fs/blob/214d3bb385e01715e75f8fd007cbf6599e655461/s3fs/core.py#L1651-L1668

Since the chunked uploading is handled in the Azure SDK (and we can't control how it schedules the individual chunk tasks), I think we can probably get away with doing something like batch_size // 8 for upload/download blob operations in adlfs (this gives 16 in the default case where users aren't setting the fsspec batch_size or dvc --jobs themselves).

So in adlfs code it would be:

from fsspec.asyn import _get_batch_size

await bc.upload_blob(max_concurrency=max(1, _get_batch_size() // 8))
await bc.download_blob(max_concurrency=max(1, _get_batch_size() // 8))

(everywhere they use upload/download_blob)

for reference, the azure chunking is done here for uploads (the chunked download behavior is the same):

https://github.com/Azure/azure-sdk-for-python/blob/e4f6a5d5fcd98af810435f618dcf9799530d7415/sdk/storage/azure-storage-blob/azure/storage/blob/_shared/uploads_async.py#L86-L94

@shcheklein (Member)

Thanks @pmrowla. One more question: do we override the batch size by default when --jobs is set? Also, where is the default batch_size set?

So in adlfs code it would be:

is it pretty much a complete fix for this? :) (except an optional way to control it, etc)

@pmrowla (Contributor) commented Aug 2, 2023

The default is set here: https://github.com/fsspec/filesystem_spec/blob/166f462180e876308eae5cc753a65c695138ae56/fsspec/asyn.py#L172

The fsspec functions basically work by doing

def func(batch_size=None):
    batch_size = batch_size or _get_batch_size()

where _get_batch_size() usually uses the default constants but in certain cases will check available system resources

In DVC we just call the fsspec methods with func(batch_size=dvc_jobs)
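
For illustration (a hedged sketch, not the actual DVC call site), passing an explicit batch_size to an async fsspec filesystem such as adlfs looks roughly like this; the connection string and paths are placeholders:

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(connection_string='BlobEndpoint=https://...')

# an explicit batch_size overrides fsspec's _get_batch_size() default;
# DVC forwards its --jobs value in the same way
fs.put('data/', 'dvc/data/', recursive=True, batch_size=8)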

Given that s3fs doesn't check batch_size at all at the multipart/chunked upload/download level (and just schedules the entire set of chunks with asyncio immediately), adlfs probably does not need to either. But since adlfs requires specifying some max_concurrency, the default of 128 is probably too aggressive (in terms of sending too many tasks to asyncio at once); 16 should be reasonable.

If we really wanted to just match s3fs behavior, we can dig more into the azure sdk to see what chunk size they use and then just set max_concurrency=number_of_chunks for a given file, but I don't think it's necessary here.

is it pretty much a complete fix for this? :) (except an optional way to control it, etc)

Yeah, it should be, unless maybe the adlfs maintainers are aware of some other Azure concurrency-related issue here that I missed.

@dberenbaum

@shcheklein Do you plan to try to find time for this? Sounds great, just want to make sure we actually have a plan to implement it.

@pmrowla (Contributor) commented Aug 3, 2023

for reference, the default azcopy concurrency settings are documented here: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-optimize#increase-the-number-of-concurrent-requests

You can increase throughput by setting the AZCOPY_CONCURRENCY_VALUE environment variable. This variable specifies the number of concurrent requests that can occur.

If your computer has fewer than 5 CPUs, then the value of this variable is set to 32. Otherwise, the default value is equal to 16 multiplied by the number of CPUs. The maximum default value of this variable is 300, but you can manually set this value higher or lower.

In our case, max_concurrency=16 is lower than their default but we are also adding additional concurrency since we have separate batching at the file level (so it would really end up being the file batch_size * 16)

Since the default batch size in DVC/fsspec is 128, we probably end up with similar performance to azcopy for dir transfers (since we do concurrent file transfers), but we are slower for single file transfers (since right now we end up with a concurrency of 1).
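
To put those numbers side by side, a rough back-of-the-envelope sketch using only the figures quoted above (the CPU count is an illustrative assumption):

# azcopy default: 16 * CPUs, capped at 300 (or 32 if fewer than 5 CPUs)
cpus = 8  # illustrative
azcopy_concurrency = 32 if cpus < 5 else min(300, 16 * cpus)  # 128

# DVC/fsspec default file-level batch size and the proposed adlfs per-file value
batch_size = 128
max_concurrency = 16  # batch_size // 8

dir_push = batch_size * max_concurrency  # many files: chunk uploads in flight overall
single_file = 1 * max_concurrency        # one large file, after the patch

print(azcopy_concurrency, dir_push, single_file)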

@pmrowla self-assigned this on Aug 3, 2023
@pmrowla (Contributor) commented Aug 3, 2023

upstream PR fsspec/adlfs#420

@mitjaalge can you try installing the adlfs patch in your DVC env and then see if you get improved performance?

pip install 'adlfs @ git+https://github.com/fsspec/adlfs.git@refs/pull/420/head'

After the patch, the concurrency behavior is still naive in that fsspec and DVC will either batch multi-file transfers at the file level or single file transfers at the chunk level, but not both in the same operation.

So for directory push/pull, when the number of remaining files is less than --jobs/batch_size, we could still improve performance in the future. But in the meantime, the patch should fix the current worst-case scenario (which was pushing or pulling a single large file).

@pmrowla (Contributor) commented Aug 3, 2023

On my machine (with a residential internet connection), for a 1.5 GB file I get comparable performance between azcopy and DVC with the patch:

azcopy:

az storage azcopy blob upload -c peter-test-unversioned -s  -d   17.19s user 22.60s system 32% cpu 2:00.97 total

dvc main:

dvc push -v  6.82s user 5.87s system 8% cpu 2:22.52 total

dvc with the patch:

dvc push -v  8.64s user 6.91s system 13% cpu 1:57.24 total

@mitjaalge (Author) commented Aug 3, 2023

@pmrowla:
Your patch works very well for me, thank you!
The behaviour with regard to the number of jobs (-j) is not exactly what I expected, but maybe you can make sense of it.

The following table shows the upload time for N files in a folder before and after the patch. Every file has a size of 691 MB. My internet speed is 482.51 Mbit/s down and 447.74 Mbit/s up.

N files   N jobs   Patch (420)   Time
1         1        No            1min 37s
1         3        No            1min 11s
3         1        No            3min 37s
3         3        No            2min 53s
3         9        No            3min 1s
1         1        Yes           13s
1         3        Yes           13s
3         1        Yes           35s
3         3        Yes           35s
3         9        Yes           34s

BTW, your pip command didn't work for me; I had to use the following (in case someone else wants to test it too):

pip install git+https://github.com/fsspec/adlfs.git@refs/pull/420/head

@pmrowla (Contributor) commented Aug 3, 2023

The jobs flag isn't applied for single-file transfers with that patch; that will require changes in DVC, not adlfs. So your results are what I would expect for now. (When the number of files is 1, it always uses the default adlfs max_concurrency regardless of --jobs.)

@mitjaalge (Author)

Should it not have given a bigger speedup for multiple files without the patch? 3 files with 1 job is nearly the same as 3 files with 3 jobs.

@pmrowla (Contributor) commented Aug 3, 2023

Ah sorry, with the patch, max_concurrency is still applied to file chunks even when you use --jobs. So what you get is basically --jobs * max_concurrency

@mitjaalge (Author)

Ok, makes sense.
Thank you for implementing the patch so quickly!

@pmrowla (Contributor) commented Aug 9, 2023

This fix is now released in adlfs and dvc-objects (so --jobs now works as expected for both single- and multi-file transfers).

The fix can be installed with

pip install -U adlfs dvc-objects

(it will also be available in the next DVC binary package release)
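
A quick way to confirm which versions ended up installed in the environment (any version check works; this is just one option):

from importlib.metadata import version

print(version('adlfs'), version('dvc-objects'))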
