s3 performance is slow #23

Closed
tyommik opened this issue Apr 25, 2021 · 18 comments

tyommik commented Apr 25, 2021

Bug Report

s3 performance is slow

Description

This ticket is based on a topic in the Discord “need-help” channel.
I tried to use DVC with MinIO (S3-compatible storage) and noticed that transfer performance is very slow.

My environment:

  • MinIO storage is located on an SSD (Intel Optane, high performance)
  • MinIO server <------ 1000 Mbit/s --------> DVC (S3 client)
  • Bucket: 40 GB, 410k files (each file <= 400 KB)
DVC version: 2.0.17 (deb)
---------------------------------
Platform: Python 3.8.8 on Linux-5.4.0-70-generic-x86_64-with-glibc2.4
Supports: All remotes
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/sda
Repo: dvc, git

When I ran dvc pull -j 20, the maximum speed was 80 Mbit/s, but the average was about 40. Downloading the whole bucket took 220 minutes.

What else I tried:
dvc pull -j 80: no improvement.
awscli: the aws tool reached at most 160 Mbit/s download speed. I tried different settings, but I couldn't exceed that limit.
s4cmd: about 130 Mbit/s at most, 64 minutes to get the bucket.
s5cmd (written in Go): up to 960 Mbit/s and less than 10 minutes to download the whole bucket.

So storage performance is fine, but the Python-based tools cannot reach the maximum download speed.
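
As a quick sanity check, a minimal sketch like the one below can time a raw s3fs bulk download against the same MinIO bucket, to separate the Python S3 layer from DVC itself (the endpoint URL, bucket name, and credentials are placeholders, not taken from this issue):

# Hypothetical benchmark: time a raw s3fs download of a bucket prefix.
# Endpoint, bucket, and credentials are placeholders.
import time
import s3fs

fs = s3fs.S3FileSystem(
    key="ACCESS_KEY",       # placeholder
    secret="SECRET_KEY",    # placeholder
    client_kwargs={"endpoint_url": "http://minio.local:9000"},  # placeholder
)

start = time.monotonic()
fs.get("my-bucket/data/", "local-copy/", recursive=True)  # recursive bulk download
print(f"downloaded in {time.monotonic() - start:.1f}s")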

Reproduce

Profiler stat: https://disk.yandex.ru/d/XNajwHgWYlPSHA


pared commented Apr 25, 2021

Related iterative/dvc#5683

efiop transferred this issue from iterative/dvc on Jan 1, 2023

diehl commented Feb 21, 2023

FYI I'm seeing very slow transfer speeds to S3 when doing a dvc push. In comparison, when doing an aws s3 cp from the same machine, I'm seeing throughput that is about 10-100x faster.

Here's the output from dvc doctor on my machine.

DVC version: 2.45.0 (brew)
--------------------------
Platform: Python 3.11.2 on macOS-12.6.3-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.40.1
	dvc_objects = 0.19.3
	dvc_render = 0.1.2
	dvc_task = 0.1.11
	dvclive = 2.0.2
	scmrepo = 0.1.9
Supports:
	azure (adlfs = 2023.1.0, knack = 0.10.1, azure-identity = 1.12.0),
	gdrive (pydrive2 = 1.15.0),
	gs (gcsfs = 2023.1.0),
	http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
	oss (ossfs = 2021.8.0),
	s3 (s3fs = 2023.1.0, boto3 = 1.24.59),
	ssh (sshfs = 2023.1.0),
	webdav (webdav4 = 0.9.8),
	webdavs (webdav4 = 0.9.8),
	webhdfs (fsspec = 2023.1.0)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: s3
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git


efiop commented Jul 15, 2023

@diehl Are you also using minio?


diehl commented Jul 15, 2023

@efiop I'm not. I don't know what minio is.


efiop commented Jul 15, 2023

@diehl So real aws s3 then, right?


diehl commented Jul 15, 2023

@efiop that is correct. using the AWS CLI.


efiop commented Aug 30, 2023

@diehl Btw, what does the data that you have look like? Whole directories tracked by dvc? How many files are there typically, and roughly what size? IIRC you were talking about GeoJSON before, so I suppose thousands of ~100 MB files?


efiop commented Aug 30, 2023

Chatted with @diehl privately. For the record, there are two directories: one is 5 GB total with 224 files, and the other is 14 GB total with 63 files. They contain miscellaneous files, with the biggest one in the second dataset being 5 GB. So we likely need to look at the scenario of transferring individual large files.

Need to reproduce locally and see if there is anything on the surface.


pmrowla commented Aug 31, 2023

This is probably an s3fs issue similar to the adlfs one we had recently.

s3fs _put_file()/_get_file() do support chunked (multipart) uploads/downloads, but the chunks within a single file are always transferred sequentially rather than in parallel. s3fs should probably use asyncio.gather() or fsspec's run_coros_in_chunks() to batch the chunk transfers and then reassemble them at the end as needed. I'm guessing this hasn't been addressed because doing the chunk transfers in parallel requires balancing those batched tasks with the file-level batching.
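
To illustrate the idea only (this is not the actual s3fs code): a chunked download can issue its ranged GETs concurrently and then reassemble the parts in order. The sketch below uses plain asyncio.gather() with a hypothetical fetch_range coroutine standing in for the per-chunk S3 request:

# Illustrative sketch of parallel chunk transfers; fetch_range is a hypothetical
# stand-in for a ranged S3 GET (e.g. via aiobotocore), not an s3fs API.
import asyncio

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB per chunk

async def fetch_range(url: str, start: int, end: int) -> bytes:
    """Pretend to fetch bytes [start, end) of the object at url."""
    await asyncio.sleep(0)  # placeholder for the real network call
    return b"\0" * (end - start)

async def download(url: str, size: int) -> bytes:
    # One coroutine per chunk; gather() runs them concurrently instead of sequentially.
    tasks = [
        fetch_range(url, offset, min(offset + CHUNK_SIZE, size))
        for offset in range(0, size, CHUNK_SIZE)
    ]
    parts = await asyncio.gather(*tasks)
    return b"".join(parts)  # gather() preserves order, so parts concatenate directly

print(len(asyncio.run(download("s3://bucket/key", 20 * 1024 * 1024))))

In practice the per-file chunk concurrency would need throttling (e.g. a semaphore or fsspec's batching helpers) so it does not multiply with the file-level concurrency mentioned above.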


efiop commented Aug 31, 2023

@pmrowla Do you think we should prioritize and solve it? What would be your estimate there?


pmrowla commented Sep 1, 2023

Handling it the same way we did with adlfs (where we only optimize the worst-case single-large-file upload/download scenario) should be relatively quick to do in s3fs. I should note, though, that it looks like they have tried concurrent multipart uploads before in some of the s3fs calls, but ran into problems that made them revert to sequential-only operation: https://github.com/fsspec/s3fs/blob/b1d98806952485be86379f0f4574ee4de24568a1/s3fs/core.py#L1768C31-L1769

dberenbaum commented

It looks like multipart upload requires a minimum part size of 5 MiB for all but the last part: https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html
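
As a quick illustration of that constraint (a sketch, not DVC or s3fs code): given a total size and a requested part size, every part except the last must be at least 5 MiB, so the part size gets clamped and only the final part may be smaller:

# Sketch: compute a multipart layout respecting S3's 5 MiB minimum part size
# (the minimum applies to every part except the last).
MIN_PART = 5 * 1024 * 1024  # 5 MiB

def part_sizes(total: int, part: int) -> list[int]:
    part = max(part, MIN_PART)   # clamp the requested part size to the S3 minimum
    sizes = []
    while total > part:
        sizes.append(part)
        total -= part
    sizes.append(total)          # only the last part may be smaller than 5 MiB
    return sizes

print(part_sizes(23 * 1024 * 1024, 5 * 1024 * 1024))  # four 5 MiB parts + one 3 MiB part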

Also, to set expectations, the azure approach currently will only help when doing dvc push on a single file, right @pmrowla?


pmrowla commented Sep 2, 2023

Also, to set expectations, the azure approach currently will only help when doing dvc push on a single file, right @pmrowla?

Yes, that's correct.


pmrowla commented Feb 12, 2024

Should be resolved by fsspec/s3fs#848

There are some raw s3fs numbers in the upstream PR, but for reference, here are results with DVC for a single file that takes aws s3 cp ~40 seconds to upload:

with s3fs main:

time dvc push -r s3-us
Collecting                                                                                                                                                                     |1.00 [00:00,  952entry/s]
Pushing
1 file pushed
dvc push -r s3-us  10.49s user 4.12s system 13% cpu 1:51.14 total

with s3fs PR and default --jobs (40 on my machine):

time dvc push -r s3-us
Collecting                                                                                                                                                                     |1.00 [00:00,  930entry/s]
Pushing
1 file pushed
dvc push -r s3-us  7.14s user 3.01s system 27% cpu 36.642 total


pmrowla commented Feb 28, 2024

This is merged upstream and will be available in the next s3fs release.

pmrowla closed this as completed on Feb 28, 2024

diehl commented Feb 28, 2024

@pmrowla do you have an ETA for the next s3fs release by chance?


pmrowla commented Feb 28, 2024

I don't have an ETA, but generally the fsspec/s3fs maintainers are fairly quick about doing releases for work we've contributed upstream.

cc @efiop


diehl commented Feb 28, 2024

Roger that - thanks @pmrowla
