Incorrect MultiPartUpload Chunksize #789
Comments
That is most unfortunate - AWS S3 doesn't have this limitation so long as each chunk is big enough. An S3File could in theory be configured to always send a specific size (in _upload_chunk), and retain the remainder in the buffer, but it would be annoying to code and only get used by niche backends that need it (perhaps only R2).
FWIW: Arrow made a fix for this: apache/arrow#34363
It would be nice, but the comment doesn't promise anything in the near term.
I had the same issue, and I added a debug line after https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2269 to print out the size of each chunk read from the buffer. Any idea why the read into data1 doesn't return the blocksize if it's not the last part?
In general, read() is not required to return all the bytes you request, but I don't see why an io.BytesIO would ever return fewer.
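For what it's worth, a quick standalone check (independent of s3fs) shows that io.BytesIO.read(n) only comes back short once the buffer is exhausted:

```python
import io

blocksize = 5 * 2**20                  # 5 MiB, the S3File default block size
buf = io.BytesIO(b"1" * (12 * 2**20))  # 12 MiB of in-memory data

print(len(buf.read(blocksize)))  # 5242880 - full block while data remains
print(len(buf.read(blocksize)))  # 5242880 - full block again
print(len(buf.read(blocksize)))  # 2097152 - only the final read is short
print(len(buf.read(blocksize)))  # 0       - buffer exhausted
```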
Could it be related to the cache? I tried different cache options but still got the same error.
Which cache? What do you mean?
https://github.com/fsspec/s3fs/blob/main/s3fs/core.py#L2020
You have linked to two class definitions? Both have a default block size of 5 * 2**20, but _upload_chunk refers consistently to self.blocksize (i.e., only one single value).
I'd be interested in helping to get this fixed, as we're running into this in production with llama_index.
@matthiaskern, all help is welcome.
Ran into this today as well, FWIW.
OK, so to summarise: currently, s3fs's file will flush its whole buffer, whatever its size, whenever a write() puts it over the block size. This is fine with AWS S3 (and minio and others), but R2 requires each non-trailing part to be the same size. The solution is to allow/require S3File to always push exactly one block size at a time, potentially making multiple remote writes at flush time and leaving some data in the buffer. Since the remote call happens in only one place, this shouldn't be hard to code up. Does someone want to take it on? By default, where variable part sizes are allowed, writes should not be split.
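To illustrate the strategy described above (not the actual S3File code), here is a minimal sketch of a buffer that only ever emits full, equally sized parts and carries the remainder over to the next flush; upload_part stands in for the remote UploadPart call and is a hypothetical callback:

```python
import io


class FixedPartBuffer:
    """Sketch of the proposed behaviour: flush parts of exactly `blocksize` bytes,
    keeping any remainder buffered until more data arrives or the file is closed."""

    def __init__(self, blocksize, upload_part):
        self.blocksize = blocksize      # every non-trailing part is exactly this size
        self.upload_part = upload_part  # hypothetical callback doing one remote UploadPart
        self.buffer = io.BytesIO()

    def write(self, data):
        self.buffer.write(data)
        if self.buffer.tell() >= self.blocksize:
            self._flush(final=False)

    def close(self):
        self._flush(final=True)

    def _flush(self, final):
        self.buffer.seek(0)
        remainder = b""
        while True:
            part = self.buffer.read(self.blocksize)
            if not part:
                break
            if len(part) < self.blocksize and not final:
                # Not enough data for a full part yet: keep it for the next flush.
                remainder = part
                break
            # A full part, or the short trailing part of the final flush;
            # this may run several times within a single flush.
            self.upload_part(part)
        self.buffer = io.BytesIO()
        self.buffer.write(remainder)
```

With this approach, one flush of a large buffer turns into several equally sized UploadPart calls, and the (possibly shorter) trailing part is only sent at close time.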
Description:
I have encountered an issue with the MultiPartUpload functionality in the s3fs library where the chunk size used during upload appears incorrect.
Environment Information:
Issue Details:
When using s3fs to upload large files to an S3-compatible bucket on Cloudflare R2, I noticed that R2 requires the chunk size used for MultiPartUpload to be consistent across parts, which s3fs does not guarantee.
Expected Behavior:
I expect s3fs to use the same chunk size for a file with the body (b"1" * 5 * 2**30) + b"kek" as for one with the body b"1" * 5 * 2**30.
Steps to Reproduce:
Working example:
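(The original snippet is not preserved in this thread; what follows is a minimal sketch of the kind of write that succeeds, using hypothetical R2 credentials, endpoint, and bucket name.)

```python
import s3fs

# Placeholder credentials, endpoint, and bucket - not from the original report.
fs = s3fs.S3FileSystem(
    key="<access_key_id>",
    secret="<secret_access_key>",
    client_kwargs={"endpoint_url": "https://<account_id>.r2.cloudflarestorage.com"},
)

# Exactly 5 GiB: an exact multiple of the 5 MiB block size, so every part comes out the same length.
body = b"1" * 5 * 2**30

with fs.open("my-bucket/big-object", "wb") as f:
    f.write(body)
```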
This code works perfectly.
But if we change the uploaded body to:
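(Again, the original snippet isn't preserved; presumably the only change is the body mentioned under Expected Behavior.)

```python
body = (b"1" * 5 * 2**30) + b"kek"  # 5 GiB plus 3 trailing bytes, no longer a multiple of the block size
```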
This code gives the error:
ClientError: An error occurred (InvalidPart) when calling the CompleteMultipartUpload operation: All non-trailing parts must have the same length.
Many thanks for considering my request.