
ZipArchive.Open will read a non-seekable stream into memory and fail with a bad error on large streams #59027

Open
Tracked by #62658
ayende opened this issue Sep 13, 2021 · 6 comments

ayende (Contributor) commented Sep 13, 2021

Description

NetworkStream stream = await GetStreamFromUrl(remoteUrlFor_LARGE_file); // placeholder helper returning a non-seekable stream
using var archive = new ZipArchive(stream, ZipArchiveMode.Read); // <-- will fail here with `Stream was too long.`

Analysis

The reason for this is here:

extraTempStream = stream = new MemoryStream();

When we pass a non-seekable stream to ZipArchive, it reads the entire stream into a MemoryStream. That only works if the data fits in a MemoryStream, whose backing buffer is capped at roughly 2 GB. I assume this is because the zip format requires seeking (the central directory is at the end of the file, etc.).

However, this is very surprising and can cause both performance issues, because all contents are loaded into memory, and unexpected failures whenever the data is larger than 2 GB.

I'm not sure if there is a way to fix this, given backward-compatibility issues.
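
As a stopgap for callers hitting this today, one workaround (a sketch, not from the issue; it assumes the whole archive fits on disk and that `stream` is the non-seekable stream from the snippet above) is to spool to a temporary file, which is seekable and not bound by MemoryStream's ~2 GB limit:

using System;
using System.IO;
using System.IO.Compression;

// Buffer to a temp file instead of memory.
// DeleteOnClose removes the file once the FileStream is disposed.
string tempPath = Path.GetTempFileName();
await using var temp = new FileStream(
    tempPath, FileMode.Create, FileAccess.ReadWrite, FileShare.None,
    bufferSize: 81920, FileOptions.DeleteOnClose | FileOptions.Asynchronous);
await stream.CopyToAsync(temp);
temp.Position = 0;

using var archive = new ZipArchive(temp, ZipArchiveMode.Read, leaveOpen: true);
foreach (ZipArchiveEntry entry in archive.Entries)
    Console.WriteLine($"{entry.FullName}: {entry.Length} bytes");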

ayende added the tenet-performance label on Sep 13, 2021
The dotnet-issue-labeler bot added the area-System.IO.Compression and untriaged labels on Sep 13, 2021

adamsitnik (Member) commented

@carlossanlop this is something that we should consider for the .NET 7 compression work, as it's related to supporting large files (> 2 GB).

adamsitnik removed the untriaged label and added this issue to the 7.0.0 milestone on Sep 13, 2021
jnm2 (Contributor) commented Sep 19, 2021

This is particularly bad when the stream being passed to the ZipArchive constructor is a 10-15 minute download and you want to be processing the contents as they become available.

Please consider ZipArchive.OpenAsync or zipArchive.GetEntriesAsync to avoid blocking on I/O, too.

The first thing ZipArchive currently does with a seekable stream is Seek(-18, SeekOrigin.End). I'm hoping that can be avoided for non-seekable streams. I'm rather worried that there's no help for it given that .zip files place the directory at the end of the stream. https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure mentions that scanning for file entry headers isn't necessarily valid because a file may have been deleted from the directory. However, I would guess the vast majority of .zip files are built once and never deleted from, and thus would be packed with no unused space around the file entries.

// This seeks backwards almost to the beginning of the EOCD, one byte after where the signature would be
// located if the EOCD had the minimum possible size (no file zip comment)
_archiveStream.Seek(-ZipEndOfCentralDirectoryBlock.SizeOfBlockWithoutSignature, SeekOrigin.End);
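
For context, here is a rough sketch (mine, not the runtime's implementation) of what locating the EOCD record involves; it assumes .NET 7+ for Stream.ReadExactly. The record is at least 22 bytes, and because a zip comment of up to 65,535 bytes may follow it, a reader has to scan backwards for the signature:

using System;
using System.Buffers.Binary;
using System.IO;

// Sketch: find the end-of-central-directory (EOCD) record in a seekable stream.
static long FindEocd(Stream s)
{
    const uint EocdSignature = 0x06054b50; // "PK\x05\x06", little-endian
    const int MinEocdSize = 22;            // EOCD size with an empty zip comment
    long scanStart = Math.Max(0, s.Length - MinEocdSize - ushort.MaxValue);
    byte[] buffer = new byte[s.Length - scanStart];
    s.Position = scanStart;
    s.ReadExactly(buffer);
    // Scan backwards from the end; a pathological comment containing the
    // signature bytes could fool this simple sketch.
    for (int i = buffer.Length - MinEocdSize; i >= 0; i--)
        if (BinaryPrimitives.ReadUInt32LittleEndian(buffer.AsSpan(i)) == EocdSignature)
            return scanStart + i;
    return -1; // no EOCD: not a valid zip archive
}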

jnm2 (Contributor) commented Sep 20, 2021

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT mentions streaming .zip files and says that every local file header must have an entry in the central directory. So that takes care of the deleted files concern and shows that streaming is a known use case.

4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.

4.3.5 File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
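
To make the streaming path concrete, here is a minimal sketch (not from the thread) of parsing the fixed 30-byte local file header laid out in APPNOTE 4.3.7; the record type and helper name are mine, and it assumes .NET 7+ for Stream.ReadExactly and UTF-8 file names:

using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

record LocalFileHeader(ushort Flags, ushort Method, uint Crc32,
                       uint CompressedSize, uint UncompressedSize, string Name);

// Returns null when the next record is not a local file header
// (i.e. the central directory has been reached).
static LocalFileHeader? ReadLocalFileHeader(Stream s)
{
    Span<byte> h = stackalloc byte[30]; // fixed-size part of the header
    s.ReadExactly(h);
    if (BinaryPrimitives.ReadUInt32LittleEndian(h) != 0x04034b50)
        return null;
    ushort nameLen = BinaryPrimitives.ReadUInt16LittleEndian(h[26..]);
    ushort extraLen = BinaryPrimitives.ReadUInt16LittleEndian(h[28..]);
    byte[] name = new byte[nameLen];
    s.ReadExactly(name);
    s.ReadExactly(new byte[extraLen]); // skip the extra field
    // If bit 3 of Flags is set, CRC and sizes are zero here and a data
    // descriptor follows the file data instead (APPNOTE 4.3.5).
    return new LocalFileHeader(
        Flags: BinaryPrimitives.ReadUInt16LittleEndian(h[6..]),
        Method: BinaryPrimitives.ReadUInt16LittleEndian(h[8..]),
        Crc32: BinaryPrimitives.ReadUInt32LittleEndian(h[14..]),
        CompressedSize: BinaryPrimitives.ReadUInt32LittleEndian(h[18..]),
        UncompressedSize: BinaryPrimitives.ReadUInt32LittleEndian(h[22..]),
        Name: Encoding.UTF8.GetString(name)); // strictly CP437 unless flag bit 11 is set
}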

jnm2 (Contributor) commented Sep 21, 2021

To get myself unblocked, I created a proof of concept which successfully reads a streaming .zip file: https://gist.github.com/jnm2/31bdf08357a44c91d01736ad43b9c447

// StreamingZipReader is the proof-of-concept type from the gist linked above.
await using var reader = new StreamingZipReader(downloadStream);

while (await reader.MoveToNextEntryAsync(skipDirectories: true, CancellationToken.None))
{
    Console.WriteLine($"{reader.CurrentEntry.Name}: {reader.CurrentEntry.Length} bytes");

    using var stream = reader.GetCurrentEntryStream();
    using var testReader = new StreamReader(stream);
    var test = await testReader.ReadToEndAsync();
    // (my test download had only text files, and they all looked right!)
}

bjornharrtell (Contributor) commented

FWIW, with permission from @jnm2, I've published StreamingZipReader as a NuGet package (https://www.nuget.org/packages/StreamingZipReader) and recently fixed a bug with regard to ZIP64 support.
