
ZipArchive.Open will read a non-seekable stream into memory and fail with a bad error on large streams #59027

Open
Tracked by #62658
ayende opened this issue Sep 13, 2021 · 6 comments

ayende (Contributor) commented Sep 13, 2021

Description

NetworkStream stream = await GetStreamFromUrl(remoteUrlFor_LARGE_file); // placeholder helper returning a non-seekable stream
using var archive = new ZipArchive(stream, ZipArchiveMode.Read); // <-- will fail here with `Stream was too long.`

Analysis

The reason for this is here:

extraTempStream = stream = new MemoryStream();

When we pass a non-seekable stream to ZipArchive, it reads the entire stream into a MemoryStream. That only works if the data fits in a MemoryStream, whose backing buffer is capped at roughly 2 GB. I assume this is because the zip format requires seeking (the central directory is at the end of the file, etc.).

However, this is very surprising and can cause both performance issues, because all contents are loaded into memory, and unexpected failures whenever the data is larger than 2 GB.

I'm not sure if there is a way to fix this, given backward-compatibility issues.
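
As a stopgap for callers hitting this today, one workaround (a sketch, not from the issue; it assumes the whole archive fits on disk and that `stream` is the non-seekable stream from the snippet above) is to spool to a temporary file, which is seekable and not bound by MemoryStream's ~2 GB limit:

using System;
using System.IO;
using System.IO.Compression;

// Buffer to a temp file instead of memory.
// DeleteOnClose removes the file once the FileStream is disposed.
string tempPath = Path.GetTempFileName();
await using var temp = new FileStream(
    tempPath, FileMode.Create, FileAccess.ReadWrite, FileShare.None,
    bufferSize: 81920, FileOptions.DeleteOnClose | FileOptions.Asynchronous);
await stream.CopyToAsync(temp);
temp.Position = 0;

using var archive = new ZipArchive(temp, ZipArchiveMode.Read, leaveOpen: true);
foreach (ZipArchiveEntry entry in archive.Entries)
    Console.WriteLine($"{entry.FullName}: {entry.Length} bytes");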

ayende added the tenet-performance label on Sep 13, 2021
The dotnet-issue-labeler bot added the area-System.IO.Compression and untriaged labels on Sep 13, 2021

adamsitnik (Member) commented

@carlossanlop this is something that we should consider for the .NET 7 compression work, as it's related to supporting large files (> 2 GB).

adamsitnik removed the untriaged label and added this issue to the 7.0.0 milestone on Sep 13, 2021
jnm2 (Contributor) commented Sep 19, 2021

This is particularly bad when the stream being passed to the ZipArchive constructor is a 10-15 minute download and you want to be processing the contents as they become available.

Please consider ZipArchive.OpenAsync or zipArchive.GetEntriesAsync to avoid blocking on I/O, too.

The first thing ZipArchive currently does with a seekable stream is Seek(-18, SeekOrigin.End). I'm hoping that can be avoided for non-seekable streams. I'm rather worried that there's no help for it given that .zip files place the directory at the end of the stream. https://en.wikipedia.org/wiki/ZIP_(file_format)#Structure mentions that scanning for file entry headers isn't necessarily valid because a file may have been deleted from the directory. However, I would guess the vast majority of .zip files are built once and never deleted from, and thus would be packed with no unused space around the file entries.

// This seeks backwards almost to the beginning of the EOCD, one byte after where the signature would be
// located if the EOCD had the minimum possible size (no file zip comment)
_archiveStream.Seek(-ZipEndOfCentralDirectoryBlock.SizeOfBlockWithoutSignature, SeekOrigin.End);
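
For context, here is a rough sketch (mine, not the runtime's implementation) of what locating the EOCD record involves; it assumes .NET 7+ for Stream.ReadExactly. The record is at least 22 bytes, and because a zip comment of up to 65,535 bytes may follow it, a reader has to scan backwards for the signature:

using System;
using System.Buffers.Binary;
using System.IO;

// Sketch: find the end-of-central-directory (EOCD) record in a seekable stream.
static long FindEocd(Stream s)
{
    const uint EocdSignature = 0x06054b50; // "PK\x05\x06", little-endian
    const int MinEocdSize = 22;            // EOCD size with an empty zip comment
    long scanStart = Math.Max(0, s.Length - MinEocdSize - ushort.MaxValue);
    byte[] buffer = new byte[s.Length - scanStart];
    s.Position = scanStart;
    s.ReadExactly(buffer);
    // Scan backwards from the end; a pathological comment containing the
    // signature bytes could fool this simple sketch.
    for (int i = buffer.Length - MinEocdSize; i >= 0; i--)
        if (BinaryPrimitives.ReadUInt32LittleEndian(buffer.AsSpan(i)) == EocdSignature)
            return scanStart + i;
    return -1; // no EOCD: not a valid zip archive
}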

jnm2 (Contributor) commented Sep 20, 2021

https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT mentions streaming .zip files and says that every local file header must have an entry in the central directory. So that takes care of the deleted files concern and shows that streaming is a known use case.

4.3.2 Each file placed into a ZIP file MUST be preceded by a "local file header" record for that file. Each "local file header" MUST be accompanied by a corresponding "central directory header" record within the central directory section of the ZIP file.

4.3.5 File data MAY be followed by a "data descriptor" for the file. Data descriptors are used to facilitate ZIP file streaming.
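
To make the streaming path concrete, here is a minimal sketch (not from the thread) of parsing the fixed 30-byte local file header laid out in APPNOTE 4.3.7; the record type and helper name are mine, and it assumes .NET 7+ for Stream.ReadExactly and UTF-8 file names:

using System;
using System.Buffers.Binary;
using System.IO;
using System.Text;

record LocalFileHeader(ushort Flags, ushort Method, uint Crc32,
                       uint CompressedSize, uint UncompressedSize, string Name);

// Returns null when the next record is not a local file header
// (i.e. the central directory has been reached).
static LocalFileHeader? ReadLocalFileHeader(Stream s)
{
    Span<byte> h = stackalloc byte[30]; // fixed-size part of the header
    s.ReadExactly(h);
    if (BinaryPrimitives.ReadUInt32LittleEndian(h) != 0x04034b50)
        return null;
    ushort nameLen = BinaryPrimitives.ReadUInt16LittleEndian(h[26..]);
    ushort extraLen = BinaryPrimitives.ReadUInt16LittleEndian(h[28..]);
    byte[] name = new byte[nameLen];
    s.ReadExactly(name);
    s.ReadExactly(new byte[extraLen]); // skip the extra field
    // If bit 3 of Flags is set, CRC and sizes are zero here and a data
    // descriptor follows the file data instead (APPNOTE 4.3.5).
    return new LocalFileHeader(
        Flags: BinaryPrimitives.ReadUInt16LittleEndian(h[6..]),
        Method: BinaryPrimitives.ReadUInt16LittleEndian(h[8..]),
        Crc32: BinaryPrimitives.ReadUInt32LittleEndian(h[14..]),
        CompressedSize: BinaryPrimitives.ReadUInt32LittleEndian(h[18..]),
        UncompressedSize: BinaryPrimitives.ReadUInt32LittleEndian(h[22..]),
        Name: Encoding.UTF8.GetString(name)); // strictly CP437 unless flag bit 11 is set
}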

jnm2 (Contributor) commented Sep 21, 2021

To get myself unblocked, I created a proof of concept which successfully reads a streaming .zip file: https://gist.github.com/jnm2/31bdf08357a44c91d01736ad43b9c447

// StreamingZipReader is the proof-of-concept type from the gist linked above.
await using var reader = new StreamingZipReader(downloadStream);

while (await reader.MoveToNextEntryAsync(skipDirectories: true, CancellationToken.None))
{
    Console.WriteLine($"{reader.CurrentEntry.Name}: {reader.CurrentEntry.Length} bytes");

    using var stream = reader.GetCurrentEntryStream();
    using var testReader = new StreamReader(stream);
    var test = await testReader.ReadToEndAsync();
    // (my test download had only text files, and they all looked right!)
}

bjornharrtell (Contributor) commented

FWIW, with permission from @jnm2, I've published StreamingZipReader as a NuGet package (https://www.nuget.org/packages/StreamingZipReader) and recently fixed a bug with regard to ZIP64 support.
