
Add support for zlib data format (RFC 1950) #2236

Closed
Ryder25 opened this issue Jan 27, 2020 · 32 comments · Fixed by #42717
Labels
api-approved API was approved in API review, it can be implemented area-System.IO.Compression help wanted [up-for-grabs] Good issue for external contributors

@Ryder25

Ryder25 commented Jan 27, 2020

Currently .NET doesn't support the zlib data format, despite the majority of the work having been done by DeflateStream. I propose that a new class be added, similar to GZipStream, to support RFC 1950.

Personally, I'm encountering the zlib format in two areas. Many games depend on zlib and save files in this format. We also use zlib in the firmware of one of our hardware products, where we compress saved debug data with it. For both of these scenarios we use C# tooling on the desktop to operate on these files, whether for reading, modifying, or writing.

Due to the lack of support in .NET, I either have to use lame hacky methods or rely on a separate third-party library for the functionality. This is a shame when .NET already provides the majority of what's needed, and it should be trivial (as far as I can tell) to include this as part of .NET.

While decompression is quite simple to do by skipping the 2-6 byte header and the 4-byte Adler-32 checksum at the end, I have trouble with compression. I wouldn't know what the proper header should be when using DeflateStream to do the compression, not to mention having to compute the Adler-32 checksum.

Here's a quick, untested example of what I currently have to do for decompression, to give you an idea of the "lame hacky-ness". (Note: I'm not checking the checksum here; I'm assuming the data is good.)

private byte[] ZlibDecompress(byte[] data)
{
    // FDICT is bit 5 of the FLG byte; if a preset dictionary is present
    // the header is 6 bytes (2 + 4-byte DICTID), otherwise 2 bytes.
    int headerSize = (data[1] & 0b100000) != 0 ? 6 : 2;

    // Skip the header and the 4-byte Adler-32 trailer, then inflate the
    // raw deflate payload in between.
    using (var ms = new MemoryStream(data, headerSize, data.Length - 4 - headerSize))
    using (var ds = new DeflateStream(ms, CompressionMode.Decompress))
    using (var msOut = new MemoryStream())
    {
        ds.CopyTo(msOut);
        return msOut.ToArray();
    }
}
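For the compression direction, a similarly hacky counterpart is possible today: write the fixed two-byte header 0x78 0x9C (deflate, 32K window; valid because (0x78 * 256 + 0x9C) % 31 == 0), let DeflateStream produce the raw deflate payload, and append a big-endian Adler-32 of the uncompressed data. A sketch only — the names are illustrative and this is not a vetted implementation:

```csharp
using System;
using System.IO;
using System.IO.Compression;

static class ZlibHack
{
    // Sketch: an RFC 1950 stream is a 2-byte header, the raw deflate
    // payload, then a big-endian Adler-32 of the uncompressed data.
    public static byte[] ZlibCompress(byte[] data)
    {
        using (var ms = new MemoryStream())
        {
            // 0x78 0x9C: CMF = deflate with a 32K window; FLG chosen so
            // that (CMF * 256 + FLG) % 31 == 0 with FLEVEL = default.
            ms.WriteByte(0x78);
            ms.WriteByte(0x9C);
            using (var ds = new DeflateStream(ms, CompressionLevel.Optimal, leaveOpen: true))
            {
                ds.Write(data, 0, data.Length);
            }
            uint adler = Adler32(data);
            ms.WriteByte((byte)(adler >> 24));
            ms.WriteByte((byte)(adler >> 16));
            ms.WriteByte((byte)(adler >> 8));
            ms.WriteByte((byte)adler);
            return ms.ToArray();
        }
    }

    // RFC 1950 Adler-32: two running 16-bit sums modulo 65521.
    public static uint Adler32(byte[] data)
    {
        const uint Mod = 65521;
        uint a = 1, b = 0;
        foreach (byte x in data)
        {
            a = (a + x) % Mod;
            b = (b + a) % Mod;
        }
        return (b << 16) | a;
    }
}
```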

Proposed API

public class ZLibStream : Stream
{
    public ZLibStream(Stream stream, CompressionLevel compressionLevel);
    public ZLibStream(Stream stream, CompressionMode mode);
    public ZLibStream(Stream stream, CompressionLevel compressionLevel, bool leaveOpen);
    public ZLibStream(Stream stream, CompressionMode mode, bool leaveOpen);
    public override bool CanWrite { get; }
    public override bool CanSeek { get; }
    public override bool CanRead { get; }
    public Stream BaseStream { get; }
    public override long Length { get; }
    public override long Position { get; set; }
    public override void CopyTo(Stream destination, int bufferSize);
    public override Task CopyToAsync(Stream destination, int bufferSize, CancellationToken cancellationToken);
    public override ValueTask DisposeAsync();
    public override void Flush();
    public override Task FlushAsync(CancellationToken cancellationToken);
    public override int Read(Span<byte> buffer);
    public override int Read(byte[] array, int offset, int count);
    public override ValueTask<int> ReadAsync(Memory<byte> buffer, CancellationToken cancellationToken = default);
    public override Task<int> ReadAsync(byte[] array, int offset, int count, CancellationToken cancellationToken);
    public override int ReadByte();
    public override long Seek(long offset, SeekOrigin origin);
    public override void SetLength(long value);
    public override void Write(byte[] array, int offset, int count);
    public override void Write(ReadOnlySpan<byte> buffer);
    public override Task WriteAsync(byte[] array, int offset, int count, CancellationToken cancellationToken);
    public override ValueTask WriteAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default);
    protected override void Dispose(bool disposing);
}
@JimBobSquarePants

If ZLibStream is added, it would be great to have access to the incremental 1-9 compression levels.

@kamronbatman
Contributor

kamronbatman commented May 6, 2020

I am forced to include a zlib wrapper project and build a local NuGet package to keep my project as portable as possible.

We support Windows Server 2016/2019, Ubuntu 16/18/20LTS, CentOS 8.1, and Debian 9/10 against .NET Core 3.1.

Due to high concurrency and throughput requirements, I can't use a fully managed solution or streams that have allocations. I am working on backend projects that utilize existing game client protocols, so it would be nice if I could reference the zlib native library directly, or if a native library nuget were exposed (like libuv).

Lastly, I don't have access to modify the client code so I can't include headers/footers, nor would I want to.

I was looking forward to zlib implementation since ~2007 for multiple projects, but it looks like it is still not usable for my use cases.

@stephentoub stephentoub added api-ready-for-review and removed untriaged New issue has not been triaged by the area owner labels Jun 24, 2020
@stephentoub stephentoub added this to the 5.0.0 milestone Jun 24, 2020
@danmoseley
Member

@stephentoub unless this is committed for the 5.0 release, we should mark Future. It can still happen for 5.0, but milestone 5.0 == "debt to pay off" to successfully release.

@stephentoub
Member

stephentoub commented Jul 2, 2020

We have a bug #38022. That needs to be fixed in some way for .NET 5. Implementing this is the right way to fix it. If we choose to punt on this, we can do the short term workaround there (which isn't a fix so much as disabling the functionality), but that's why I marked this as 5.0, to make a decision on the right thing to do. From my perspective, this whole thing can be done in a day.

@danmoseley
Member

Oh that's a good reason for the milestone then. 😸

@stephentoub
Member

For reference, other than tests, this is what the solution looks like:
stephentoub@8f73f68

@JimBobSquarePants

@kamronbatman I've been developing a managed solution that is much faster than any other managed implementation (sometimes faster than DeflateStream) and allocates the same as DeflateStream. Dunno if it's of any interest, as the planned implementation here does not expose many of the useful zlib options (levels + mode).

I'm not finished cleaning up yet but will release something soonish.
https://github.com/SixLabors/ZlibStream

@kamronbatman
Contributor

Thanks @JimBobSquarePants, I have been wrapping native zlib for my own purposes into a cross-platform compatible nuget. I'll definitely take a look at what you wrote and hopefully it can be used. One advantage of my use case is that the data is small, so streaming is not even needed.

@danmosemsft, I hear quite a bit about .NET 5 milestones, how does it work for .NET Core? Is it a separate implementation that is needed?

@danmoseley
Member

@danmosemsft, I hear quite a bit about .NET 5 milestones, how does it work for .NET Core? Is it a separate implementation that is needed?

.NET 5 is the next version of .NET Core (it's what you might otherwise have expected to be named .NET Core 4.0). More info here: https://devblogs.microsoft.com/dotnet/introducing-net-5/ ...

Milestones are described here: #38286

@carlossanlop carlossanlop modified the milestones: 5.0.0, Future Jul 6, 2020
@carlossanlop carlossanlop added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-ready-for-review labels Jul 6, 2020
@terrajobst terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Jul 16, 2020
@terrajobst
Member

terrajobst commented Jul 16, 2020

Video

  • Looks good as proposed.
namespace System.IO.Compression
{
    public class ZLibStream : Stream
    {
        public ZLibStream(Stream stream, CompressionLevel compressionLevel);
        public ZLibStream(Stream stream, CompressionMode mode);
        public ZLibStream(Stream stream, CompressionLevel compressionLevel, bool leaveOpen);
        public ZLibStream(Stream stream, CompressionMode mode, bool leaveOpen);
        public Stream BaseStream { get; }
    }
}
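Assuming the approved shape ships as proposed (the issue header says it was ultimately fixed by #42717), round-tripping mirrors the existing GZipStream pattern:

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

byte[] input = Encoding.UTF8.GetBytes("hello zlib");

// Compress: ZLibStream writes an RFC 1950 stream
// (2-byte header + deflate payload + Adler-32 trailer).
using var compressed = new MemoryStream();
using (var z = new ZLibStream(compressed, CompressionLevel.Optimal, leaveOpen: true))
{
    z.Write(input, 0, input.Length);
}

// Decompress.
compressed.Position = 0;
using var output = new MemoryStream();
using (var z = new ZLibStream(compressed, CompressionMode.Decompress))
{
    z.CopyTo(output);
}
// output.ToArray() round-trips back to the original input.
```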

@danmoseley
Member

@carlossanlop note that @stephentoub is on leave for a while (just in case you were assuming he'd complete his change above). Perhaps a community member is interested in taking it to PR.

@carlossanlop carlossanlop added the help wanted [up-for-grabs] Good issue for external contributors label Jul 16, 2020
@kamronbatman
Contributor

Question: will this still generate headers/footers? If so, will there be an option not to?

@JimBobSquarePants

Looks like I'll be carrying on with my own implementation. This misses too many features.

@danmoseley danmoseley modified the milestones: 5.0.0, Future Jul 21, 2020
@stephentoub
Member

This misses too many features.

@JimBobSquarePants, what features specifically? Just "levels + mode" as you cited earlier, or something more / deeper? Thanks.

@JimBobSquarePants

@stephentoub I'd consider the lack of compression levels and strategy to be significant missing features.

@stephentoub
Member

I'm not arguing about significance; rather, I'm trying to understand the scope of the key things you believe are significant. That is, is your primary concern the lack of a constructor like:

public ZLibStream(Stream stream, SomeType compressionLevel, SomeOtherType strategy);

or does it go beyond that?

@JimBobSquarePants

If constructors like the one in your example were available offering the full range of compression levels and strategy I would consider that sufficient for my use case.

However, I cannot speak for others who would like to utilize additional zlib stream functionality like flushing strategies.

@ptasev

ptasev commented Sep 23, 2020

Someone also requested the ability to read metadata in another thread, maybe it is related:

#16923 (comment)

@stephentoub
Member

Thanks. Since this issue has had its api approved and the additional surface area you're referring to then would be pure addition on top of that (e.g. new ctors, new flush methods, etc.), we can go ahead and implement this and then discuss additional support separately. Please open a separate issue with the additional APIs proposed you're hoping to see. Thanks!

@JimBobSquarePants

@stephentoub How would you suggest compression levels 0-9 be added? You'll now have two separate enumerations that represent sets of the same property?

I've watched the API review video and it's painfully obvious that the individuals doing the review have little understanding of zlib yet they identify that the CompressionLevel enum could be troublesome.

https://www.youtube.com/watch?v=7YpDyRMaDKE&t=1h17m28s

Why not implement the API properly from the start?

@stephentoub
Member

stephentoub commented Sep 25, 2020

How would you suggest compression levels 0-9 be added?

In a way that's applicable to DeflateStream and GzipStream and BrotliStream as well. The features being requested here cross the other streams that have a similar API, whether it be additional overloads of Flush that accept a strategy value, or a ctor that takes an integer value for a compression level, etc.

I understand you don't like it. But there's a lot to be said for consistency with the existing APIs that have worked well for many people for years. And then extending them all to account for the additional support desired in a consistent fashion.

You'll now have two separate enumerations that represent sets of the same property?

The enum you cite already exists, and needs to work with the new type. And if we added a new enum, it would need to work with the existing types, too.

@JimBobSquarePants

Yes, it already exists, and it's a poor representation of the available values in almost every instance it has been forced into, with different meanings for what Optimal actually represents.

For example:
Brotli = 11 (Maximum)
Zlib = 6 (Default middle ground)

There was some flipflopping on the implementation in these for Zlib.
dotnet/corefx#4589
dotnet/corefx#5458

Of course, the docs say both should represent the best possible compression.

The compression operation should be optimally compressed, even if the operation takes a longer time to complete.

https://docs.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?redirectedfrom=MSDN&view=netcore-3.1#fields

But now there's SmallestSize?
#41960

For Brotli, it appears you can only set the compression to a custom value using the separate BrotliEncoder struct, which adds an unexpected level of misdirection. Why that was chosen over allowing the passing of an int, or better yet an enum, I don't know.

I would imagine, for consistency, I would have to suggest something similar.

It's not just a case of me not liking something. The leaky abstraction that is DeflateStream and CompressionLevel is a mess and adding further implementations based upon it only makes things worse.

@stephentoub
Member

stephentoub commented Sep 25, 2020

the docs say both should represent the best possible compression

Thankfully, docs are mutable 😉 If they need to be improved, they can be. Please feel free to open issues and/or PRs in https://github.com/dotnet/dotnet-api-docs.

it's a poor representation of the available values

It's not meant to be a perfect representation of all available values. It's meant to be a high-level option that provides some expression of preference. You're right that it lacks full fidelity to a particular underlying implementation; that was never the goal. And from what I've seen, it's sufficient for the majority of developers / needs, while also being easier than having to understand passing an arbitrary number and knowing what it means.

For example: Brotli = 11 (Maximum) Zlib = 6 (Default middle ground)

And they can be revisited as needed. With SmallestSize being added, it's possible changing Optimal is the right answer for Brotli; I doubt there are significant dependencies on the exact behavior, but if there are, well, one of the nice things about .NET Core vs .NET Framework is Core is more accepting of that level of breaking change between major versions. Optimal is meant to mean a good all-around choice.

DeflateStream and CompressionLevel is a mess and adding further implementations based upon it only makes things worse.

That is your opinion; I respectfully disagree. From my perspective, these higher-level Stream APIs make it straightforward for someone to either simply not care at all about the level of compression employed and trust in the system to do something reasonable:

new BrotliStream(stream, CompressionMode.Compress);
new DeflateStream(stream, CompressionMode.Compress);
new GZipStream(stream, CompressionMode.Compress);
new ZLibStream(stream, CompressionMode.Compress);

or to exert a preference for speed vs size, with options on the scale from NoCompression to Fastest to Optimal (which may have been better named as Balanced or Default or something like that) to SmallestSize:

new BrotliStream(stream, CompressionLevel.NoCompression);
new DeflateStream(stream, CompressionLevel.Fastest);
new GZipStream(stream, CompressionLevel.Optimal);
new ZLibStream(stream, CompressionLevel.SmallestSize);

Yes, in both cases, the developer is relying on the implementation respecting those preferences, and it does. And yes, the API doesn't expose the nitty-gritty of every possible variation the underlying implementation enables; again, that was by-design, and I do not see how that makes it "wrong" or "a mess". And yes, additional APIs could be considered that did expose the more fine-grained knobs.

Which is exactly what was done when the later, lower-level, more advanced BrotliEncoder and BrotliDecoder were exposed, for scenarios where a developer is more interested in managing more themselves. I don't know if those APIs have proven to be useful or not, but if they have been and the scenarios exist for this, I don't see anything wrong with providing parallel ZLibEncoder and ZLibDecoder structs; that would be reasonable to propose.

@xoofx
Member

xoofx commented Sep 25, 2020

Which is exactly what was done when the later, lower-level, more advanced BrotliEncoder and BrotliDecoder were exposed, for scenarios where a developer is more interested in managing more themselves. I don't know if those APIs have proven to be useful or not, but if they have been and the scenarios exist for this, I don't see anything wrong with providing parallel ZLibEncoder and ZLibDecoder structs; that would be reasonable to propose.

I concur with that approach. Instead of exposing the underlying libraries only through restrictive, opinionated APIs/knobs, providing bare-minimum low-level wrappers for each compression library would enable more fine-grained control.

@kamronbatman
Contributor

A ZLibEncoder with no headers/footers which ran at least close to as fast as the native zlib library would work for my use-case. I can't really justify the use of a stream, especially since most buffers are stackalloc or rented memory of <= 64k.
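Worth noting for this headerless scenario: DeflateStream already emits a raw RFC 1951 stream with no zlib/gzip framing at all, so buffer-to-buffer deflate without headers or footers is possible today (though still via streams rather than a span-based encoder). A minimal sketch:

```csharp
using System.IO;
using System.IO.Compression;

static class RawDeflate
{
    // DeflateStream produces a raw RFC 1951 stream: no header, no trailer.
    public static byte[] Compress(byte[] data)
    {
        using (var ms = new MemoryStream())
        {
            using (var ds = new DeflateStream(ms, CompressionLevel.Fastest))
            {
                ds.Write(data, 0, data.Length);
            }
            // ToArray is valid even after the MemoryStream is closed.
            return ms.ToArray();
        }
    }

    public static byte[] Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var ds = new DeflateStream(input, CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            ds.CopyTo(output);
            return output.ToArray();
        }
    }
}
```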

@JimBobSquarePants

JimBobSquarePants commented Sep 28, 2020

The point I'm trying to make is that if I've chosen a specific compression algorithm I am already in the advanced scenario. I've reviewed each format and have chosen one based upon my requirements. I should now be able to use the basic features of that format without having to use a separate struct. It's like choosing jpeg over png, or json over xml.

I have no issue at all with having separate structs that work directly against a span for specific high-performance scenarios. I simply consider them a separate concern from a basic ZLibStream implementation that should be designed and implemented separately in their own time.

I strongly believe that the addition of constructors and properties that allow basic Zlib functionality should not require opening a new issue.

So that said I would like to have the following additional constructor:

public ZLibStream(Stream stream, ZlibCompressionLevel compressionLevel, ZlibCompressionStrategy strategy)

I've chosen to omit the option for suppressing the header for now, as that is considered an advanced use case for zlib (passing a negative value for windowBits omits the header) and may be a better fit for the separate-structs scenario.

Where ZlibCompressionLevel and ZlibCompressionStrategy are the following. You already have partial implementations in the codebase as CompressionLevel and CompressionStrategy in ZlibNative.

public enum ZlibCompressionLevel : int
{
    DefaultCompression = -1,
    Level0 = 0,
    NoCompression = Level0,
    Level1 = 1,
    BestSpeed = Level1,
    Level2 = 2,
    Level3 = 3,
    Level4 = 4,
    Level5 = 5,
    Level6 = 6,
    Level7 = 7,
    Level8 = 8,
    Level9 = 9,
    BestCompression = Level9,
}

public enum ZlibCompressionStrategy : int
{
    DefaultStrategy = 0,
    Filtered = 1,
    HuffmanOnly = 2,
    Rle = 3,
    Fixed = 4
}

There's also an additional property that should be considered:

public ZlibFlushMode FlushMode { get; set; }

Defined as the following. There's already a partial implementation, ZFlushCode, in the current codebase in ZlibNative.

public enum ZlibFlushMode : int
{
    NoFlush = 0,
    PartialFlush = 1,
    SyncFlush = 2,
    FullFlush = 3,
    Finish = 4,
}

This property should be passed to every call to the native deflate and inflate methods.

Below is the documentation from ZLib regarding the property.

Normally the parameter flush is set to Z_NO_FLUSH, which allows deflate to decide how much data to accumulate before producing output, in order to maximize compression.

If the parameter flush is set to Z_SYNC_FLUSH, all pending output is flushed to the output buffer and the output is aligned on a byte boundary, so that the decompressor can get all input data available so far. (In particular avail_in is zero after the call if enough output space has been provided before the call.) Flushing may degrade compression for some compression algorithms and so it should be used only when necessary. This completes the current deflate block and follows it with an empty stored block that is three bits plus filler bits to the next byte, followed by four bytes (00 00 ff ff).

If flush is set to Z_PARTIAL_FLUSH, all pending output is flushed to the output buffer, but the output is not aligned to a byte boundary. All of the input data so far will be available to the decompressor, as for Z_SYNC_FLUSH. This completes the current deflate block and follows it with an empty fixed codes block that is 10 bits long. This assures that enough bytes are output in order for the decompressor to finish the block before the empty fixed codes block.

If flush is set to Z_BLOCK, a deflate block is completed and emitted, as for Z_SYNC_FLUSH, but the output is not aligned on a byte boundary, and up to seven bits of the current block are held to be written as the next byte after the next deflate block is completed. In this case, the decompressor may not be provided enough bits at this point in order to complete decompression of the data provided so far to the compressor. It may need to wait for the next block to be emitted. This is for advanced applications that need to control the emission of deflate blocks.

If flush is set to Z_FULL_FLUSH, all output is flushed as with Z_SYNC_FLUSH, and the compression state is reset so that decompression can restart from this point if previous compressed data has been damaged or if random access is desired. Using Z_FULL_FLUSH too often can seriously degrade compression.

If deflate returns with avail_out == 0, this function must be called again with the same value of the flush parameter and more output space (updated avail_out), until the flush is complete (deflate returns with non-zero avail_out). In the case of a Z_FULL_FLUSH or Z_SYNC_FLUSH, make sure that avail_out is greater than six to avoid repeated flush markers due to avail_out == 0 on return.

If the parameter flush is set to Z_FINISH, pending input is processed, pending output is flushed and deflate returns with Z_STREAM_END if there was enough output space. If deflate returns with Z_OK or Z_BUF_ERROR, this function must be called again with Z_FINISH and more output space (updated avail_out) but no more input data, until it returns with Z_STREAM_END or an error. After deflate has returned Z_STREAM_END, the only possible operations on the stream are deflateReset or deflateEnd.

Z_FINISH can be used in the first deflate call after deflateInit if all the compression is to be done in a single step. In order to complete in one call, avail_out must be at least the value returned by deflateBound (see below). Then deflate is guaranteed to return Z_STREAM_END. If not enough output space is provided, deflate will not return Z_STREAM_END, and it must be called again as described above.
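As an aside, the Z_SYNC_FLUSH behavior quoted above can already be observed through DeflateStream: on .NET Core its Flush performs a sync flush of the underlying zlib stream (an implementation detail, not a documented guarantee), so the flushed output ends with the empty stored-block marker 00 00 FF FF:

```csharp
using System;
using System.IO;
using System.IO.Compression;

using var ms = new MemoryStream();
using var ds = new DeflateStream(ms, CompressionLevel.Optimal, leaveOpen: true);

ds.Write(new byte[] { 1, 2, 3 }, 0, 3);
ds.Flush(); // on .NET Core: zlib sync flush (all pending output, byte-aligned)

byte[] soFar = ms.ToArray();
// Per the zlib documentation above, a sync flush ends with the empty
// stored-block marker 00 00 FF FF, after which a decompressor can
// consume all input provided so far without waiting for stream close.
```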

As a side note, I would suggest similar additions be made to GZipStream and BrotliStream; however, I have no expertise in that area so cannot suggest what should be added other than individual compression levels.

@stephentoub
Member

stephentoub commented Sep 28, 2020

I strongly believe that the addition of constructors and properties that allow basic Zlib functionality should not require opening a new issue.

This issue was opened in Jan. It was approved in July. There is already a PR out implementing it. The additional suggestions, which don't negate any of the existing API, weren't explicitly proposed until just now (you commented previously briefly about desired functionality, but without any specificity or background), will still need to be iterated on, and will need to be reviewed separately. And it should ideally be done in a way where all the compression streams gain similar functionality, even if the respective ctors take slightly different arguments or give slightly different meanings to those arguments. Please open a new issue dedicated to the new surface area being proposed; it is the process we utilize in this repo. Thank you.

cc: @terrajobst

@JimBobSquarePants

#2236 (comment)

Now you're just being obtuse. Nobody replied to my initial comment.

@terrajobst
Member

I think we'd like to standardize this across all the compression streams (Brotli, Deflate, Gzip, and Zlib). I think it would be good to make this a separate issue because ideally all streams can follow a similar pattern. For example, we could decide to have separate enums, or we could decide to have an XxxOptions where the level is just a member, maybe even of type int. I have filed #42820 to track this.

There is also a desire to unblock the PR for ZLibStream, because keeping the PR open is just pain for us (the longer we wait, the harder it gets to merge).

@shravan2x

shravan2x commented Oct 13, 2020

@terrajobst @stephentoub Since 3.1 is LTS, is there any chance of backported support for this?

EDIT: I would understand if y'all just don't want to spend the extra time on this small feature though.

@terrajobst
Member

terrajobst commented Oct 13, 2020

Unless there is a super strong business reason we generally don't backport features/new APIs. In this case the answer is most likely no.

@terrajobst
Member

terrajobst commented Oct 13, 2020

Oh GitHub buttons. One day I will not mess them up. But today is not that day.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020