Support for tar archives #1540

Closed
bogdanteleaga opened this issue Sep 15, 2015 · 24 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression User Story A single user-facing feature. Can be grouped under an epic.

@bogdanteleaga

bogdanteleaga commented Sep 15, 2015

Update: New proposal here: #65951

Summary

The TAR archive format is commonly used in Unix/Linux-native workloads. .NET applications should be able to produce and consume these archives with built-in APIs that support the most frequently used TAR features and variations.

API Proposal

Reading APIs

We could gradually offer functionality. Initially, we must offer APIs that can read archives.

namespace System.IO.Compression
{
    public class TarArchive : IDisposable
    {
        public TarOptions Options { get; }
        public TarArchive(Stream stream, TarOptions? options);
        public bool TryGetNextEntry(out TarArchiveEntry? entry);
        public void Dispose();
        protected virtual void Dispose(bool disposing);
    }
    public class TarArchiveEntry
    {
        public TarArchiveEntry(TarArchive archive, string fullName);
        public string FullName { get; }
        public string LinkName { get; }
        public int Mode { get; }
        public int Uid { get; }
        public int Gid { get; }
        public string UName { get; }
        public string GName { get; }
        public int DevMajor { get; }
        public int DevMinor { get; }
        public long Length { get; }
        public DateTime LastWriteTime { get; }
        public int CheckSum { get; }
        public TarArchiveEntryType EntryType { get; }
        public Stream Open();
        public override string ToString();
    }
    public enum TarMode
    {
        Read = 0
    }
    public class TarOptions
    {
        public TarMode Mode { get; set; }
        public bool LeaveOpen { get; set; }
        public Encoding EntryNameEncoding { get; set; }
        public TarOptions();
    }
    public enum TarArchiveEntryType
    {
        Normal, // Old normal: \0, new normal: 0
        Link, // 1
        SymbolicLink, // 2
        Character, // 3
        Block, // 4
        Directory, // 5
        Fifo, // 6
        Contiguous, // 7
        LongLink, // L
    }
}
Writing APIs

The next step would be to add writing capabilities:

  • Creating archives
  • Adding new entries
  • Deleting existing entries
namespace System.IO.Compression
{
    public class TarArchive : IDisposable
    {
        public void AddEntry(TarArchiveEntry entry) { }
    }
    public class TarArchiveEntry
    {
        public int Mode { set; }
        public int Uid { set; }
        public int Gid { set; }
        public string UName { set; }
        public string GName { set; }
        public int DevMajor { set; }
        public int DevMinor { set; }
        public DateTime LastWriteTime { set; }
        public TarArchiveEntryType EntryType { set; }
        public void Delete() { }
    }
    public enum TarMode
    {
        Create = 1,
        Update = 2
    }
}
Static APIs - Sync

These APIs are heavily inspired by the ZipFile APIs.

namespace System.IO.Compression
{
    public static class TarFile
    {
        public static void CreateFromDirectory(string sourceDirectoryName, string destinationArchiveFileName);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding);
        public static void ExtractToDirectory(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles);
    }
}

The following static methods would be powerful because they could decompress the file and then read the tar archive inside it.

From an API design perspective, we are unsure whether it makes sense to mix the two purposes.

namespace System.IO.Compression
{
    public enum CompressionMethod
    {
        None,
        GZip,
        Deflate,
        Brotli,
        ZLib,
        // More in the future
    }
    public class TarFileOptions
    {
        public TarFileOptions();
        public CompressionMethod Method { get; set; } // Default is None
        public TarMode Mode { get; set; } // Default is Read
        public Encoding EntryNameEncoding { get; set; } // Default is ASCII
    }
    public static class TarFile
    {
        public static TarArchive Open(string archiveFileName, TarFileOptions? options);
        public static TarArchive OpenRead(string archiveFileName, CompressionMethod compressionMethod);
    }
}
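
As a rough illustration of how an Open overload might choose a decompressor before handing the stream to the tar reader, here is a minimal sketch. The `CompressionMethod` enum below mirrors the one proposed above; `DecompressionStreams.Wrap` is a hypothetical helper name, not part of the proposal. It uses only stream types that already exist in System.IO.Compression (`ZLibStream` requires .NET 6 or later).

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Mirrors the enum from the proposal above; redeclared so the sketch compiles on its own.
enum CompressionMethod { None, GZip, Deflate, Brotli, ZLib }

static class DecompressionStreams
{
    // Hypothetical helper: wraps the source in the matching decompression stream.
    // The returned stream is forward-only unless the method is None.
    public static Stream Wrap(Stream source, CompressionMethod method) => method switch
    {
        CompressionMethod.None => source,
        CompressionMethod.GZip => new GZipStream(source, CompressionMode.Decompress),
        CompressionMethod.Deflate => new DeflateStream(source, CompressionMode.Decompress),
        CompressionMethod.Brotli => new BrotliStream(source, CompressionMode.Decompress),
        CompressionMethod.ZLib => new ZLibStream(source, CompressionMode.Decompress),
        _ => throw new ArgumentOutOfRangeException(nameof(method)),
    };
}
```

Because every compressed case returns a non-seekable stream, a design like this only works if the tar reader never needs to seek.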
Static APIs - Async
namespace System.IO.Compression
{
    public static class TarFile
    {
        public static ValueTask CreateFromDirectoryAsync(string sourceDirectoryName, string destinationArchiveFileName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(string sourceArchiveFileName, string destinationDirectoryName, Encoding? entryNameEncoding, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask<TarArchive> OpenAsync(string archiveFileName, TarFileOptions options, CancellationToken cancellationToken);
        public static ValueTask<TarArchive> OpenReadAsync(string archiveFileName, CompressionMethod compressionMethod, CancellationToken cancellationToken);
    }
}
Static extension APIs - Sync

These extension APIs are similar to the ZipArchiveEntry ones.

We could directly add these methods to the TarArchiveEntry class instead of making them extensions, since we are currently designing it all at the same time.

The overwriteFiles boolean argument should be clearly documented with warnings about potential tarbomb behavior.

namespace System.IO.Compression
{
    public static class TarFileExtensions
    {
        public static TarArchiveEntry CreateEntryFromFile(this TarArchive destination, string sourceFileName, string entryName);
        public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName);
        public static void ExtractToDirectory(this TarArchive source, string destinationDirectoryName, bool overwriteFiles);
        public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName);
        public static void ExtractToFile(this TarArchiveEntry source, string destinationFileName, bool overwrite);
    }
}
Static extension APIs - Async
namespace System.IO.Compression
{
    public static class TarFileExtensions
    {
        public static ValueTask<TarArchiveEntry> CreateEntryFromFileAsync(this TarArchive destination, string sourceFileName, string entryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, CancellationToken cancellationToken);
        public static ValueTask ExtractToDirectoryAsync(this TarArchive source, string destinationDirectoryName, bool overwriteFiles, CancellationToken cancellationToken);
        public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, CancellationToken cancellationToken);
        public static ValueTask ExtractToFileAsync(this TarArchiveEntry source, string destinationFileName, bool overwrite, CancellationToken cancellationToken);
    }
}

Usage examples

Here is a basic example of opening a tar.gz file for reading. First we decompress the gzip, then we read the archive.

using FileStream fs = File.OpenRead("file.tar.gz");

using var decompressor = new GZipStream(fs, CompressionMode.Decompress);
using var decompressedStream = new MemoryStream();
decompressor.CopyTo(decompressedStream);
decompressedStream.Position = 0; // Rewind so the archive is read from the start

var options = new TarOptions { Mode = TarMode.Read };
using var archive = new TarArchive(decompressedStream, options);

while (archive.TryGetNextEntry(out TarArchiveEntry? entry))
{
    Console.WriteLine(entry.FullName);
}

TODO: More examples to come.

Tar format description

Optional read. Feel free to skip.

A tar archive is a linear sequence of blocks. Each block consists of a header and the file contents described by that header.

The blocks are aligned to a fixed size, usually 512 bytes. In other words, the total size of a block (header plus file contents) must be a multiple of 512, which is achieved by appending trailing null bytes to the file contents when necessary.

The header describes the metadata of the file contents (filename, mode, uid, gid, size, last modification time, etc.). The size of a header is fixed. Its fields all have a predefined max size.
The file contents can be 0 or more raw bytes, representing the contents of the file.

If the block represents a directory, the file contents are usually empty (size 0). The size is nonzero when the block contains a list of the filesystem entries inside that directory, which some tar format versions allow.

A tar archive is navigated by jumping from header to header. The beginning of the next header can be found by adding up the fixed size of a header plus the size of the file contents, minding the block size padding.
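
The offset arithmetic described above can be sketched as follows. `TarLayout` and its members are hypothetical names used only for illustration, and the sketch assumes the usual 512-byte block size.

```csharp
using System;

// Hypothetical helper (not part of the proposal): computes where the next
// header starts, given the current header offset and the entry's content size.
static class TarLayout
{
    public const int BlockSize = 512; // size of a header, and the padding unit

    // Rounds the content size up to the next multiple of BlockSize.
    public static long PaddedSize(long contentSize) =>
        (contentSize + BlockSize - 1) / BlockSize * BlockSize;

    // Next header = current header offset + one header block + padded contents.
    public static long NextHeaderOffset(long currentHeaderOffset, long contentSize) =>
        currentHeaderOffset + BlockSize + PaddedSize(contentSize);
}
```

For instance, a 700-byte file whose header starts at offset 0 has its contents padded to 1024 bytes, so the next header starts at offset 1536.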

Tar archives do not contain a central directory like zip archives. A zip central directory is an uncompressed region of the zip archive that indicates the total number of files in the archive. If the user wants to know the total number of files contained in a tar archive, the whole archive needs to be traversed to count the total number of block headers found.

The tar spec was not designed to include compression capabilities, but tars are commonly combined with a compression method. The most popular method is to first generate the tar file, then compress it, usually with GZip (.tar.gz) or with LZMA (.tar.xz). While this method simplifies and separates the archival and compression stages, it also means that the only way the user can read the contents of the tar file is by decompressing it first.
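
To make the "decompress first" point concrete, here is a minimal sketch that lists the entries of a .tar.gz in a single forward pass, without buffering the whole archive. It relies only on existing .NET APIs (`GZipStream`), not on the proposed ones. `TarGzLister` is a hypothetical name, and the sketch makes simplifying assumptions: 512-byte blocks, names that fit in the 100-byte name field (the ustar prefix field, GNU LongLink entries and pax extended headers are not interpreted), sizes stored as plain octal digits, and no checksum validation. It stops at the first all-zero block.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;

static class TarGzLister
{
    const int BlockSize = 512;

    // Yields the name and size of each entry, reading strictly forward.
    public static IEnumerable<(string Name, long Size)> ListEntries(Stream tarGz)
    {
        using var tar = new GZipStream(tarGz, CompressionMode.Decompress, leaveOpen: true);
        byte[] block = new byte[BlockSize];

        while (ReadBlock(tar, block) && !Array.TrueForAll(block, b => b == 0))
        {
            string name = ReadString(block, 0, 100);   // name field: offset 0, 100 bytes
            long size = ReadOctal(block, 124, 12);     // size field: offset 124, 12 bytes
            yield return (name, size);

            // Skip contents plus padding by reading, because GZipStream cannot seek.
            long remaining = (size + BlockSize - 1) / BlockSize * BlockSize;
            while (remaining > 0 && ReadBlock(tar, block))
            {
                remaining -= BlockSize;
            }
        }
    }

    // Fills the whole buffer, or returns false if the stream ends first.
    static bool ReadBlock(Stream stream, byte[] block)
    {
        int total = 0;
        while (total < block.Length)
        {
            int read = stream.Read(block, total, block.Length - total);
            if (read == 0) return false;
            total += read;
        }
        return true;
    }

    // Reads an ASCII field up to its first NUL byte (or the full field length).
    static string ReadString(byte[] block, int offset, int length)
    {
        int end = Array.IndexOf(block, (byte)0, offset, length);
        int count = end < 0 ? length : end - offset;
        return Encoding.ASCII.GetString(block, offset, count);
    }

    // Numeric fields are ASCII octal digits, padded with spaces or NULs.
    static long ReadOctal(byte[] block, int offset, int length)
    {
        string text = ReadString(block, offset, length).Trim();
        return text.Length == 0 ? 0 : Convert.ToInt64(text, 8);
    }
}
```

Skipping by reading keeps memory use bounded by one block, at the cost of decompressing every byte of the contents even when only the names are wanted.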

Another, less common method is to compress the file contents individually, leaving the headers readable. It is less common because the header offers no field to indicate which compression method was used for each file contents block, so the user needs to preserve that information somewhere else.

There are multiple versions of the tar format: v7, ustar, pax, gnu, oldgnu, solaris, aix, macosx. We should focus on v7, ustar, pax and gnu.

Sources:

Open questions

Tar versions

  • Should we implement the different tar versions separately? As Ian suggested above, we could gradually add support to the more complex ones:
    • Add TarArchive/TarArchiveEntry with V7&UStar support and format header detection. Throw error for unsupported formats (e.g. GNU, PAX)
    • Add tar archival/de-archival support for GNU tar
    • Add tar archival/de-archival support for PAX/Posix tar
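
As a hedged sketch of the "format header detection" step, the magic and version fields of a header (offset 257, 6 bytes, followed by 2 bytes at offset 263) can be compared against the known POSIX and GNU values. `TarHeaderFormat` and `TarFormatDetector` are hypothetical names. Note that pax archives use the same magic as ustar and are distinguished by their entry type flags, so this check alone cannot tell them apart.

```csharp
using System;
using System.Text;

enum TarHeaderFormat { V7OrUnknown, PosixUstar, Gnu }

static class TarFormatDetector
{
    // POSIX ustar: magic "ustar" + NUL, then version "00".
    static readonly byte[] PosixMagic = Encoding.ASCII.GetBytes("ustar\0" + "00");
    // GNU/OLDGNU: "ustar" followed by two spaces and a NUL.
    static readonly byte[] GnuMagic = Encoding.ASCII.GetBytes("ustar  \0");

    // Expects a full 512-byte header block.
    public static TarHeaderFormat Detect(ReadOnlySpan<byte> header)
    {
        ReadOnlySpan<byte> magicAndVersion = header.Slice(257, 8);
        if (magicAndVersion.SequenceEqual(PosixMagic)) return TarHeaderFormat.PosixUstar;
        if (magicAndVersion.SequenceEqual(GnuMagic)) return TarHeaderFormat.Gnu;
        return TarHeaderFormat.V7OrUnknown; // v7 headers carry no magic at all
    }
}
```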

Assembly

  • In which assembly should the stream-based APIs live?
    • System.IO.Compression
    • System.IO.Compression.Tar
  • In which assembly should the static APIs live?
    • System.IO.Compression
    • System.IO.Compression.TarFile (similar to ZipFile)

TarArchiveEntry

  • Do we need to expose Mode, Uid, Gid, UName, GName?
  • How commonly used are DevMajor and DevMinor? Do users need these properties to be exposed at all?
  • Do we need the EntryType property? I'd say yes, especially because some entries are LongLink and the actual entry is expected to be located in the next position.
    • If we do, should the values of the enum be the exact values that can be found in the tar header, or should we assign default values, then map them internally to the actual value?
    • If the user adds an entry, what EntryType values should be allowed? Can the user programmatically add a Block, Fifo, Contiguous, or Character entry?
  • We use FullName to be consistent with other full path properties. But should we instead use FullPath?
  • Mode, Uid and Gid are exposed in base 10, but they will be converted to base 8 internally, since tar headers store them as octal digits.
  • How should we distinguish between the end of file and a corrupt entry when calling TryGetNextEntry? EOF is marked in a tar with two 512-byte blocks filled with nulls.
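
One possible way to approach the last two points, sketched under assumptions: numeric header fields are parsed as ASCII octal digits, and a 512-byte block is classified as an end marker when it is all nulls, as a header when its checksum matches, and as corrupt otherwise. The checksum field sits at offset 148 (8 bytes) and is computed as the sum of all header bytes with the checksum field itself counted as spaces. `TarHeaderChecks` and `BlockKind` are hypothetical names, not part of the proposal.

```csharp
using System;

static class TarHeaderChecks
{
    public const int BlockSize = 512;

    public enum BlockKind { Header, EndMarker, Corrupt }

    // An all-null block is (half of) the end-of-archive marker.
    public static bool IsZeroBlock(ReadOnlySpan<byte> block)
    {
        foreach (byte b in block)
        {
            if (b != 0) return false;
        }
        return true;
    }

    // Parses ASCII octal digits, skipping leading spaces and stopping at the
    // first byte that is not an octal digit (typically NUL or space).
    public static long ParseOctal(ReadOnlySpan<byte> field)
    {
        int i = 0;
        while (i < field.Length && field[i] == (byte)' ') i++;
        long value = 0;
        for (; i < field.Length && field[i] >= (byte)'0' && field[i] <= (byte)'7'; i++)
        {
            value = value * 8 + (field[i] - (byte)'0');
        }
        return value;
    }

    // Sums all 512 header bytes, counting the checksum field as spaces,
    // and compares the result with the stored octal checksum.
    public static bool HasValidChecksum(ReadOnlySpan<byte> header)
    {
        const int ChecksumOffset = 148, ChecksumLength = 8;
        long sum = 0;
        for (int i = 0; i < BlockSize; i++)
        {
            bool inChecksumField = i >= ChecksumOffset && i < ChecksumOffset + ChecksumLength;
            sum += inChecksumField ? (byte)' ' : header[i];
        }
        return sum == ParseOctal(header.Slice(ChecksumOffset, ChecksumLength));
    }

    // The zero-block test must come first: an all-null block never passes the checksum test.
    public static BlockKind Classify(ReadOnlySpan<byte> block) =>
        IsZeroBlock(block) ? BlockKind.EndMarker
        : HasValidChecksum(block) ? BlockKind.Header
        : BlockKind.Corrupt;
}
```

For instance, a Mode of 420 in base 10 is stored in the header as the octal digits 644. With a classification like this, TryGetNextEntry could return false on an end marker and throw on a corrupt block, though that choice remains an open question.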

TarOptions

  • Should the properties be settable, or should they be init? Consider that the TarArchive would cache it, but it may not make sense for the user to be able to change the value of the cached options.

Static APIs

  • The static extension APIs are inspired by the Zip ones. Since these are all being added together, maybe we don't need them as extensions, but they can be part of the class they extend. Thoughts?
  • There are many shared fields in the different overloads. Should we have separate classes (similar to TarOptions), one for extraction and another for creation? We can pass an instance of this class as an argument, and have only one method, instead of several overloads. This would be helpful in case we grow the options in the future.

Compression

  • Notice that none of the APIs offer compression. Should we add static methods that allow the user to create a compressed tar file, and let them choose the desired algorithm?
  • .NET currently only offers GZip, Deflate, Brotli and ZLib. We are considering adding support for ZStandard and LZMA. We could consider adding static APIs that allow composability with external compression stream-based APIs like those offered by SharpCompress or SharpZipLib.

Security

  • Notice there is no Entries property, like in zip. This is because we don't have a central directory. If we receive a network stream, we wouldn't be able to know the Count.
  • Internally, we would only cache the list of visited entries if the TarArchive is opened in Create or Update mode. This is because the assumption is that we will modify the tar file on dispose, either because we want to add new entries, or because we want to delete existing entries.
    • One thing we could do is add an Entries property that can only be used if the stream is a seekable FileStream, in which case we can use the new RandomAccess APIs to get the files.
  • If the user opens an existing tar file, and an existing entry has a Mode, UName and/or GName that does not match that of the current user, should we allow the user to read/update/delete/extract that entry, or should we forbid access to it?
  • Tarbombs happen when a tar file is extracted into an existing directory and overwrites existing files.
    • Problems also arise when an entry has an absolute path, because extracting it could overwrite a system file.
    • The fact that tars can contain symbolic links can also be problematic if it is expected to extract files into a symlinked folder. By default, we should not follow symlinks.
    • Tar files allow multiple entries with an identical path and filename. Tarbomb behavior could occur if the first extracted entry is a symlink and the next one is a regular file, in which case the second file could end up being written to the target location of the symlink. We should avoid such behavior. One possible solution is to cache all the names, check whether each one already exists, and extract subsequent duplicates with a suffix in their name. Another option is to throw.
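
A minimal sketch of the path checks discussed above, assuming the "throw" option for duplicates. `SafeExtraction.ResolveEntryPath` is a hypothetical helper, not part of the proposal. It rejects entries that would resolve outside the destination directory (absolute paths or ".." segments) and entries whose resolved path was already seen. It does not inspect symlinks that already exist on disk, which would need a separate check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class SafeExtraction
{
    // Returns the full destination path for an entry, or throws if the entry
    // would escape destinationDirectory or repeats an earlier entry's path.
    public static string ResolveEntryPath(
        string destinationDirectory, string entryName, ISet<string> seenPaths)
    {
        string root = Path.GetFullPath(destinationDirectory);
        if (!root.EndsWith(Path.DirectorySeparatorChar))
        {
            root += Path.DirectorySeparatorChar;
        }

        // Path.Combine returns entryName unchanged when it is rooted, so the
        // prefix check below also rejects absolute entry paths.
        string candidate = Path.GetFullPath(Path.Combine(root, entryName));
        if (!candidate.StartsWith(root, StringComparison.Ordinal))
        {
            throw new IOException($"Entry '{entryName}' would be extracted outside '{root}'.");
        }

        // Set.Add returns false when the path was already recorded.
        if (!seenPaths.Add(candidate))
        {
            throw new IOException($"Entry '{entryName}' appears more than once in the archive.");
        }

        return candidate;
    }
}
```

Tracking seen paths requires memory proportional to the number of entries, which is a trade-off for very large archives read from a network stream.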

Testing

  • We have an initial set of files in dotnet/runtime-assets created with the Ubuntu tar command, which generates gnutar files.
  • @adamhathcock would it be ok with you if we reuse the test tar files you have in your repo? You have a good selection of test cases.
@ianhays
Contributor

ianhays commented Dec 17, 2015

Right now corefx supports zip files as well as gz files. Would it be hard to get it to support tar files as well for compatibility with the other OS's who package files as tgz very often?

We don't have any existing tar code to leverage, if that's what you mean. Tar is a pretty different format from Zip (particularly since it doesn't compress) so we would need to start mostly from scratch. That's not to say it isn't worthwhile, though. I'd love to be able to handle all popular compression formats.

I'm not sure though where we would even want to put this if we did add it. System.IO.Compression only kind of makes sense since a tar doesn't compress. I guess FileSystem? tar is so frequently associated with gzip it seems incorrect to not place it alongside it.

If this is something that is desired I could work on designing an API, but it shouldn't be hard to visualize what it might look like.

In my opinion it would be ideal for it to be as similar to ZipArchive as possible.

@ianhays
Contributor

ianhays commented Jun 27, 2016

@ericstj @jasonwilliams200OK

@ghost

ghost commented Jun 27, 2016

Thanks @ianhays.

From dotnet/corefx#9673:

Also .bz2 if possible (http://www.bzip.org/1.0.3/html/zlib-compat.html). :)

Usually bz2 is the compressed format which contains a tarball (as bz2 only compresses one file: https://en.wikipedia.org/wiki/Bzip2). We can probably use the same API methods to support bz2 (except for some format specific settings).

This way, tarball expansion / contraction might make sense in S.I.C as part of bz2 (or even zip) compression / decompression.

@jstarks

jstarks commented Aug 24, 2016

@ianhays I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.

https://github.com/Microsoft/Docker-PowerShell/tree/master/src/Tar

@jstarks

jstarks commented Aug 24, 2016

I should also note that tar archives are fundamentally different from zip archives in that they are stream-oriented and do not contain a central directory of files. This means that both TarArchive.GetEntry and TarArchiveEntry.Open are unnatural: to implement either, you have to require a seekable stream, or you have to buffer the contents of the entire archive into memory or a temporary file (which obviously is uncompetitive from a performance perspective). And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.

The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.

@ianhays
Contributor

ianhays commented Aug 25, 2016

And it's not realistic to require a seekable stream, since you'll want to support decompressing and extracting tar.gz files in one pass, and decompressors such as GZipStream are not seekable.

In my above comment (under "Implementation") I was operating under the tentative plan that Entry indexing would require a seekable stream or throw an exception if it wasn't, but as you said this isn't likely to be frequently done since it will nearly always be wrapped in a GZip or LZMA stream. It may be worth adding anyways to cover edge cases, but I doubt it. Enumerating entries would be the preferred way of reading the archive.

The reality is that tar demands a very different interface from zip. You may want to look at my implementation for some ideas of what works better with tar.

The nice thing about not having a common parent with ZipArchive is that we can diverge the interface where it's necessary. While it would be ideal to have the API be similar, it isn't required. That said, I think we can at least keep the TarArchive/TarArchiveEntry structure if we just make some tweaks.

I recently implemented a set of classes for manipulating tar files, including ustar and PAX (but not GNU, IIRC) support. We needed this for our Docker PowerShell cmdlets. They might be a good starting point for this work.

Thanks @jstarks, that looks very close to what I had in mind with the exception of some minor API differences (e.g. a unified TarArchive class rather than a TarReader/TarWriter, IEnumerable Entries, disposable TarArchive) and of course the removal of indexing entries. Regardless of the structure though, the implementation looks good and could be easily adapted into the code base with some minor tweaks.

@ianhays ianhays removed their assignment Oct 10, 2016
@danmoseley
Member

Per discussion with @ianhays, a conservative estimate for all this is 5 weeks; if it were forward-only, it would take less time.

@ravarnamsft

+1 on requesting this support on .NET Core. This would be super helpful for any cloud service that's trying to untar developer-uploaded files from a Linux machine. We are one of them (we use .NET Core). We would really prefer to avoid picking third-party libraries or resorting to running a shell process to achieve this. This becomes more crucial because file permissions on zip archives are not set correctly for files archived on Linux. Given that neither this nor zip is fully functional out of the box on Linux, it is hard to support the Linux platform on our service in a clean way.

@carlossanlop
Member

Triage:
We are interested in doing this. We would also like to add tarball (tar.gz) support, and have scenario parity with ZipFiles.
Next step: Finish the API design and bring it for review.

@adamhathcock

As the author of Sharpcompress, my key wish for this and Zip is to make the API stream oriented and not require a seekable stream.

Then I can base my code on yours or not support it altogether!

@carlossanlop
Member

Transferring to the dotnet/runtime repo.

@carlossanlop carlossanlop transferred this issue from dotnet/corefx Jan 9, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Jan 9, 2020
@carlossanlop carlossanlop added this to the 5.0 milestone Jan 9, 2020
@carlossanlop carlossanlop added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO.Compression and removed untriaged New issue has not been triaged by the area owner labels Jan 9, 2020
@stephentoub stephentoub modified the milestones: 5.0, Future Feb 15, 2020
@iamcarbon

Any chance this may be prioritized for .NET 6?

@deinok

deinok commented Dec 10, 2020

@stephentoub @danmosemsft Can we make an API proposal based on @ianhays and @adamhathcock ?
In case I want to move this forward, should I create a new Issue for the API proposal?

@danmoseley
Member

danmoseley commented Dec 10, 2020

@deinok it's your preference, we usually like to have the latest proposal maintained in the top post, so in this case it may be easiest to create a new proposal using the template below and link to this which we can close, but you can also maintain it in a new comment here.
https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md&title=

To approve the API we also need to have a pretty good idea of how we would implement it. In the platform, it needs to be fast and stable. Any parser needs to be secure and well fuzz-tested. There's some existing code mentioned above; it would be interesting to know whether it is compatibly licensed, cross-platform, high performance, stable, etc., or whether the right approach would be to reimplement, using existing code as appropriate. Realistically, if it were to be in .NET 6, it would likely be a community effort.

It does seem reasonable to me in principle to have in the platform notwithstanding that.

@danmoseley
Member

I should note that @ericstj and @carlossanlop own this area not me. Note that @ianhays is no longer on the team - but if you're reading this Ian, I hope all is well with you!

@deinok

deinok commented Dec 12, 2020

Okay, thanks for the help @danmosemsft. I will make a proposal in a new issue. Until then, let's keep this open.

@adamhathcock

I'm happy to help design and write/donate code for various algorithms and formats as I'm interested in a streaming API.

I looked at reusing the current zip implementation in the runtime but it's all file or seeking based.

@carlossanlop carlossanlop self-assigned this Oct 28, 2021
@carlossanlop
Member

carlossanlop commented Oct 28, 2021

The issue description has been updated based on the initial proposal by @ianhays, and the feedback we received.

@danmoseley
Member

@carlossanlop I wonder whether it's worth moving the above into a new issue, and linking to/closing this one? That way it's the topmost entry.

@piksel

piksel commented Oct 29, 2021

@carlossanlop First of all I'd like to state that both me and @Numpsy have done bugfixes etc. on the Tar APIs, but we are not the authors. The code base is almost 20 years old... (😲).
I think the proposal looks generally sound (great work!), my only suggestion is regarding

Should we implement the different tar versions separately?

Supporting all "formats" is probably a lot easier to do from the start than it would be to add them gradually. That wouldn't mean supporting all the different extensions (which are just "unknown" file types), just the general structure. This might be what was intended by "Add tar archival/de-archival support for GNU tar".
It should probably also be aware of the OLDGNU header, which is not compatible with the POSIX format, and at least gracefully ignore it (magic "ustar  \0", with two spaces before the NUL, instead of "ustar\0" followed by the version "00").

@adamhathcock

@carlossanlop Please take from SharpCompress whatever you need.

One thing I notice, is that it's unclear if you're supporting Forward-only (network stream) scenarios for Reading and/or Writing.

Is TryGetNextEntry the only accessor for this? I would recommend having a separate Reader for this scenario to draw a clear distinction for Streaming vs RandomAccess. Also, Zip could be better implemented with this scenario as well.

I would also recommend that the Archive objects do have Entries but do what you're looking at: caching the entries and seek points if people really want to foreach over the collection. This is what SharpCompress attempts to do for formats that do not have a dictionary.

As for compression/writing, again I recommend having a separate Writer object. SharpCompress attempts to look similar to StreamWriter that writes entries one at a time to a backing Stream. No caching needed. Then I built Archive writing on top of that that holds all desired Creates/Updates in memory then only writes on save. This is similar to what you have in mind.

I do think it's easier if you cover Streaming (forward-only) only scenarios first then build Random Access ones on top of that. Zip should be retrofitted onto that. Though Zip is a bit more challenging as it only supports streaming scenarios if the file entries have a suffix trailer telling you the expected size vs having to read it from the dictionary.

Obviously, I'm biased and I think the basic strategy I've taken with SharpCompress is the best if you want to support Streaming (forward-only) scenarios. If there are things I'm missing, I'm happy to be wrong.

If you choose not to support streaming, I'd like more of the internals to be exposed for the file formats/compression algos so that I can base SharpCompress on your specific file format implementations.

@jeffhandley jeffhandley modified the milestones: Future, 7.0.0 Jan 9, 2022
@jeffhandley jeffhandley added the User Story A single user-facing feature. Can be grouped under an epic. label Jan 13, 2022
@jeffhandley
Member

@carlossanlop et al. I edited the issue description to promote the updated/latest proposal up to the top (instead of closing this and creating a new issue).

@carlossanlop
Member

Hey everyone, I created a new issue with an updated proposal to get fresh feedback: #65951

I'm closing this issue but I'll link it in the new description.

@carlossanlop carlossanlop removed their assignment Feb 28, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Mar 30, 2022