No Encoding for Name field is specified, any non-ASCII bytes will be discarded

Question

The following .NET 5.0 code using ICSharpCode.SharpZipLib

```csharp
var gzipInputStream = new GZipInputStream(sourceStream);
var tarInputStream = new TarInputStream(gzipInputStream);

var gZipOutputStream = new GZipOutputStream(destinationStream);
var tarOutputStream = new TarOutputStream(gZipOutputStream);
```

now emits warnings

[CS0618] 'TarInputStream.TarInputStream(Stream)' is obsolete: 
    'No Encoding for Name field is specified, any non-ASCII bytes will be discarded'

[CS0618] 'TarOutputStream.TarOutputStream(Stream)' is obsolete: 
    'No Encoding for Name field is specified, any non-ASCII bytes will be discarded'

What Encoding should I specify when constructing TarInputStream and TarOutputStream?

If you stored text files in the tarball you'd have to specify the same encoding used in the files. Encodings don't apply to binary files. Try `Encoding.UTF8`, which is the default in .NET Core file operations anyway — Panagiotis Kanavos, Jan 22 '21 at 16:14
@PanagiotisKanavos: _"Encodings don't apply to binary files"_ -- yes, they do, when those binary files include encoded text. In particular, the .zip archive format _does_ require a choice of encoding for dealing with the path names for items within the archive. — Peter Duniho, Jan 22 '21 at 17:02
@PeterDuniho I meant the file itself isn't affected by encodings, not its contents. This very question is about how encodings affect a tarball's contents. The GZIP data won't be affected no matter what the encoding setting is. Which points to another problem here - this produces something like `gz.tar`, not a `tar.gz` file. Unless the OP takes care with encodings, any text files inside the GZIP data may get mangled. The OP will have to extract the Encoding from the TAR file and apply it to the decompressed data. Oops — Panagiotis Kanavos, Jan 22 '21 at 17:25
@alik did you really intend to TAR a GZIPped file? Typically it's the other way round. Multiple files are combined in one TAR, then compressed with GZip. That's why you see `.tar.gz` extensions, not `gz.tar`. If you only have one file, you don't need TAR. — Panagiotis Kanavos, Jan 22 '21 at 17:26
@PanagiotisKanavos: Ah, I see. I mistakenly overlooked that the OP isn't actually dealing with a .zip archive. I guess my answer isn't relevant at all. :( — Peter Duniho, Jan 22 '21 at 17:29
@PeterDuniho I wouldn't say that. UTF8 is a good choice, but the question's code shows a bit of confusion. Why not use an *actual* GZip package for example? Perhaps the OP hadn't considered the option? Or the actual requirement is to produce a `.tar.gz` but the streams are reversed? — Panagiotis Kanavos, Jan 22 '21 at 17:34
The code reads a tar.gz file and writes a tar.gz file after replacing the content of a few files. The input stream is a tar.gz file, which is unzipped by GZipInputStream into a tar file, which is then untarred by the TarInputStream. On the output side, the files are tarred by the TarOutputStream and then gzipped into the tar.gz represented by the destinationStream. — alik, Jan 22 '21 at 20:13

Brett Caswell · Accepted Answer · 2021-01-22T21:27:06.980


The encoding you specify depends on the contents of the file and on what you are trying to achieve or support in your scenario.

~~Since it seems the default is ASCII, you actually don't 'need' to change\specify any Encoding at the moment.~~

Regarding the obsolete warning: if you're asking how to silence it and keep the default behavior, you could use the TarOutputStream.TarOutputStream(Stream, null) constructor signature.

Update (In reference to Maintainer's comments as well as responses to Github issue)

When null is specified, as in TarOutputStream.TarOutputStream(Stream, null), the default behavior of the entry encoding process is

no encoding conversion / just [copies] the lower 8 bits
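To make that concrete, here is a small sketch that uses only the .NET base class library. It imitates a "copy the lower 8 bits" conversion and compares it with a UTF-8 round trip. The names `NameEncodingDemo`, `LowByteCopy`, `LowByteRead` and `Utf8RoundTrip` are made up for illustration, and this approximates the behavior described above. It is not SharpZipLib's actual code.

```csharp
using System;
using System.Linq;
using System.Text;

public static class NameEncodingDemo
{
    // Assumption: imitates "no conversion, copy the lower 8 bits" by
    // truncating each UTF-16 char to its low byte.
    public static byte[] LowByteCopy(string name) =>
        name.Select(c => unchecked((byte)c)).ToArray();

    // Reverse direction: widen each byte back to a char.
    public static string LowByteRead(byte[] bytes) =>
        new string(bytes.Select(b => (char)b).ToArray());

    // Encode and decode the same name with an explicit UTF-8 encoding.
    public static string Utf8RoundTrip(string name) =>
        Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(name));
}
```

For the name `żółw.txt` (written in code as `"\u017C\u00F3\u0142w.txt"`), `LowByteRead(LowByteCopy(name))` yields `|óBw.txt`, because every character above U+00FF loses its high byte, while `Utf8RoundTrip(name)` returns the original name unchanged.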

Regarding the recommendation on which encoding to specify:

If you don't know what encoding might have been used, the safest bet is often to specify UTF-8

As such, my recommendation echoes that advice: call the non-obsolete constructor and specify Encoding.UTF8.

```csharp
var gzipInputStream = new GZipInputStream(sourceStream);
var tarInputStream = new TarInputStream(gzipInputStream, Encoding.UTF8);

var gZipOutputStream = new GZipOutputStream(destinationStream);
var tarOutputStream = new TarOutputStream(gZipOutputStream, Encoding.UTF8);
```
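For the scenario described in the question's comments (read a tar.gz, replace the content of a few files, copy everything else), a rough sketch might look like the following. It assumes the SharpZipLib NuGet package is referenced. `TarGzRewriter`, `Rewrite` and the `getReplacement` callback are hypothetical names introduced here, not part of the library.

```csharp
using System;
using System.IO;
using System.Text;
using ICSharpCode.SharpZipLib.GZip;
using ICSharpCode.SharpZipLib.Tar;

public static class TarGzRewriter
{
    // Hypothetical helper: copies every entry of a tar.gz stream into a new
    // tar.gz stream. getReplacement returns new content for an entry name,
    // or null to copy that entry unchanged.
    // Note: disposing the output streams also closes destinationStream.
    public static void Rewrite(Stream sourceStream, Stream destinationStream,
                               Func<string, byte[]> getReplacement)
    {
        using var gzipIn = new GZipInputStream(sourceStream);
        using var tarIn = new TarInputStream(gzipIn, Encoding.UTF8);
        using var gzipOut = new GZipOutputStream(destinationStream);
        using var tarOut = new TarOutputStream(gzipOut, Encoding.UTF8);

        TarEntry entry;
        while ((entry = tarIn.GetNextEntry()) != null)
        {
            byte[] replacement = entry.IsDirectory ? null : getReplacement(entry.Name);
            if (replacement == null)
            {
                // Unchanged entry: reuse the header and stream the bytes across.
                tarOut.PutNextEntry(entry);
                tarIn.CopyEntryContents(tarOut);
            }
            else
            {
                // Replaced entry: the header size must match the new content.
                entry.Size = replacement.Length;
                tarOut.PutNextEntry(entry);
                tarOut.Write(replacement, 0, replacement.Length);
            }
            tarOut.CloseEntry();
        }
    }
}
```

Using the same encoding on both sides means entry names are decoded and re-encoded consistently, so names that are valid UTF-8 in the source archive should come out unchanged in the destination.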

thanks, @piksel bitworks and @Panagiotis Kanavos

edited Jan 22 '21 at 21:27

answered Jan 22 '21 at 16:06

Brett Caswell


The default isn't ASCII. The warning says that any non-ASCII characters will be lost, not that the default is ASCII. If anything, the default in .NET Core is UTF8, requiring an extra package to support other encodings. *All* encodings (except UTF16, UTF32) use the same byte values for the ASCII range of characters, so these are the only characters that can be preserved - *unless* the actual encoding is specified. – Panagiotis Kanavos Jan 22 '21 at 16:11
`I actually don't see the reason in obsoleting this ctor to begin with.` because ASCII isn't the default at all, and even English words can use diacritics, even in the US. These characters are outside the 7-bit US-ASCII range. The default encoding for text files in .NET Core is UTF8. The same for .NET Old - `StreamReader` uses UTF8 by default if no other encoding is specified. A decade ago it used the system's locale. ASCII was *never* used, as it can't even handle Latin1 – Panagiotis Kanavos Jan 22 '21 at 16:16
@PanagiotisKanavos I get what you're saying.. but it is just an interesting semantical point. I don't think this relates to .NET Core default encoding and that codebase seems to use helper methods like `GetAsciiBytes` in scopes of handling encoders. It also specifies passing `null` for Ascii only. So, I think it is still correct for me to say default is ASCII here. – Brett Caswell Jan 22 '21 at 16:22
On the contrary, [GetAsciiBytes](https://github.com/icsharpcode/SharpZipLib/blob/master/src/ICSharpCode.SharpZipLib/Tar/TarHeader.cs#L1072) without an explicit encoding is also obsolete. The entire .NET platform uses Unicode. So does Windows. So do Java, JavaScript, Go, Python. The only reason ASCII is mentioned in the TAR classes is that TAR predates Unicode and, like most formats created before 1990, expected a specific codepage (not ASCII, any single-byte codepage) to be used throughout the server. – Panagiotis Kanavos Jan 22 '21 at 16:47
So `GetAsciiBytes` remains only to avoid breaking existing code. It looks like there's a conscious design effort to *avoid* defaults altogether. The currently supported method could easily be renamed to `GetBytes` since the actual encoding has to be supplied explicitly – Panagiotis Kanavos Jan 22 '21 at 16:52
well, no, that's not correct. the design and obsolete concepts aren't moving away from defaults here, they are apparently moving away from overload methods and are instead passing `null` to represent default - conditionally checking and handling for that. one of the reasons I even mentioned lack of 'reasoning' around the obsolete of ctor here is to reference that point of confusion. it isn't being obsoleted for ASCII\non-ASCII reasonings.. it's being obsoleted because the authors are doing away with overloading methods in their design, while adding support for other encodings when specified – Brett Caswell Jan 22 '21 at 17:16
@PanagiotisKanavos you know.. that `null` handling did look a bit questionable after review. So, I put together a little sample of using `null`.. it certainly isn't 'ASCII' encoding properly (even if that was the intent, with para remarks).. [SharpZipLib TarHeader.GetAsciiBytes dotnetfiddle Test](https://dotnetfiddle.net/lATEXU) ...our convo did get into the weeds here a bit, but ultimately I agree that it is safer to just designate encodings here. – Brett Caswell Jan 22 '21 at 19:09
My code intends to read a tar.gz file and write a tar.gz file after replacing the content of a few files using buffered byte[] data from a MemoryStream, while dummy-copying the content of everything else via a buffered byte[] copy. So is it safe/intended to put null as the Encoding, or is something else better? – alik Jan 22 '21 at 20:20

SharpZipLib maintainer here. Yes, the obsoleted overloads are there to notify the consumer that they should specify an encoding for the Name field. The old behaviour of the library was to just cast bytes to/from wchars, which probably isn't what you want. If you don't know what encoding might have been used, the safest bet is often to specify UTF-8. This will probably be the default in the next major release (since it would be a breaking change). – piksel bitworks Jan 22 '21 at 21:02
