Read multiple files with different encoding, preserving all characters

Question

I am trying to read a text file and write its contents to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to get the input file's encoding (that seems like a lot of work)?
The following code reads ANSI file and writes output as UTF-8, but there are some gibberish characters ("�") in the result.

I am looking for a way to read the file no matter which of the two encodings it uses and write it correctly, without knowing the encoding of the input file beforehand.

File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv"));

Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved, so I'm looking to do the same thing in C#:

type ST60_0.csv inputUTF.csv > outputBASH.txt

`Do I need to get the input file's encoding` - no, you're supposed to *know* it. `The following code reads ANSI file` - that's [not what happens](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readalltext?view=net-5.0#System_IO_File_ReadAllText_System_String_) either, it "attempts to automatically detect the encoding of a file based on the presence of byte order marks". — GSerg, Aug 20 '21 at 21:27
How about [File.Copy](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.copy?view=net-5.0)? Or am I missing something? — Steeeve, Aug 20 '21 at 21:32
@Steeeve I need to append multiple input file to a output file. — Lightsout, Aug 20 '21 at 21:33
Without any processing? Do all input files have the same encoding? — Steeeve, Aug 20 '21 at 21:40
That is [not well defined](https://en.wikipedia.org/wiki/ANSI_character_set) either. If you meant they are ASCII (the first 128 ASCII), then you may treat them all [as UTF-8](https://stackoverflow.com/a/21297794/11683). Judging by the fact that you are getting question marks, they are not, so you are supposed to know and specify the encoding correctly when reading them. — GSerg, Aug 20 '21 at 21:45
If the utf8 encoded files don't have a BOM, simply open them all and concatenate them (maybe with [Stream.CopyTo](https://learn.microsoft.com/de-de/dotnet/api/system.io.stream.copyto?view=net-5.07)) to a new binary FileStream. — Steeeve, Aug 20 '21 at 21:47
@Steeeve There is [`File.AppendAllText`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.appendalltext?view=net-5.0) for that, but you still need to know the encoding. — GSerg, Aug 20 '21 at 21:49
@GSerg AppendAllText would involve some encoding, whereas CopyTo doesn't. But this would work only if there is no BOM in the utf8 encoded source files — Steeeve, Aug 20 '21 at 21:53
@Steeeve If you know the encoding, you can use `AppendAllText`. If you don't know the encoding, you can use neither `AppendAllText` nor `Stream.CopyTo`, because otherwise you may end up with a file different chunks of which use different encodings. — GSerg, Aug 20 '21 at 21:58
@GSerg ASCII encoding in the middle of an utf8 encoded file wouldn't make any difference. If all of the input files are utf8 or ascii encoded I don't see any problem. — Steeeve, Aug 20 '21 at 22:02
Sorry, my mistake! I have read ASCII instead of ANSI. Sorry for the noise... — Steeeve, Aug 20 '21 at 22:08
When you say *ANSI encoded*, what do you actually mean? ANSI is not an encoding. You probably mean *Local* Encoding (Encoding.Default, related to the current machine and language, or another Encoding that is using the Local CodePage of a specific machine that uses a specific language). Do you know what that is? Or these files can have any origin (anywhere in the World, any Language)? — Jimi, Aug 20 '21 at 22:08
Q: Does the file in question have a BOM? Q: Can you tell us the hex value of one of your "?" characters? Q: What do you believe the character was supposed to be in the original "ASCII?" text file? Please update your post with this information. — paulsm4, Aug 20 '21 at 22:11
It means Local Encoding (the Local CodePage), i.e. Encoding.Default. ANSI, per se, is not an Encoding; it's MSFT / Windows *jargon*. Sometimes it refers to CodePage 1252, sometimes (old style) to ISO 8859-1. Use `Encoding.GetEncoding()` and try one of these, besides `Encoding.Default`. — Jimi, Aug 20 '21 at 22:21
I may just end up calling batch. It blows my mind it's so hard to do something in C# that command prompt can do easy — Lightsout, Aug 20 '21 at 22:34

paulsm4 · Answer 1 · 2021-08-23T21:48:36.177

Q: The following code reads ANSI file and writes output as UTF-8 but there are some gibberish characters "�".

A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?

Q: It blows my mind it's so hard to do something in C# that command prompt can do easy

A: Typically, it IS easy. There seems to be "something special" written into this particular file.

The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".

ANYWAY:

If the "?" you posted (hex 0xEFBFBD) is a valid example of your actual text, this might explain what's going on:

https://stackoverflow.com/a/25510366/421195

... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.
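
To illustrate where EF BF BD can come from, here is a minimal sketch. It assumes the source file contains a single-byte "ANSI" character such as 0xBB that is then decoded as UTF-8; the class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

static class ReplacementCharDemo
{
    // Decodes the input as UTF-8, re-encodes the resulting string as UTF-8,
    // and returns the resulting bytes as a hex string.
    public static string RoundTripAsUtf8(byte[] input)
    {
        // Encoding.UTF8 does not throw on invalid bytes;
        // it substitutes U+FFFD (the replacement character) for them.
        string decoded = Encoding.UTF8.GetString(input);

        // Re-encoding writes whatever the decoder produced,
        // so the original byte is already lost at this point.
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        return BitConverter.ToString(reencoded);
    }
}
```

Calling `ReplacementCharDemo.RoundTripAsUtf8(new byte[] { 0xBB })` returns `"EF-BF-BD"`, because a lone 0xBB is not valid UTF-8. The valid two-byte sequence `{ 0xC2, 0xBB }` comes back unchanged as `"C2-BB"`.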

You might also be interested in this:

https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding

Best-fit fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.

UPDATE:

The offending character was "»", hex 0xc2bb. This is a "Right Angle Quote", a Guillemet. Angle quotes are the quotation marks used in certain languages with an otherwise roman alphabet, such as French.

One possible solution is to specify "iso-8859-1" when reading, instead of the default encoding, UTF-8:

File.WriteAllText(outputfile, File.ReadAllText(inputfilepath + @"\ST60_0.csv", System.Text.Encoding.GetEncoding("iso-8859-1")));
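
If some inputs are UTF-8 and others are not, one option is to try a strict UTF-8 decode first and fall back to a single-byte encoding only when that fails. The following is a rough sketch, not a definitive solution. It assumes every non-UTF-8 file is ISO-8859-1, and the names `MixedEncodingReader`, `Decode`, and `Concatenate` are hypothetical. The check is a heuristic: a single-byte file whose bytes happen to form valid UTF-8 would be read as UTF-8.

```csharp
using System;
using System.IO;
using System.Text;

static class MixedEncodingReader
{
    // Strict UTF-8: throws DecoderFallbackException on invalid bytes
    // instead of silently emitting U+FFFD.
    static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Assumption: anything that is not valid UTF-8 is ISO-8859-1.
    static readonly Encoding Fallback = Encoding.GetEncoding("iso-8859-1");

    public static string Decode(byte[] bytes)
    {
        try
        {
            string text = StrictUtf8.GetString(bytes);
            // GetString keeps a leading BOM as U+FEFF; drop it so it does not
            // end up in the middle of the concatenated output.
            return text.Length > 0 && text[0] == '\uFEFF' ? text.Substring(1) : text;
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so treat every byte as one ISO-8859-1 character.
            return Fallback.GetString(bytes);
        }
    }

    // Decodes each input with Decode and writes the combined text as UTF-8 without a BOM.
    public static void Concatenate(string outputPath, params string[] inputPaths)
    {
        var combined = new StringBuilder();
        foreach (string path in inputPaths)
        {
            combined.Append(Decode(File.ReadAllBytes(path)));
        }
        File.WriteAllText(outputPath, combined.ToString(), new UTF8Encoding(false));
    }
}
```

With this sketch, `MixedEncodingReader.Concatenate(outputfile, "ST60_0.csv", "inputUTF.csv")` would play the role of the `type ... > ...` command from the question, except that the output is UTF-8 rather than ANSI. If the files actually use Windows-1252 or another code page, the fallback encoding would need to change; on .NET Core and .NET 5+, code page 1252 additionally requires registering `CodePagesEncodingProvider.Instance` via `Encoding.RegisterProvider`.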

OK: So what is the corresponding character in the source file (ST60_0.csv)? If it's 0xEFBFBD, then that implies "corrupt input". What - if anything - can you do to correct how the .csv is written? Otherwise, if there's nothing you can do about the source file, you might consider the "Best-fit fallback" I cited above. — paulsm4, Aug 21 '21 at 03:19
@bakalolo Q: What did you find out about the source file (ST60_0.csv)? What are your thoughts? What are your plans? — paulsm4, Aug 21 '21 at 23:34
@bakalolo: I'm extremely disappointed we still don't know *WHY* the EF BF BD was occurring. Q: Is the problem in the source file? Q: Do you have any control over how the .csv is written? — paulsm4, Aug 23 '21 at 17:14
@bakalolo: Thank you. It's hex 0xc2bb. It's a "Right-angle quote"; a [guillemet](https://en.wikipedia.org/wiki/Guillemet). Please see my update above. And please feel free to upvote my reply, if you wish — paulsm4, Aug 23 '21 at 21:39
