I am trying to read a text file and write it to a new text file. The input file could be ANSI or UTF-8. I don't care what the output encoding is, but I want to preserve all characters when writing. How can I do this? Do I need to detect the input file's encoding (that seems like a lot of work)?
The following code reads an ANSI file and writes the output as UTF-8, but the output contains some gibberish characters: "�".

I am looking for a way to read the file no matter which of the two encodings it uses and write it correctly, without knowing the input file's encoding beforehand.

File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv"));

Note that this batch command reads a UTF-8 file and an ANSI file and writes the output as ANSI with all characters preserved, so I'm looking to do the same thing in C#:

type ST60_0.csv inputUTF.csv > outputBASH.txt
Lightsout
  • 2
    `Do I need to get the input file's encoding` - no, you're supposed to *know* it. `The following code reads ANSI file` - that's [not what happens](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.readalltext?view=net-5.0#System_IO_File_ReadAllText_System_String_) either, it "attempts to automatically detect the encoding of a file based on the presence of byte order marks". – GSerg Aug 20 '21 at 21:27
  • How about [File.Copy](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.copy?view=net-5.0)? Or am I missing something? – Steeeve Aug 20 '21 at 21:32
  • @Steeeve I need to append multiple input files to an output file. – Lightsout Aug 20 '21 at 21:33
  • Without any processing? Do all input files have the same encoding? – Steeeve Aug 20 '21 at 21:40
  • No, they don't; they could be ANSI or UTF-8. – Lightsout Aug 20 '21 at 21:41
  • 1
    That is [not well defined](https://en.wikipedia.org/wiki/ANSI_character_set) either. If you meant they are ASCII (the first 128 ASCII), then you may treat them all [as UTF-8](https://stackoverflow.com/a/21297794/11683). Judging by the fact that you are getting question marks, they are not, so you are supposed to know and specify the encoding correctly when reading them. – GSerg Aug 20 '21 at 21:45
  • If the UTF-8 encoded files don't have a BOM, simply open them all and concatenate them (maybe with [Stream.CopyTo](https://learn.microsoft.com/de-de/dotnet/api/system.io.stream.copyto?view=net-5.07)) to a new binary FileStream. – Steeeve Aug 20 '21 at 21:47
  • @Steeeve There is [`File.AppendAllText`](https://learn.microsoft.com/en-us/dotnet/api/system.io.file.appendalltext?view=net-5.0) for that, but you still need to know the encoding. – GSerg Aug 20 '21 at 21:49
  • @GSerg AppendAllText would involve some encoding, whereas CopyTo doesn't. But this would work only if there is no BOM in the utf8 encoded source files – Steeeve Aug 20 '21 at 21:53
  • @Steeeve If you know the encoding, you can use `AppendAllText`. If you don't know the encoding, you can use neither `AppendAllText` nor `Stream.CopyTo`, because otherwise you may end up with a file different chunks of which use different encodings. – GSerg Aug 20 '21 at 21:58
  • @GSerg ASCII encoding in the middle of an utf8 encoded file wouldn't make any difference. If all of the input files are utf8 or ascii encoded I don't see any problem. – Steeeve Aug 20 '21 at 22:02
  • Sorry, my mistake! I have read ASCII instead of ANSI. Sorry for the noise... – Steeeve Aug 20 '21 at 22:08
  • When you say *ANSI encoded*, what do you actually mean? ANSI is not an encoding. You probably mean *Local* Encoding (Encoding.Default, related to the current machine and language, or another Encoding that is using the Local CodePage of a specific machine that uses a specific language). Do you know what that is? Or these files can have any origin (anywhere in the World, any Language)? – Jimi Aug 20 '21 at 22:08
  • 2
    Q: Does the file in question have a BOM? Q: Can you tell us the hex value of one of your "?" characters? Q: What do you believe the character was supposed to be in the original "ASCII?" text file? Please update your post with this information. – paulsm4 Aug 20 '21 at 22:11
  • @Jimi ANSI according to notepad/notepad++ – Lightsout Aug 20 '21 at 22:14
  • That means Local Encoding (the local code page), i.e. Encoding.Default. ANSI, per se, is not an encoding; it's MSFT / Windows *jargon*. Sometimes it refers to code page 1252, sometimes (old style) to ISO 8859-1. Use `Encoding.GetEncoding()` and try one of these, besides `Encoding.Default`. – Jimi Aug 20 '21 at 22:21
  • I may just end up calling batch. It blows my mind that it's so hard to do something in C# that the command prompt can do easily. – Lightsout Aug 20 '21 at 22:34

1 Answer

Q: The following code reads an ANSI file and writes the output as UTF-8, but the output contains some gibberish characters: "�".

A: It would definitely be useful to see the hex values of some of these "gibberish" characters. Perhaps you could install a Hex plugin to Notepad++ and tell us?

Q: It blows my mind that it's so hard to do something in C# that the command prompt can do easily.

A: Typically, it IS easy. There seems to be "something special" written into this particular file.

The difference between C# and other, "simpler" approaches is that C# (unlike C character I/O or .bat files) gives you the flexibility to deal with text that doesn't happen to be "standard ASCII".

ANYWAY:

If the "�" you posted (bytes EF BF BD) is a valid example of your actual text, this might explain what's going on:

https://stackoverflow.com/a/25510366/421195

... %EF%BF%BD is the url-encoded version of the hex representation of the 3 bytes (EF BF BD) of the UTF-8 replacement character.

See also:

https://en.wikipedia.org/wiki/Specials_(Unicode_block)

The Replacement character � (often displayed as a black rhombus with a white question mark) is a symbol found in the Unicode standard at code point U+FFFD in the Specials table. It is used to indicate problems when a system is unable to render a stream of data to a correct symbol. It is usually seen when the data is invalid and does not match any character.
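
As a rough illustration of how those three bytes can end up in an output file, the sketch below decodes a byte that is not valid UTF-8 on its own and then re-encodes the result. The class and method names (`ReplacementDemo`, `RoundTripHex`) are hypothetical, and the single byte 0xBB is only an assumed example of a byte that is legal in a one-byte "ANSI" code page but not in UTF-8.

```csharp
using System;
using System.Text;

static class ReplacementDemo
{
    // Decode the bytes as UTF-8 (the default, lenient decoder), re-encode
    // the resulting string as UTF-8, and return the bytes as a hex string.
    public static string RoundTripHex(byte[] input)
    {
        // Invalid sequences are silently replaced with U+FFFD here.
        string decoded = Encoding.UTF8.GetString(input);

        // U+FFFD itself is a valid character, so it is written out as EF BF BD.
        byte[] reencoded = Encoding.UTF8.GetBytes(decoded);
        return BitConverter.ToString(reencoded);
    }
}
```

Calling `ReplacementDemo.RoundTripHex(new byte[] { 0xBB })` returns `"EF-BF-BD"`: the original byte is lost at decode time, which is why inspecting the source file rather than the output is the useful step.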

You might also be interested in this:

https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-encoding

Best-Fit Fallback: When a character does not have an exact match in the target encoding, the encoder can try to map it to a similar character.


UPDATE:

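The offending character was "»" (U+00BB), a right-pointing double angle quotation mark, or guillemet. In UTF-8 it is encoded as the two bytes C2 BB; in an "ANSI" file (Windows-1252 or ISO-8859-1) it is the single byte BB, which is not valid UTF-8 on its own and is therefore decoded as the replacement character. Angle quotes are the quotation marks used in certain languages with an otherwise Roman alphabet, such as French.

One possible solution for the ANSI file is to specify "iso-8859-1" when reading, instead of the default encoding "UTF-8". Note that a UTF-8 file read this way would be mis-decoded, so this only helps when the file is known to be ANSI: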

File.WriteAllText(outputfile,File.ReadAllText(inputfilepath + @"\ST60_0.csv",  System.Text.Encoding.GetEncoding("iso-8859-1")));
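
If the inputs are a mix of UTF-8 and ANSI files, one possible approach (a heuristic sketch, not something confirmed in this thread) is to try a strict UTF-8 decode first and fall back to the legacy encoding only when that fails. The names `EncodingFallback` and `Decode` are hypothetical, and the sketch assumes "ANSI" means ISO-8859-1. If the files are really Windows-1252, `Encoding.GetEncoding(1252)` could be substituted, which on .NET Core / .NET 5+ requires registering `CodePagesEncodingProvider.Instance` first.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingFallback
{
    // Strict UTF-8: throws on invalid bytes instead of emitting U+FFFD.
    static readonly Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // ASSUMPTION: the non-UTF-8 files are ISO-8859-1 (Latin-1).
    static readonly Encoding Legacy = Encoding.GetEncoding("iso-8859-1");

    public static string Decode(byte[] bytes)
    {
        // Skip a UTF-8 byte order mark if one is present.
        bool hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        int offset = hasBom ? 3 : 0;

        try
        {
            return StrictUtf8.GetString(bytes, offset, bytes.Length - offset);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so treat the whole file as the legacy encoding.
            return Legacy.GetString(bytes);
        }
    }
}
```

Appending several inputs to one output could then look like `File.AppendAllText(outputfile, EncodingFallback.Decode(File.ReadAllBytes(path)), Encoding.UTF8);` for each input path. The heuristic can misfire on a legacy file whose bytes happen to form valid UTF-8, though that is uncommon for ordinary text.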
paulsm4
  • Looks like the char is EF BF BD – Lightsout Aug 21 '21 at 00:02
  • OK: So what is the corresponding character in the source file (ST60_0.csv)? If it's 0xEFBFBD, then that implies "corrupt input". What - if anything - can you do to correct how the .csv is written? Otherwise, if there's nothing you can do about the source file, you might consider the "Best-fit fallback" I cited above. – paulsm4 Aug 21 '21 at 03:19
  • @bakalolo Q: What did you find out about the source file (ST60_0.csv)? What are your thoughts? What are your plans? – paulsm4 Aug 21 '21 at 23:34
  • I gave up and just called the bash command from C#. – Lightsout Aug 23 '21 at 09:38
  • @bakalolo: I'm extremely disappointed we still don't know *WHY* the EF BF BD was occurring. Q: Is the problem in the source file? Q: Do you have any control over how the .csv is written? – paulsm4 Aug 23 '21 at 17:14
  • » was the char; I don't know the hex. – Lightsout Aug 23 '21 at 20:04
  • @bakalolo: Thank you. It's hex 0xc2bb in UTF-8. It's a "right-angle quote", a [guillemet](https://en.wikipedia.org/wiki/Guillemet). Please see my update above. And please feel free to upvote my reply, if you wish. – paulsm4 Aug 23 '21 at 21:39