
I have an MFC project which reads from and writes to ANSI files. The character set of the application is set to Unicode.

Addendum
I cannot change or influence the encoding of either the input or the output file, because this is a converter between pieces of legacy software. The expected character encoding is actually Windows-1252.

When reading and writing some files, I noticed that some rarely used characters like Š (0x8A) get replaced by ? (0x3F) when read and written with CStdioFile. I created a test file to see which characters in the range 0x30 to 0xFF are affected.

I copied those characters (0x30 to 0xFF) into an ANSI-encoded test file:

Input file structure interpreted by Beyond Compare

And the resultant file looked like this:

Output file structure interpreted by Beyond Compare

The changed characters all lie in the same region, from 0x80 to 0x9F, and are all replaced by 0x3F ('?'). Strangely enough, there are some exceptions, like 0x81, 0x8D, 0x90 and 0x9D, which were not affected.

Example Code to test the behaviour:

//prepare vars
CFileException fileException;
CStdioFile filei;
CStdioFile fileo;
CString strText;

//open input file
filei.Open(TEXT("test.txt"), CFile::modeRead | CFile::shareExclusive | CFile::typeText, &fileException);

//open output file
fileo.Open(TEXT("testout.txt"), CFile::modeCreate | CFile::modeWrite | CFile::shareExclusive | CFile::typeText, &fileException);

//read each line and write it back; the CStringA conversion is where the characters get lost
while (filei.ReadString(strText))
{
    CStringA strTextA(strText);
    fileo.Write(strTextA, strTextA.GetLength());
}

//clean up
filei.Close();
fileo.Close();

Why does this happen, and what would I need to do to preserve all characters?

Disabling Unicode mode would fix the issue, but is unfortunately not an option in my case.


Summary
Here's an extract of the things that were useful for me from the accepted answer:

Don't convert from CStringW to CStringA by just calling its constructor. When converting from Unicode to "ANSI" (Windows-1252), use CW2A:

CStringA strTextA(CW2A(strText, CP_ACP)); // CP_ACP converts to the system ANSI code page
fileo.Write(strTextA, strTextA.GetLength());

Even simpler: use the CStdioFile::WriteString method instead of CStdioFile::Write:

fileo.Open(TEXT("testout.txt"), CFile::modeCreate | CFile::modeWrite | CFile::shareExclusive | CFile::typeText, &fileException);
fileo.WriteString(strText);
Marwie

  • ANSI encoding is based on codepages. Unless you know the codepage used for encoding the file, and that codepage happens to be active when reading the file back in, you cannot preserve the characters. To avoid any confusion, serialized string streams should be encoded using UTF-8. – IInspectable Nov 27 '15 at 23:47
  • @IInspectable I'm aware of that - but it is to be expected that they are based on western europe codepage. And I actually just want to save the file in the same codepage as loaded. I have no control on the codepage the files come in. – Marwie Nov 27 '15 at 23:51
  • 1
    If you don't have any control over the codepage used to write the file, it's all the more reason to demand UTF-8. This allows you to discard illegal input by verifying whether it conforms to UTF-8. Not possible with ANSI encoding. The data may just be toast when it leaves/enters your application. – IInspectable Nov 27 '15 at 23:57
  • @Marwie : `Š` ... western europe. That does not compute. S caron is used in Eastern Europe (Czech/Croatian) – MSalters Nov 28 '15 at 01:07
  • @MSalters Just to clarify: when I talk about ANSI, it is the ANSI which is shown to me on Windows machines inside the text files analysed by Beyond Compare or Notepad++. After some research I strongly believe that we are talking about an actual [Windows-1252 character encoding](https://en.wikipedia.org/wiki/Windows-1252) here, and it definitely contains S caron in that set. – Marwie Nov 28 '15 at 09:30
  • @IInspectable I'm not in a position to demand a Unicode file - we are talking about legacy software. My application is just a simple file conversion mechanism which takes a file with some text of defined length and outputs some of the fields in the same encoding. The encoding will however always remain the same, and in this case I had an example with S caron which didn't work out for a reason mentioned in the comments of Andrew's answer. – Marwie Nov 28 '15 at 09:34

1 Answer


The problem is that if you use the default CStdioFile::Open method, CStdioFile can only read and write ANSI files. You can, however, open the file stream yourself and then specify the correct encoding:

FILE* fStream = NULL;
errno_t e = _tfopen_s(&fStream, _T("C:\\Files\\test.txt"), _T("rt,ccs=UNICODE"));
if (e != 0) 
    return; // failed to open file 
CStdioFile f(fStream);  
CString sRead;
f.ReadString(sRead);
f.Close();

If you'd like to write a file, you need to use the _T("wt,ccs=UNICODE") set of options.

The other obvious problem is that you call Write instead of WriteString. With WriteString there is no need to convert CStringW to CStringA. If you need to use Write for some reason, you'll have to convert CStringW to CStringA properly by calling CW2A() with CP_UTF8.

Here is sample code that uses the general-purpose CFile class and Write instead of CStdioFile and WriteString:

CStringW sText = L"Привет мир";

CFile file(_T("C:\\Files\\test.txt"), CFile::modeWrite | CFile::modeCreate);

// construct the CStringA explicitly; assigning CW2A to CStringA directly fails to compile
CStringA sUTF8(CW2A(sText, CP_UTF8));
file.Write(sUTF8, sUTF8.GetLength());

Please keep in mind that the CFile constructor that opens the file and the Write method both throw exceptions of type CFileException, so you should handle them.

Use the following options when opening text file streams to specify the type of encoding:

  • "ccs=UNICODE" corresponds to UTF-16 (Big endian)
  • "ccs=UTF-8" corresponds to UTF-8
  • "ccs=UTF-16LE" corresponds to UTF-16LE (Little endian)
Andrew Komiagin
  • I actually tried this method when discovering the error. It has exactly the same issue - beware that both files are ANSI files. BTW, setting the `"ccs=ANSI"` will raise an error and according to the [MSDN](https://msdn.microsoft.com/en-us/library/z5hh6ee9.aspx) you have to omit specifying the encoding to read ANSI. – Marwie Nov 27 '15 at 19:23
  • The other problem is that you convert to `CStringA` using default ASCII charset and then call `Write()` instead of `WriteString()`. The set of symbols like {|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™ are definitely out of ASCII charset. More details in my updated answer. – Andrew Komiagin Nov 27 '15 at 20:12
  • Ahh - you are absolutely right about the CStringA conversion - that was the evil thing. If you think about it, it's quite stupid to believe that by just cutting away a byte you'll get ANSI out of it, isn't it? :-) `WriteString` was the solution. So in the end it has nothing to do with the way to generate CStdioFile with the stream already open, but only with the conversion. Would you mind highlighting this as your primary answer? Just for completeness: I wasn't able to get a running example up with CW2A (same issue as in my question) - how would I need to use it with `Write` afterwards? – Marwie Nov 27 '15 at 23:07
  • I tried the CW2A example and got the following warning `no suitable user-defined conversion from "ATL::CW2A" to "CStringA" exists` – Marwie Nov 30 '15 at 08:42
  • 1
    If you are using old version of VS and ATL then you need to call `USES_CONVERSION` – Andrew Komiagin Nov 30 '15 at 08:50
  • It's actually a VS2015 - I changed the line to: `CStringA correctEncoding(CW2A(strText, CP_ACP));` which works now. – Marwie Nov 30 '15 at 08:56