I'm getting the exception below when trying to load a file using the TStringList.LoadFromFile method:

stringlist1.loadfromfile('c:\example.txt');

No mapping for the Unicode character exists in the target multi-byte code page

The file is Unicode, and the error seems to be related to this special character that exists in the file. The example.txt file has only one line, and its content is exactly as shown below:

Ze 🇫🇮

The file contains these bytes:

EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE

Any workarounds?

Remy Lebeau
delphirules
  • 1
    How are you calling `LoadFromFile`? Please include the actual code, and preferably an actual text file. You can use a hex editor to view its actual data, and post it here as a block of preformatted text. – Andreas Rejbrand Aug 18 '20 at 12:51
  • I updated the question. The 'example.txt' file has only one line, exactly as I show in the question. – delphirules Aug 18 '20 at 12:56
  • 2
    That is not enough. The actual bytes of the text file will look entirely different depending on the encoding: UTF8, UTF16LE, UTF16BE, UTF32LE, or UTF32BE, and in each case, with or without BOM (so 10 different versions!). All these different text files will look identical when you open them in a text editor. – Andreas Rejbrand Aug 18 '20 at 12:57
  • I just uploaded the file and edited the question. – delphirules Aug 18 '20 at 13:23
  • One of the requirements of Stack Overflow is, that questions are self contained, meaning no links to any external sites or resources, that may a) disappear at any time, b) be harmful. So, once more, please edit your question to include the hex representation of the bytes in the file. – Tom Brunberg Aug 18 '20 at 13:41
  • Maybe you don't have access to a hex editor, so here are the bytes: `EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE`. **`EF BB BF`** is the [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark) for a [UTF-8](https://en.wikipedia.org/wiki/UTF-8)-encoded text file. **`5A 65 20`** is simply `Ze `. Then it gets somewhat more complicated. We are well beyond the BMP here. – Andreas Rejbrand Aug 18 '20 at 13:43
  • I'm not 100% sure, because it was too long ago I studied UTF, but I suspect this is UTF-8 with surrogate pairs. Surrogate pairs belong in UTF-16. Try `EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE` instead. In any case, the intended characters are [the regional indicator symbols F and I](https://en.wikipedia.org/wiki/Regional_Indicator_Symbol) which when combined apparently are supposed to render a Finnish flag. Unicode sure has become a complicated topic... – Andreas Rejbrand Aug 18 '20 at 14:04
  • @delphirules How exactly is `example.txt` being produced? It was created with a bad encoding, which is why you are having problems loading it. See my answer for more details. – Remy Lebeau Aug 18 '20 at 15:52
  • The customer said they used a text editor called Editpad Lite to create the file. So if it's a bad file, i won't mind. Thanks for clearing this ! – delphirules Aug 18 '20 at 17:59

2 Answers

Your file claims to be encoded as UTF-8, as evidenced by the first 3 bytes EF BB BF, which are the UTF-8 BOM.

In Delphi 2009+, String is a UTF-16 encoded Unicode string, so LoadFromFile() will see the BOM and try to decode the file bytes from UTF-8 to Unicode, then encode that Unicode data to UTF-16 in memory.

However, after the BOM, the next 3 bytes 5A 65 20 are proper UTF-8, but the rest of your file after that is NOT proper UTF-8. That is why you are getting the exception.

The correct byte sequence for the characters you have shown should look like the following:

EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE

But your file contains these bytes instead:

EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE

As you can see, the byte sequence F0 9F 87 AB F0 9F 87 AE in the correct file has been mis-encoded as ED A0 BC ED B7 AB ED A0 BC ED B7 AE in your bad file.
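The difference between the two byte sequences can be checked with any strict UTF-8 decoder. The sketch below uses Python rather than Delphi, purely as a language-neutral illustration; `is_strict_utf8` is a hypothetical helper name, and the two byte strings are the ones quoted above.

```python
# Byte sequences copied from the answer above.
GOOD = bytes.fromhex("EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE")
BAD = bytes.fromhex("EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE")


def is_strict_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        # Python's default "strict" error handler rejects surrogates
        # that were encoded as 3-byte UTF-8 sequences.
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    print("good file is valid UTF-8:", is_strict_utf8(GOOD))
    print("bad file is valid UTF-8: ", is_strict_utf8(BAD))
```

A strict decoder accepts the first sequence and rejects the second, which is consistent with the exception raised by LoadFromFile().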

When processed as UTF-8, the good file decodes to the following Unicode codepoint sequence:

U+005A LATIN CAPITAL LETTER Z
U+0065 LATIN SMALL LETTER E
U+0020 SPACE
U+1F1EB REGIONAL INDICATOR SYMBOL LETTER F
U+1F1EE REGIONAL INDICATOR SYMBOL LETTER I

Whereas your bad file decodes to the following sequence instead:

U+005A LATIN CAPITAL LETTER Z
U+0065 LATIN SMALL LETTER E
U+0020 SPACE
U+D83C HIGH SURROGATE - invalid!
U+DDEB LOW SURROGATE - invalid!
U+D83C HIGH SURROGATE - invalid!
U+DDEE LOW SURROGATE - invalid!
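The two codepoint listings above can be reproduced with a lenient decoder. This is a minimal Python sketch (not Delphi), assuming the byte sequences quoted in this answer; `code_points` is a hypothetical helper, and the `surrogatepass` error handler is used only so that the invalid surrogates become visible instead of raising an error.

```python
BOM = b"\xef\xbb\xbf"
GOOD = bytes.fromhex("EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE")
BAD = bytes.fromhex("EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE")


def code_points(data: bytes) -> list:
    """List the codepoints a lenient UTF-8 decoder sees, skipping the BOM."""
    if data.startswith(BOM):
        data = data[len(BOM):]
    # "surrogatepass" lets U+D800..U+DFFF through instead of failing,
    # which a conforming strict decoder would not do.
    text = data.decode("utf-8", errors="surrogatepass")
    return ["U+%04X" % ord(ch) for ch in text]


if __name__ == "__main__":
    print("good:", " ".join(code_points(GOOD)))
    print("bad: ", " ".join(code_points(BAD)))
```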

Now, it happens that D83C DDEB D83C DDEE is the proper UTF-16 encoded form of Unicode codepoints U+1F1EB U+1F1EE. This means that your original Unicode text was encoded to UTF-16 first, then the individual UTF-16 code units were incorrectly treated as-is as Unicode codepoints (which they are not) and were then encoded accordingly to UTF-8, thus producing your bad file.
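That hypothesis can be tested by deliberately repeating the mistake. The Python sketch below (an illustration, not the code of whatever tool produced the file) encodes the text to UTF-16, then encodes each 16-bit code unit to UTF-8 as if it were a codepoint; `misencode` is a hypothetical name.

```python
import struct

BOM = b"\xef\xbb\xbf"
BAD = bytes.fromhex("EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE")


def misencode(text: str) -> bytes:
    """Reproduce the bug: UTF-8-encode each UTF-16 code unit separately."""
    utf16 = text.encode("utf-16-le")
    # Split the UTF-16 data into its 16-bit code units.
    units = struct.unpack("<%dH" % (len(utf16) // 2), utf16)
    # Treat every code unit as a codepoint; "surrogatepass" is needed
    # because a correct encoder refuses to emit surrogates.
    body = b"".join(
        chr(unit).encode("utf-8", errors="surrogatepass") for unit in units
    )
    return BOM + body


if __name__ == "__main__":
    produced = misencode("Ze \U0001F1EB\U0001F1EE")
    print(produced.hex(" ").upper())
    print("matches the bad file:", produced == BAD)
```

The output matches the bytes in the bad file exactly, which supports the explanation above.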

If this is the only file affected, then you can simply replace its bytes with the bytes shown above. But if this is part of a larger encoding process that is producing badly encoded UTF-8 files that you can't load afterwards, then you need to figure out where that incorrect UTF-16 handling is occurring and fix that issue.
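If many files are affected and the producer cannot be fixed right away, the damage is mechanical enough to be undone before loading. The following is a rough Python sketch of one possible repair, assuming every surrogate in the file is part of a valid high/low pair; `repair_surrogate_utf8` is a hypothetical helper, and the same idea would have to be ported to Delphi to be used ahead of LoadFromFile().

```python
BOM = b"\xef\xbb\xbf"
GOOD = bytes.fromhex("EF BB BF 5A 65 20 F0 9F 87 AB F0 9F 87 AE")
BAD = bytes.fromhex("EF BB BF 5A 65 20 ED A0 BC ED B7 AB ED A0 BC ED B7 AE")


def repair_surrogate_utf8(data: bytes) -> bytes:
    """Rewrite UTF-8 that contains encoded surrogate pairs as proper UTF-8."""
    has_bom = data.startswith(BOM)
    body = data[len(BOM):] if has_bom else data
    # Step 1: decode leniently so the surrogates survive as codepoints.
    text = body.decode("utf-8", errors="surrogatepass")
    # Step 2: round-trip through UTF-16 so each high/low pair is merged
    # into one codepoint. The strict decode raises UnicodeDecodeError
    # if an unpaired surrogate is present.
    text = text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")
    # Step 3: encode strictly; this now yields 4-byte sequences.
    fixed = text.encode("utf-8")
    return BOM + fixed if has_bom else fixed


if __name__ == "__main__":
    print(repair_surrogate_utf8(BAD).hex(" ").upper())
```

Input that is already well-formed passes through unchanged, so the repair step is safe to apply to a mix of good and bad files.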

Remy Lebeau

You could try passing an encoding (for example, UTF-8) as the second parameter to LoadFromFile. I am not sure whether this solves your problem, but it might.

Zoedingl
  • By itself, this will not solve the problem. LoadFromFile() will detect the BOM and decode as UTF-8 automatically, but the file is malformed to begin with, so decoding as UTF-8 won't help until that malformed encoding issue is solved elsewhere. – Remy Lebeau Aug 18 '20 at 18:18