
It is known that the standard library of C++11 makes it easy to convert a string from UTF-8 encoding to UTF-16. However, the following code converts invalid UTF-8 input successfully (at least under MSVC2010):

#include <codecvt>
#include <cstdio>   // for printf
#include <locale>
#include <string>

int main() {
    // U+A397, U+0A01, and the ill-formed sequence ED AE 8D (a surrogate, U+DB8D)
    std::string input = "\xEA\x8E\x97" "\xE0\xA8\x81" "\xED\xAE\x8D";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    try {
        std::u16string output = converter.from_bytes(input.data());
        printf("Converted successfully\n");
    }
    catch (const std::exception &e) {
        printf("Error: %s\n", e.what());
    }
}

The string here contains 9 bytes encoding 3 code points. The last code point is 0xDB8D, which is invalid because it falls into the surrogate range (U+D800..U+DFFF).
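
For clarity, here is how the last three bytes decode by hand (a small standalone snippet, separate from the converter code above, shown only to illustrate why the sequence is ill-formed):

#include <cstdio>

int main() {
    // Manual decode of ED AE 8D using the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx:
    unsigned cp = ((0xEDu & 0x0Fu) << 12) | ((0xAEu & 0x3Fu) << 6) | (0x8Du & 0x3Fu);
    printf("U+%04X\n", cp);  // prints U+DB8D, which lies in the surrogate range U+D800..U+DFFF
}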

Is it possible to check a UTF-8 string for strict validity using only the standard library of modern C++? By that I mean that all of the invalid cases described in the Wikipedia article are rejected.

stgatilov
  • Sure, you can always write code in modern C++ that does what you ask for. – Kerrek SB Jan 14 '17 at 17:31
  • @KerrekSB: Thank you for the suggestion =) I hope to find some easy way like `converter.is_valid(std::string("..."))`. – stgatilov Jan 14 '17 at 17:36
  • I think a better question is, is it possible to get a conversion where invalid input bytes result in some specified replacement? That would be useful. Getting an exception isn't useful to me (I can imagine that someone thinks it's useful, since it is the behavior, but really, translating 4 GB of text and losing it all because of a little problem with the last byte is not useful to me). – Cheers and hth. - Alf Jan 14 '17 at 17:38
  • Doesn't `wstring_convert` throw on error? – Kerrek SB Jan 14 '17 at 17:38
  • @KerrekSB: Yes, it throws on error. But it does *not* always report an error on invalid UTF-8 input. At least on my compiler. You can run it and see for yourself. – stgatilov Jan 14 '17 at 17:41
  • @stgatilov: That's probably a QoI issue, or even a library implementation bug. – Kerrek SB Jan 14 '17 at 17:45
  • @KerrekSB: Well, it also converts the string successfully on [ideone](http://ideone.com/hRl7wA) (on GCC 5.1). Moreover, when I convert the result back, it also succeeds, but I see only 7 bytes as the result. – stgatilov Jan 14 '17 at 17:53
  • I'm afraid that you need to explicitly use something like the [`Char.IsLowSurrogate()`, `Char.IsHighSurrogate()` and/or `Char.IsSurrogatePair()` methods](https://msdn.microsoft.com/en-us/library/xcwwfbb8(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp#code-snippet-1) to check string validity. Unfortunately, I don't know their equivalents in a non-`.NET` environment. – JosefZ Jan 15 '17 at 11:02

1 Answer


The official UTF-8 specification, RFC 3629 (https://www.ietf.org/rfc/rfc3629.txt), gives the rules to check:

  • The first octet of a multi-octet sequence indicates the number of octets in the sequence, so you can verify that each sequence has the correct length.
  • The octet values C0, C1 and F5 to FF never appear, so you can check that none of them occurs in the UTF-8 string (see the sketch after this list).
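
These two points come straight from the RFC's byte-range table; the same table also excludes overlong forms, surrogates and values above U+10FFFF by restricting the allowed range of the second byte. Below is a minimal sketch of a validator along those lines; the function name is just illustrative, and it is an illustration of the RFC 3629 rules rather than code from the answer:

#include <cstddef>
#include <string>

// Strict UTF-8 validation following the byte-range table of RFC 3629.
// Rejects C0/C1/F5..FF, stray continuation bytes, truncated sequences,
// overlong encodings, surrogates (ED A0..BF ..) and values above U+10FFFF.
bool is_valid_utf8(const std::string &s) {
    const unsigned char *p   = reinterpret_cast<const unsigned char *>(s.data());
    const unsigned char *end = p + s.size();
    while (p < end) {
        unsigned char b = *p;
        std::size_t len = 1;
        unsigned char lo = 0x80, hi = 0xBF;          // allowed range of the second byte
        if      (b <= 0x7F)              { ++p; continue; }        // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }   // forbid overlongs
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; }   // forbid surrogates
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }   // forbid overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }   // forbid > U+10FFFF
        else return false;               // C0, C1, F5..FF or a stray continuation byte
        if (end - p < static_cast<std::ptrdiff_t>(len)) return false;   // truncated
        if (p[1] < lo || p[1] > hi) return false;
        for (std::size_t i = 2; i < len; ++i)
            if (p[i] < 0x80 || p[i] > 0xBF) return false;
        p += len;
    }
    return true;
}

For the string from the question this returns false, because 0xED may only be followed by a byte in the range 80..9F, while the input continues with 0xAE.
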
Mechi
  • Well, there are many good ways to decode UTF-8 with full validation. My favorite is the [state-machine based code by Bjoern Hoehrmann](https://bjoern.hoehrmann.de/utf-8/decoder/dfa/). The question is whether the C++ standard library includes anything that does it out of the box, and why the C++ feature that should do it does not work correctly. – stgatilov Oct 03 '22 at 18:53