
It is known that the standard library of C++11 makes it easy to convert a string from UTF-8 encoding to UTF-16. However, the following code converts invalid UTF-8 input successfully (at least under MSVC2010):

#include <codecvt>
#include <cstdio>   // for printf
#include <locale>
#include <string>

int main() {
    // U+A397, U+0A01, and the ill-formed sequence ED AE 8D (a surrogate, U+DB8D)
    std::string input = "\xEA\x8E\x97" "\xE0\xA8\x81" "\xED\xAE\x8D";
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
    try {
        std::u16string output = converter.from_bytes(input.data());
        printf("Converted successfully\n");
    }
    catch (const std::exception &e) {
        printf("Error: %s\n", e.what());
    }
}

The string here contains 9 bytes encoding 3 code points. The last code point is 0xDB8D, which is invalid because it falls into the surrogate range (U+D800..U+DFFF).
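
For clarity, here is how the last three bytes decode by hand (a small standalone snippet, separate from the converter code above, shown only to illustrate why the sequence is ill-formed):

#include <cstdio>

int main() {
    // Manual decode of ED AE 8D using the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx:
    unsigned cp = ((0xEDu & 0x0Fu) << 12) | ((0xAEu & 0x3Fu) << 6) | (0x8Du & 0x3Fu);
    printf("U+%04X\n", cp);  // prints U+DB8D, which lies in the surrogate range U+D800..U+DFFF
}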

Is it possible to check a UTF-8 string for strict validity using only the standard library of modern C++? By that I mean that all of the invalid cases described in the Wikipedia article are rejected.

stgatilov
  • Sure, you can always write code in modern C++ that does what you ask for. – Kerrek SB Jan 14 '17 at 17:31
  • @KerrekSB: Thank you for the suggestion =) I hope to find some easy way like `converter.is_valid(std::string("..."))`. – stgatilov Jan 14 '17 at 17:36
  • I think a better question is, is it possible to get a conversion where invalid input bytes result in some specified replacement? That would be useful. Getting an exception isn't useful to me (I can imagine that someone thinks it's useful, since it is the behavior, but really, translating 4 GB of text and losing it all because of a little problem with the last byte is not useful to me). – Cheers and hth. - Alf Jan 14 '17 at 17:38
  • Doesn't `wstring_convert` throw on error? – Kerrek SB Jan 14 '17 at 17:38
  • @KerrekSB: Yes, it throws on error. But it does *not* always report an error on invalid UTF-8 input. At least on my compiler. You can run it and see for yourself. – stgatilov Jan 14 '17 at 17:41
  • @stgatilov: That's probably a QoI issue, or even a library implementation bug. – Kerrek SB Jan 14 '17 at 17:45
  • @KerrekSB: Well, it also converts the string successfully on [ideone](http://ideone.com/hRl7wA) (on GCC 5.1). Moreover, when I convert the result back, it also succeeds, but I see only 7 bytes as the result. – stgatilov Jan 14 '17 at 17:53
  • I'm afraid that you need to explicitly use something like the [`Char.IsLowSurrogate()`, `Char.IsHighSurrogate()` and/or `Char.IsSurrogatePair()` methods](https://msdn.microsoft.com/en-us/library/xcwwfbb8(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp#code-snippet-1) to check string validity. Unfortunately, I don't know their equivalents in a non-`.NET` environment. – JosefZ Jan 15 '17 at 11:02

1 Answer


The official UTF-8 specification, RFC 3629 (https://www.ietf.org/rfc/rfc3629.txt), gives the rules to check:

  • The first octet of a multi-octet sequence indicates the number of octets in the sequence, so you can verify that each sequence has the correct length.
  • The octet values C0, C1 and F5 to FF never appear, so you can check that none of them occurs in the UTF-8 string (see the sketch after this list).
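
These two points come straight from the RFC's byte-range table; the same table also excludes overlong forms, surrogates and values above U+10FFFF by restricting the allowed range of the second byte. Below is a minimal sketch of a validator along those lines; the function name is just illustrative, and it is an illustration of the RFC 3629 rules rather than code from the answer:

#include <cstddef>
#include <string>

// Strict UTF-8 validation following the byte-range table of RFC 3629.
// Rejects C0/C1/F5..FF, stray continuation bytes, truncated sequences,
// overlong encodings, surrogates (ED A0..BF ..) and values above U+10FFFF.
bool is_valid_utf8(const std::string &s) {
    const unsigned char *p   = reinterpret_cast<const unsigned char *>(s.data());
    const unsigned char *end = p + s.size();
    while (p < end) {
        unsigned char b = *p;
        std::size_t len = 1;
        unsigned char lo = 0x80, hi = 0xBF;          // allowed range of the second byte
        if      (b <= 0x7F)              { ++p; continue; }        // ASCII
        else if (b >= 0xC2 && b <= 0xDF) { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }   // forbid overlongs
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; }   // forbid surrogates
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; }   // forbid overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }   // forbid > U+10FFFF
        else return false;               // C0, C1, F5..FF or a stray continuation byte
        if (end - p < static_cast<std::ptrdiff_t>(len)) return false;   // truncated
        if (p[1] < lo || p[1] > hi) return false;
        for (std::size_t i = 2; i < len; ++i)
            if (p[i] < 0x80 || p[i] > 0xBF) return false;
        p += len;
    }
    return true;
}

For the string from the question this returns false, because 0xED may only be followed by a byte in the range 80..9F, while the input continues with 0xAE.
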
Mechi
  • Well, there are many good ways to decode UTF-8 with full validation. My favorite is the [state-machine based code by Bjoern Hoehrmann](https://bjoern.hoehrmann.de/utf-8/decoder/dfa/). The question is whether the C++ standard library includes anything that does it out of the box, and why the C++ feature that should do it does not work correctly. – stgatilov Oct 03 '22 at 18:53