How to read UTF-8 text from a file into an iterable container and check whether each symbol is alphanumeric in C++?

Question

I read around 20 questions and checked the documentation with no success. I have no experience writing code that handles text encodings; I have always avoided it.

Let's say I have a file that I am sure will always be UTF-8:

á

Let's say I have this code:

  wifstream input{argv[1]};
  wstring line;
  getline(input, line);

When I debug it, I see the line is stored as L"Ã¡", so I cannot iterate over it the way I want. I want it to hold just 1 symbol so that I can call, say, iswalnum(line[0]).

I found that there is a codecvt facet, but I am not sure how to use it or whether it is the best way. I use cl.exe from VS2019, which gives me a lot of conversion and deprecation errors on the example provided: https://en.cppreference.com/w/cpp/locale/codecvt_utf8

I also found the from_bytes function, but cl.exe from VS2019 gives me a lot of errors on that example, too: https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes

So how do I correctly read a line containing, say, the letter (symbol) á and iterate over it as a container of size 1, so that a function like iswalnum can simply be called?

EDIT: When I fix the bugs in those examples (for c++latest), I still get Ä‚Ë‡ in UTF-8 and Ăˇ in UTF-16.

@rustyx Nice help, I will try it. – Lukas Salich Jun 22 '20 at 19:04

score 1 · Accepted Answer · answered Jun 22 '20 at 21:50

L"Ã¡" means the file was read with the wrong encoding. You have to imbue the stream with a UTF-8 locale before reading from it.

  std::wifstream input{argv[1]};
  // Throws std::runtime_error if this locale name is not available on the system.
  input.imbue(std::locale("en_US.UTF-8"));
  std::wstring line;
  std::getline(input, line);

Now the wstring line will contain Unicode code points (á in your case) and can easily be iterated.

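Caveat: on Windows wchar_t is only 16 bits wide and holds UTF-16 code units, so iterating one wchar_t at a time is good enough only for characters in the Basic Multilingual Plane (BMP). Characters outside the BMP take two wchar_t (a surrogate pair).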
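
One way around the 16-bit limit, which is an alternative to the imbue approach and not something the answer proposes, is to read the file as plain bytes and decode UTF-8 by hand into a std::u32string, where every element is one code point regardless of platform. The function name decode_utf8 below is hypothetical, and this is a minimal sketch: it rejects truncated input and bad lead or continuation bytes, but it does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (name is an assumption): decodes UTF-8 bytes into
// one char32_t per code point. Throws std::runtime_error on truncated or
// malformed sequences. Overlong forms and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size();) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1) {
            throw std::runtime_error("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::runtime_error("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The input string could come from a std::ifstream opened in binary mode and read with std::getline into a std::string. Classifying the resulting code points is a separate step; values that fit in wchar_t can still be passed to a function such as iswalnum.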

rustyx

Thanks. I now realize I had the same problem in the past: I remember searching for an fstream constructor that takes a locale, and there isn't one. – Lukas Salich Jun 23 '20 at 01:26
