
I have read around 20 questions and checked the documentation without success. I have no experience writing code that handles text encodings; I have always avoided it.

Let's say I have a file that I am sure will always be UTF-8:

á

And let's say I have this code:

  wifstream input{argv[1]};
  wstring line;
  getline(input, line);

When I debug it, I see the line is stored as L"Ã¡", that is, as two characters instead of one, so I cannot iterate it the way I want. I want exactly 1 symbol so that I can call, say, iswalnum(line[0]).

I found that there is a codecvt facet, but I am not sure how to use it or whether it is the best approach. I use cl.exe from VS2019, which gives me a lot of conversion and deprecation errors on the example provided here: https://en.cppreference.com/w/cpp/locale/codecvt_utf8

I also found the from_bytes function, but cl.exe from VS2019 gives me a lot of errors on that example, too: https://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes

So how do I correctly read a line containing, say, the letter á and get a container of size 1 that I can iterate, so that a function like iswalnum can simply be called on each element?

EDIT: When I fix the bugs in those examples (for c++latest), I still have á in UTF-8 and á in UTF-16.

Lukas Salich

1 Answer


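Seeing L"Ã¡" (two characters instead of one) means the file was read with the wrong encoding: each UTF-8 byte became its own wide character. You have to imbue the stream with a UTF-8 locale before reading from it.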

  wifstream input{argv[1]};
  input.imbue(std::locale("en_US.UTF-8"));
  wstring line;
  getline(input, line);

Now the wstring line will contain Unicode code points (a single á in your case) and can be iterated element by element.
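For completeness, here is a slightly fuller sketch of the same idea. The helper name read_first_line_utf8 and its signature are made up for illustration, and it assumes the locale name "en_US.UTF-8" is available on your system; if it is not, std::locale throws std::runtime_error, which the helper reports as failure. It also imbues before opening the file, which is just a conservative ordering.

```cpp
#include <fstream>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads the first line of a UTF-8 file into `line`.
// Returns false if the named locale is not installed or the file cannot be opened.
bool read_first_line_utf8(const std::string& path, std::wstring& line,
                          const char* locale_name = "en_US.UTF-8") {
    std::locale utf8_locale;
    try {
        utf8_locale = std::locale(locale_name);  // throws if the name is unknown
    } catch (const std::runtime_error&) {
        return false;
    }

    std::wifstream input;
    input.imbue(utf8_locale);  // set the locale before any bytes are read
    input.open(path);
    if (!input) {
        return false;
    }

    std::getline(input, line);
    return true;
}
```

If the call succeeds on a file containing only á, line.size() should be 1. Note that iswalnum consults the global C locale, so one option is to classify with the same locale object instead, for example std::isalnum(line[0], std::locale("en_US.UTF-8")) from <locale>.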


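Caveat: on Windows wchar_t is only 16 bits wide, so a wstring holds UTF-16 code units there. Iterating element by element is therefore good enough only for characters in the Basic Multilingual Plane (BMP); anything outside it occupies two elements.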
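If characters outside the BMP matter, one way around the 16-bit limit is to skip wchar_t and decode the UTF-8 bytes into char32_t yourself. This is a different technique from the imbue approach above, and the function below (decode_utf8 is a name chosen for this sketch) is deliberately simplified: it replaces malformed or truncated sequences with U+FFFD but does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <string>

// Sketch: decode UTF-8 bytes into one char32_t per Unicode code point.
// Malformed or truncated sequences become U+FFFD (replacement character).
// Simplification: overlong encodings and surrogates are NOT rejected.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(U'\uFFFD'); ++i; continue; }  // invalid lead byte
        ++i;

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= bytes.size()) { ok = false; break; }        // truncated input
            const unsigned char c = static_cast<unsigned char>(bytes[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }       // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : U'\uFFFD');
    }
    return out;
}
```

Read the file with a plain std::ifstream and std::getline into a std::string, then pass that string to decode_utf8; each element of the result is one code point regardless of platform. As far as I know the standard library offers no iswalnum equivalent for char32_t, so classifying non-ASCII code points this way would need a Unicode library or your own tables.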

rustyx
  • Thanks. I now realize I ran into the same problem in the past: I remember searching for an fstream constructor that takes a locale, and there isn't one. – Lukas Salich Jun 23 '20 at 01:26