I want to decode percent-encoded URLs. As an example, the letter ö is encoded as "%C3%B6", corresponding to its hexadecimal UTF-8 encoding 0xc3b6 (50102 in decimal).

I now need to know how to print this value as ö on the console or into a string buffer.

Simply casting to char, wchar_t, char16_t or char32_t and printing to cout or wcout didn't work.

The closest I have got was by using its UTF-16 representation 0x00f6. The following code snippet prints ö:

#include <codecvt>
#include <iostream>
#include <locale>

int main() {
  std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t> convert;
  std::cout << convert.to_bytes(0x00f6) << '\n';
}

I now need either a way to calculate 0x00f6 from 0xc3b6, or another approach to decoding the URL.

sv90
  • `'ö'` is mostly `'\x50\x102'`, so you won't have it in a char as is. – Jarod42 Jan 06 '19 at 09:52
  • @Jarod42 Is it possible to have it in a string? – sv90 Jan 06 '19 at 10:18
  • `ö` lies outside the ASCII range, therefore in either [precomposed or decomposed form](https://en.wikipedia.org/wiki/Precomposed_character) it'll take more than one byte. You must use a wide char like `L'ö'` or print it as a string instead of a char – phuclv Jan 06 '19 at 10:30
  • @Jarod42 50102 is the decimal value. It's 0xC3B6 in hexadecimal which is the UTF-8 representation of [U+00F6](https://www.fileformat.info/info/unicode/char/00f6/index.htm) – phuclv Jan 06 '19 at 10:33
  • @phuclv: Indeed, I use wrong values :-/ My point was mostly that multi byte character doesn't fit in one `char`. – Jarod42 Jan 06 '19 at 17:32
  • @Jarod42 Is there a way to calculate `0x00f6` from `0xc3b6`? I couldn't find a relation between both Unicode representations. If there was one I could use the classes in `` to print `ö` from `0xc3b6` – sv90 Jan 06 '19 at 19:22
  • Better to make it so that you don't have this multicharacter literal in the first place. What is it that you're trying to do? – Lightness Races in Orbit Jan 06 '19 at 19:30
  • @LightnessRacesinOrbit I want to decode an encoded url where I have strings like %C3%B6 that I want to print as ö – sv90 Jan 06 '19 at 19:38
  • @sv90 Right, that's far more interesting and does not resemble your current question at all. I suggest you edit it to describe the real problem. – Lightness Races in Orbit Jan 06 '19 at 19:39
  • @LightnessRacesinOrbit Thanks for the advice – sv90 Jan 06 '19 at 19:42
  • https://stackoverflow.com/q/21216307/560648 – Lightness Races in Orbit Jan 06 '19 at 20:00
  • @LightnessRacesinOrbit Thanks but that UriHelper somehow didn't work for "%C3%B6" – sv90 Jan 06 '19 at 20:27
  • Can't do much with the information "didn't work" – Lightness Races in Orbit Jan 06 '19 at 20:28
  • @LightnessRacesinOrbit It printed `?` instead of `ö` – sv90 Jan 06 '19 at 20:30
  • Just because you can decode a URL to a UTF-8 string doesn't guarantee that your console can display it properly. Also, there is no guarantee that a URL uses UTF-8 for encoding characters. The encoding of a URL is dependent on the server that the URL belongs to. This is a shortcoming that IRIs were designed to address when replacing URIs/URLs. Unicode is an afterthought in URLs, but is part of the design of IRIs – Remy Lebeau Jan 07 '19 at 16:54

2 Answers

On POSIX systems you can print a UTF-8 string directly, provided the terminal uses UTF-8:

std::string utf8 = "\xc3\xb6"; // or u8"ö" before C++20
std::printf("%s\n", utf8.c_str());

On Windows, you have to convert to UTF-16. Use wchar_t instead of char16_t, even though char16_t is supposed to be the right type: both are 2 bytes per character on Windows, and the Windows API takes wchar_t.

You want convert.from_bytes to convert from UTF8, instead of convert.to_bytes which converts to UTF8.

Printing Unicode in Windows console is another headache. See relevant topics.

Note that std::wstring_convert is deprecated since C++17 and has no standard replacement as of now.

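#include <codecvt>
#include <iostream>
#include <locale>
#include <string>
#include <windows.h>

int main()
{
    std::string utf8 = "\xc3\xb6";

    // codecvt_utf8_utf16 converts between UTF-8 and UTF-16;
    // codecvt_utf8<wchar_t> would only produce UCS-2 on Windows.
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> convert;
    std::wstring utf16 = convert.from_bytes(utf8);

    // Call the wide-character (W) functions explicitly so this compiles
    // whether or not UNICODE is defined.
    MessageBoxW(nullptr, utf16.c_str(), nullptr, 0);
    DWORD count;
    WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), utf16.c_str(),
                  static_cast<DWORD>(utf16.size()), &count, nullptr);

    return 0;
}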
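
As for calculating 0x00f6 from 0xc3b6 by hand: in a two-byte UTF-8 sequence the lead byte has the form 110xxxxx and the trail byte 10xxxxxx, and the code point is the 11 payload bits joined together. For 0xC3 0xB6 that is 00011 followed by 110110, which gives 0xF6. The following is only a sketch for the two-byte case (it does not handle one-, three- or four-byte sequences), and `decode_utf8_2byte` is a made-up name, not a standard function. It assumes C++17 for `std::optional`.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: decodes a two-byte UTF-8 sequence into a code point.
// Returns std::nullopt if the bytes do not form a valid two-byte sequence.
std::optional<char32_t> decode_utf8_2byte(unsigned char lead, unsigned char trail)
{
    // A two-byte lead has the bit pattern 110xxxxx; a trail byte is 10xxxxxx.
    if ((lead & 0xE0) != 0xC0 || (trail & 0xC0) != 0x80)
        return std::nullopt;

    // Keep the 5 payload bits of the lead and the 6 payload bits of the trail.
    char32_t cp = (static_cast<char32_t>(lead & 0x1F) << 6) | (trail & 0x3F);

    // Code points below U+0080 must use one byte; two bytes would be overlong.
    if (cp < 0x80)
        return std::nullopt;

    return cp;
}
```

Calling `decode_utf8_2byte(0xC3, 0xB6)` yields 0xF6, the value that was passed to `to_bytes` in the question. For real input a library conversion is usually the safer choice, since it covers all sequence lengths.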

Encoding/Decoding URL

"URL safe characters" don't need encoding. All other characters, including non-ASCII characters, should be encoded. Example:

#include <iomanip>
#include <sstream>
#include <string>

std::string encode_url(const std::string& s)
{
    const std::string safe_characters = 
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~";
    std::ostringstream oss;
    for(auto c : s) {
        if (safe_characters.find(c) != std::string::npos)
            oss << c;
        else
            oss << '%' << std::setfill('0') << std::setw(2) << 
                std::uppercase << std::hex << (0xff & c);
    }
    return oss.str();
}

std::string decode_url(const std::string& s) 
{
    std::string result;
    for(std::size_t i = 0; i < s.size(); i++) {
        if(s[i] == '%') {
            try { 
                auto v = std::stoi(s.substr(i + 1, 2), nullptr, 16);
                result.push_back(0xff & v);
            } catch(...) { } //handle error
            i += 2;
        }
        else {
            result.push_back(s[i]);
        }

    }
    return result;
}
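
The decoder above accepts anything `std::stoi` accepts, so input such as "%+1" or a trailing "%4" is parsed as a number. A possible stricter variant is sketched below. The names `hex_value` and `decode_url_strict` are made up for illustration, and the choice to copy a malformed "%" through unchanged is an assumption; rejecting the whole input would be equally defensible.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hex digit, or -1 if c is not a hex digit.
int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Sketch of a stricter decoder. Assumption: a '%' that is not followed by
// exactly two hex digits is copied through unchanged instead of being an error.
std::string decode_url_strict(const std::string& s)
{
    std::string result;
    result.reserve(s.size());
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Only try to decode if two more characters are available.
        if (s[i] == '%' && i + 2 < s.size()) {
            int hi = hex_value(static_cast<unsigned char>(s[i + 1]));
            int lo = hex_value(static_cast<unsigned char>(s[i + 2]));
            if (hi >= 0 && lo >= 0) {
                result.push_back(static_cast<char>(hi * 16 + lo));
                i += 2; // skip the two hex digits just consumed
                continue;
            }
        }
        result.push_back(s[i]);
    }
    return result;
}
```

With this sketch, "%C3%B6" decodes to the two bytes 0xC3 0xB6, while "100%" and "%zz" come back unchanged.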
Barmak Shemirani

Thanks for all the help. Here is what I have come up with. Maybe it will help someone else

#include <cstdint>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <utility>

std::string encode_url(const std::string& s) {
  std::ostringstream oss;
  for (std::uint16_t c : s) {
    if (c > 0 && c < 128) {
      oss << static_cast<char>(c);
    }
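    else {
      // Pad to two hex digits so that e.g. a NUL byte becomes %00, not %0.
      oss << '%' << std::uppercase << std::hex << std::setfill('0')
          << std::setw(2) << (0x00ff & c);
    }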
  }
  return std::move(oss).str();
} 

int parse_hex(const std::string& s) {
  std::istringstream iss(s);
  int n;
  iss >> std::uppercase >> std::hex >> n;
  return n;
}

std::string decode_url(const std::string& s) {
  std::string result;
  result.reserve(s.size());
  for (std::size_t i = 0; i < s.size();) {
    if (s[i] != '%') {
      result.push_back(s[i]);
      ++i;
    }
    else {
      result.push_back(parse_hex(s.substr(i + 1, 2)));
      i += 3;
    }
  }
  return result;
}

There is still room for optimizations but it works :)

sv90
  • there's no need to do URL encode if you just want to print. URL encode has almost no usage outside of URLs – phuclv Jan 07 '19 at 15:20
  • I did it only for completeness and testing :) – sv90 Jan 07 '19 at 16:23
  • Your decode function should work fine. But the encode function can run into trouble. For example `; / ? : @ = &` are reserved characters. See suggested answer. – Barmak Shemirani Jan 07 '19 at 18:49