Ensuring consistency when encoding to UTF8 from extended ASCII

Question

Maybe this is a non-issue but I look to the collected wisdom of SO to help me find out.

We're trying to ensure encodings are consistent across platforms. The way to go is clearly UTF-8. However, some platforms unfortunately use extended ASCII (typically some form of Windows code page). We're concerned that when converting something with, say, an umlaut from a Windows code page to UTF-8, there are multiple possible choices within UTF-8 for the character.

On a different platform (Linux, Mac OS), how do we ensure that the UTF8 character chosen there is consistent?

As I said, maybe this is a non-issue. Maybe there is some standard mapping I'm unaware of. We haven't seen any problems but a colleague just raised the concern so I'm on the hunt for information.

Thank you all in advance.

I had a similar problem embedding a QR code inside a view. I realized I missed using the `base64_encode()` method like so: `` (from the `use SimpleSoftwareIO\QrCode\Facades\QrCode;`) — Pathros, Feb 18 '22 at 19:23

score 1 · Answer 1 · answered Oct 09 '12 at 23:49


As long as you properly convert the original text to Unicode first and then use UTF-8 to store or transfer the data, there should be no problems.
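Since the asker mentions a Python server, the two-step conversion can be sketched as follows. This is only an illustration: `cp1252` is an assumed source encoding and the sample bytes are made up; the real code page has to be known (or transmitted) for each platform.

```python
# Bytes as they might arrive from a Windows client.
# Assumption: the sender used code page 1252, where 0xFC is "ü".
raw = b"M\xfcller"

# Step 1: legacy bytes -> Unicode text (code points).
text = raw.decode("cp1252")

# Step 2: Unicode text -> UTF-8 bytes for storage or transfer.
utf8 = text.encode("utf-8")

print(text)  # Müller
print(utf8)  # b'M\xc3\xbcller'
```

Decoding with the wrong source encoding does not usually raise an error; it silently yields different characters, so the source encoding should be stated explicitly rather than guessed.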


Alexei Levenkov


Makes sense. Our server code is python so unicode at that end is easy enough. Any idea if ICU is still the standard for handling unicode in C++? – Endophage Oct 09 '12 at 23:55

score 1 · Answer 2 · answered Oct 10 '12 at 06:11

The Unicode Consortium has compiled a set of mapping tables. Nominally informational, they constitute a de facto standard. Moreover, many of the mappings there reflect formal standards, as it has become normal to define any new character encoding in terms of Unicode, i.e. by specifying the Unicode number (and/or Unicode name) of each character.

Once a character has been mapped to Unicode (i.e., to a Unicode code point, or Unicode number), its encoding in each Unicode encoding, such as UTF-8, has been defined unambiguously.
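A small Python sketch of this point, under stated assumptions: the list of codecs is an arbitrary selection, and `to_utf8` is a hypothetical helper, not part of any library. The same character has different byte values in different legacy encodings, but each maps to the single code point U+00FC, so the UTF-8 result is identical. The NFC normalization step is an optional precaution for text that arrives in decomposed form (for example "u" followed by U+0308) from sources other than code page conversion.

```python
import unicodedata

# Illustrative selection of legacy encodings that contain "ü".
CODECS = ["cp1252", "latin-1", "cp437", "mac_roman"]


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: legacy bytes -> code points -> UTF-8."""
    text = raw.decode(source_encoding)
    # Optional precaution: fold decomposed sequences into precomposed form.
    text = unicodedata.normalize("NFC", text)
    return text.encode("utf-8")


results = {}
for codec in CODECS:
    raw = "\u00fc".encode(codec)  # "ü" as stored in this legacy encoding
    results[codec] = (raw, to_utf8(raw, codec))
    print(f"{codec:10} {raw!r:8} -> {results[codec][1]!r}")

# A decomposed "ü" (u + combining diaeresis) ends up as the same bytes too.
decomposed = "u\u0308".encode("utf-8")
print(to_utf8(decomposed, "utf-8"))
```

The legacy byte differs per codec, while the right-hand column is the same two-byte UTF-8 sequence every time.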

So the issue is how you ensure that the conversion routines you use work according to those tables. Using ICU can be regarded as safe in this respect.

P.S. There is no extended ASCII. There are various character encodings, some of which coincide with ASCII in the range from 0 to 0x7F, some don’t.

Thanks. I'm aware there is no *standard* definition of "extended ASCII", otherwise I would have capitalised the "Extended". However, it is a generally recognised term for character encodings that make use of the 8th bit. – Endophage, Oct 10 '12 at 16:55
