
Google uses UTF-8 as the default for their very popular encoder. From what I can see, they don't even add a byte order mark.

The problem is that most scanners still seem to default to JIS8 (QR 2000) instead of ISO-8859-1 (QR 2005), so encoding in ISO-8859-1 mostly does not work.

It seems like UTF-8 is the only choice, even if it is against the specification.

Edit: I will go with UTF-8 without ECI and without a BOM. Against both the letter and the spirit of the spec, but it works best at the moment.

Gonzo

2 Answers


The specification says that ISO-8859-1 is the default for byte-mode encoding. However, in practice, yes, you'll see a lot of Shift-JIS in Japan, or UTF-8.

UTF-8 is the right choice. To do it properly, you need to put some indication in the stream that it's UTF-8. The spec does allow for this. You need to precede the byte segment with an ECI segment that indicates UTF-8.

The zxing encoder will do that for you if you send it a hint that the encoding is UTF-8.
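
For example, with the Java ZXing library it is enough to pass the CHARACTER_SET encode hint. This is a minimal sketch; the content string and the 300x300 size are arbitrary:

    import com.google.zxing.BarcodeFormat;
    import com.google.zxing.EncodeHintType;
    import com.google.zxing.common.BitMatrix;
    import com.google.zxing.qrcode.QRCodeWriter;

    import java.util.EnumMap;
    import java.util.Map;

    public class Utf8QrExample {
        public static void main(String[] args) throws Exception {
            Map<EncodeHintType, Object> hints = new EnumMap<>(EncodeHintType.class);
            hints.put(EncodeHintType.CHARACTER_SET, "UTF-8"); // hint: content is UTF-8

            // With a non-default character set the encoder emits an ECI segment
            // before the byte segment, as described above.
            BitMatrix matrix = new QRCodeWriter().encode(
                    "Grüße, привет", BarcodeFormat.QR_CODE, 300, 300, hints);
            System.out.println(matrix.getWidth() + "x" + matrix.getHeight());
        }
    }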

Sean Owen
  • After some tests: the Google encoder (and thus also the ZXing online encoder) does not seem to use ECI. More importantly, a lot of scanner apps do not understand the ECI segment. I think it is better to leave it out. – Gonzo Mar 14 '12 at 12:32
  • Which leaves one last question: use BOM? – Gonzo Mar 14 '12 at 12:33
  • @Phelix: No, do not ever use a BOM in UTF-8 streams. It screws lots of things up in certain environments. – tchrist Mar 14 '12 at 13:12
  • Well I know that our decoder will read it properly. The concern is whether others out there choke on the BOM (or the ECI segment). Testing is the only real way to know. I personally would use one or the other, and would choose the ECI. – Sean Owen Mar 14 '12 at 13:12
  • Well, I tested and ECI did not work. For the BOM I will probably go with tchrist and leave it out, too. This is from the UTF-8 Wikipedia page: "Because checking if text is valid UTF-8 is very reliable (the majority of random byte sequences are not valid UTF-8) such use [BOM] should not be necessary." – Gonzo Mar 14 '12 at 15:56
  • Also, the BOM at the beginning interfered with proper vCard detection, even with ZXing it seems (hope I got the BOM right). – Gonzo Mar 14 '12 at 16:04
  • Yes, I'm wrong about it. The BOM becomes an (invisible) character U+FEFF in the parsed stream. I will see how hard it would be to ignore it. But you should probably omit the BOM. – Sean Owen Mar 14 '12 at 16:43

BOM does not help

My experience shows that a BOM does not help. If a QR scanner cannot display a properly encoded UTF-8 string (8-bit byte mode in the data stream), even with an ECI, adding a BOM does not make any difference.

Scanners fail even on properly encoded UTF-8

As an example of a scanner that cannot display a proper UTF-8 string, take Xiaomi phones with MIUI Global v11.0.3 (with their native scanner application). These phones cannot correctly show a string of Cyrillic characters encoded in UTF-8, even if this charset is specified in the ECI; the Cyrillic characters are shown as question marks. But if you add a Chinese/Japanese character (e.g. 日) to the Cyrillic text, the whole text will be displayed correctly by Xiaomi. This happens regardless of whether a BOM is present.

It is the actual characters that matter, not the encoding

You have supposed that it is better to use UTF-8 instead of ISO-8859-1 in QR codes, because ISO-8859-1 was not the default encoding in the earlier QR code standard published in 2000 (ISO/IEC 18004:2000). That standard specified the 8-bit Latin/Kana character set of JIS X 0201 (JIS8) as the default encoding for 8-bit mode, while the updated standard published in 2005 changed the default to ISO-8859-1. So you have supposed that “it mostly does not work to use iso-8859 for encoding”. It depends on whether the US-ASCII characters (specifically, the printable ANSI X3.4-1986 characters in the range 20-7E) are enough for you, or whether you also need ISO-8859-1 characters with umlauts/diaereses, used in languages such as Catalan, French, Galician, German, Occitan and Spanish.

If you only need US-ASCII, then it is safe to use ISO-8859-1 without any ECI rather than UTF-8 with an ECI. In any case, the octet string of US-ASCII characters in the range 20-7E is the same whether it is encoded as ISO-8859-1 or UTF-8. The heuristics used by scanners should be able to figure out the character set automatically if you only use US-ASCII characters. If you need characters with umlauts/diaereses, then go with UTF-8. This is not because the default encoding changed from JIS X 0201 to ISO-8859-1 between the 2000 and 2005 revisions of the QR code standard, but because QR scanners use heuristics to detect the encoding automatically, and these heuristics sometimes fail.
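
As a quick illustration (a minimal sketch; the URL string is just an arbitrary example), pure US-ASCII text yields identical octets under both encodings:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class AsciiOctets {
        public static void main(String[] args) {
            String s = "https://example.com/?id=42";         // printable US-ASCII only (20-7E)
            byte[] latin1 = s.getBytes(StandardCharsets.ISO_8859_1);
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.equals(latin1, utf8)); // true: the octet strings are identical
        }
    }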

Why QR scanners use heuristics to detect encoding

As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji.

So the QR code standard does not inherently support UTF-8. To use UTF-8 encoding (instead of the default “ISO-8859-1” or “JIS8”) in an 8-bit byte segment, the implementation has to insert an ECI (Extended Channel Interpretation) segment before that segment. ECI is an optional, additional feature of a QR code, but it has been defined since at least the 2000 edition of the standard. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded.
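
To illustrate what such an ECI segment looks like (a rough sketch following ISO/IEC 18004; the payload string is arbitrary and the 8-bit character count field shown applies to the smaller symbol versions), the ECI assignment number for UTF-8 is 26, and the ECI segment precedes the byte-mode segment in the bit stream:

    import java.nio.charset.StandardCharsets;

    public class EciHeaderSketch {
        public static void main(String[] args) {
            byte[] payload = "привет".getBytes(StandardCharsets.UTF_8);

            String eciMode = "0111";       // mode indicator: ECI
            String eciNumber = "00011010"; // ECI assignment number 26 = UTF-8 (single-byte form)
            String byteMode = "0100";      // mode indicator: 8-bit byte
            // 8-bit character count indicator (symbol versions 1-9)
            String count = String.format("%8s",
                    Integer.toBinaryString(payload.length)).replace(' ', '0');

            // Header bits that precede the UTF-8 data octets in the QR bit stream
            System.out.println(eciMode + " " + eciNumber + " " + byteMode + " " + count);
        }
    }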

The ECI protocol is defined in a specification developed by AIM, Inc. It is not available for free, but it can be purchased for $50 at https://www.aimglobal.org/technical-symbology.html

Scanners may ignore the ECI protocol

Unfortunately, not all QR scanners can handle the ECI protocol, even for such a basic thing as changing the default encoding to UTF-8. Most implementations use heuristics, i.e. some character-encoding detection algorithm, to guess the encoding, even if the encoding is specified explicitly in the ECI of the decoded QR code. They use heuristics not only because of the change in default encoding from JIS8 to ISO-8859-1 between 2000 and 2005; the main reason is the lack of proper ECI protocol support, probably caused by the fact that the QR code specification and the AIM ECI protocol specification are separate documents. Some QR encoders do not specify the character encoding via ECI and use different encodings for the 8-bit string (JIS8, Shift_JIS, ISO-8859-1, UTF-8), so scanners have to cope with that.
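
The heuristic that many scanners apply looks roughly like the following (a sketch of assumed behaviour, not any particular scanner's code): accept UTF-8 only if the bytes decode strictly, and otherwise fall back to a default such as ISO-8859-1:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class GuessCharset {
        // Decode byte-mode data without trusting (or having) an ECI.
        static String guess(byte[] raw) {
            CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                // Random non-UTF-8 byte sequences rarely decode cleanly as UTF-8,
                // which is why this guess usually works.
                return utf8.decode(ByteBuffer.wrap(raw)).toString();
            } catch (CharacterCodingException e) {
                // Fall back to the default of the 2005 standard.
                return new String(raw, StandardCharsets.ISO_8859_1);
            }
        }

        public static void main(String[] args) {
            System.out.println(guess("Grüße".getBytes(StandardCharsets.UTF_8)));
            System.out.println(guess("Grüße".getBytes(StandardCharsets.ISO_8859_1)));
        }
    }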

You wrote that “it seems like utf-8 is the only choice”, but scanners use heuristics that may fail even with UTF-8, as in the Xiaomi example I have given. You also wrote that UTF-8 “is against the specification”, but this is so only if the UTF-8 encoding is not explicitly specified via ECI.

An alternative to ECI and UTF-8, but not a complete cure

P.S. There is an alternative to using ECI. You can encode Greek or Cyrillic characters using the “Kanji” mode. In this mode, Shift_JIS is used to encode JIS X 0208 characters in the ranges 8140-9FFC and E040-EBBF. You cannot encode characters outside these ranges, e.g. a space as the single byte 20, but you can instead encode it as the JIS X 0208 full-width space at row 1, column 1 (bytes 21 21). Since JIS X 0208 has rows for full-width Roman letters (row 3), Greek (row 6) and Cyrillic (row 7), as well as special characters such as punctuation (rows 1 and 2), you can encode Greek or Cyrillic text (including spaces and punctuation) entirely within the ranges 8140-9FFC and E040-EBBF. No ECI extension is needed in this case. But there is no guarantee that the heuristics in the scanner software will not break your properly encoded text.
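
As a rough check of whether a given string can be carried this way (a sketch with an assumed helper method, not part of any QR library), every character must map to a two-byte Shift_JIS code inside the two ranges that Kanji mode can carry:

    import java.nio.charset.Charset;

    public class KanjiModeCheck {
        static final Charset SJIS = Charset.forName("Shift_JIS");

        // True if every character maps to a two-byte Shift_JIS code in the
        // ranges 8140-9FFC or E040-EBBF that QR Kanji mode can carry.
        static boolean fitsKanjiMode(String s) {
            for (int i = 0; i < s.length(); i++) {
                byte[] b = s.substring(i, i + 1).getBytes(SJIS);
                if (b.length != 2) {
                    return false; // single-byte codes (e.g. the ASCII space 0x20) do not fit
                }
                int code = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
                if (!((code >= 0x8140 && code <= 0x9FFC) || (code >= 0xE040 && code <= 0xEBBF))) {
                    return false;
                }
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(fitsKanjiMode("Привет"));     // true: Cyrillic letters are in JIS X 0208 row 7
            System.out.println(fitsKanjiMode("Привет мир")); // false: the ASCII space maps to a single byte
        }
    }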

Conclusion

Using UTF-8 and specifying it via ECI is not a complete cure (because some scanners will use error-prone heuristics in this case anyway), but at least it helps with compliant scanners, unlike a BOM, which does not help at all.

Maxim Masiutin
  • The internal encoding modes to which you refer are only a means to optimally compress the bitstream for common types of data. Unlike with ECI, the transmission protocol for the reader does not indicate the change of encoding mode to the host, so what the application receives is a byte array that must be interpreted according to the code page that is in effect: Latin-1 unless an ECI sequence changes it. The AIM Technical Symbology Committee and members of "WG1" are refreshing the ECI standards and promoting its use: https://www.linkedin.com/pulse/enhanced-channel-interpretation-terry-burton/ – Terry Burton Sep 27 '20 at 16:57
  • @TerryBurton At least going off the 2015 standard for QR codes, what you said is only true if an ECI is used. If there is no ECI, the different modes really do refer to distinct character sets. Quoting from section 7.3.1: "The modes defined below are based on the character values and assignments associated with the default ECI. When any other ECI is in force (...), the byte values rather than the specific character assignments shall be used to select the optimum data compaction mode." – ndkrempel Dec 28 '21 at 19:23
  • @TerryBurton On the other hand, the "default ECI" is supposed to be Latin-1, so maybe I'm misinterpreting it here. Is the intent that using Kanji mode (with no ECI specified, so defaulting to Latin-1) is more or less useless, as it would encode weird combinations of Latin-1 characters rather than Shift JIS? On second read, it may be trying to say that the default ECI of Latin-1 is only relevant to interpreting bytes produced by byte mode, so Kanji mode does still refer to Shift JIS by default; hard to tell. – ndkrempel Dec 28 '21 at 19:33
  • @TerryBurton The following text from section 7.3.6 "Kanji mode" is confusing things further: "When the character set specified for 8-bit byte mode makes use of byte values in the ranges 81 to 9F and/or E0 to EB, it may not be possible to use Kanji mode unambiguously". It sounds like the spec itself hasn't made its mind up on whether Kanji mode is merely a transfer encoding or has an implied character set (why say "may" and talk about ambiguity if you intend it to have a precise meaning as bytes), but it does sound like your interpretation is more likely the correct one. – ndkrempel Dec 28 '21 at 20:17
  • @ndkrempel The 2D symbology specifications fit into an overall framework in which there is an abstraction of the interface between the reader and the host that models a unidirectional, wired connection that can transport byte values. This is referred to as the "byte channel" over which the "transfer protocol" is executed. In practice, a realisation of the interface may be RS232, keyboard wedge, various USB transports, ... each with their own limitations. It is the solution provider's responsibility to relate each to the model interface so that the environment's messages "survive the channel". – Terry Burton Dec 30 '21 at 21:35
  • For the model configuration it is not possible for the application or host driver to "peek" over the unidirectional interface into the barcode's internal codewords / bitstream in order to determine the precise modes that are in effect. So the decoded message is all that is available, and to avoid ambiguity it is determined that this is interpreted as Latin-1, unless ECI is in effect (as indicated by the symbology identifier at the start of the message). – Terry Burton Dec 30 '21 at 21:41
  • In mobile apps or embedded applications there is no physical distinction between the reader and the host running some application, so this model interface is invisible for practical purposes. So the differentiation between "compression modes (Kanji, etc)" (for codeword optimisation purposes) and character encoding (for interpretation purposes) is no longer obvious. This leads to a layering violation in which developers use whatever information is available to peek into the internal encoding of the QR Code to infer meaning where none is intended. – Terry Burton Dec 30 '21 at 21:45
  • A goal of the QR Code 2005 standard was to rationalise elements of the earlier QR Code standard that did not fit the assumptions of the prevailing framework. (IIRC the earlier version omitted to define the default character encoding when ECI wasn't in effect, for example.) The problem was/is that lots of implementations pipe the message data into a toolkit's text widget, which results in auto-detection of the effective character set. That is disastrous from the perspective of barcodes intending to be a sound and complete transport, rather than something that mostly seems to work until it doesn't. – Terry Burton Dec 30 '21 at 21:52
  • Things are gradually improving with more apps forcing Latin-1 encoding for non-ECI messages. When ECI support is ubiquitously available you are able to specify that an output message or message segment is intended to be interpreted as Shift JIS (ECI \000020) and also have those message bytes encoded efficiently within the internal bitstream of the QR Code symbol using Kanji compression mode. More recent barcodes such as HanXin have support for several additional compression modes for multibyte languages, but use the same ECI protocol over the model interface framework to denote interpretation. – Terry Burton Dec 30 '21 at 22:00
  • @TerryBurton Thank you for the detailed response, that all makes sense. I only wish the QR Code specification made that background and the proper interpretation you describe more explicit (even if it's not what always occurs in practice) and avoided some of the fuzzy wording it uses! Perhaps that is a result of being an edited form of earlier specifications, with some clean-ups of the text overlooked. – ndkrempel Dec 31 '21 at 03:06
  • @TerryBurton In the QR Code 2000 standard the default ECI is ECI 000020 (JIS8 and Shift JIS character sets). – timakro May 14 '23 at 17:47
  • @timakro Indeed, thanks for the clarification. QR Code 2005 and later versions (now referred to as simply "QR Code") subsequently switched the default interpretation to Latin-1. – Terry Burton May 14 '23 at 21:10