How to encode East-European (Polish) signs using simple escape sequences?

Question

I'm developing an embedded application in C, which has to conform to MISRA standards. It will involve the use of strings containing Polish signs (ąęćłńśźż). I tried encoding them using octal/hex escape sequences:

dictionary[archive_error] = "B" "\x88" "ąd pamieci";

but those are prohibited by rule 4.1. of MISRA-C 2004. This rule is required.

My question is: is it possible to encode this character set using only the simple escape sequences of ISO/IEC 9899, and if so, how?

I don't think there are escape sequences for such chars (other than the ones involving their numerical value). What prevents you from using the actual chars in the string? — Jack, Apr 13 '15 at 10:31
@Jack When it comes to non-standard characters beyond the classic 7-bit ASCII, you now and then encounter situations where the font table of the text editor and/or desktop OS and/or compiler and/or the target system are different. It would be ideal if they all used Unicode, but this isn't always the case. — Lundin, Apr 13 '15 at 11:06

Lundin · Accepted Answer · 2015-04-13T11:17:03.447

It is not clear which MISRA version you are using.

Rule 4.1 of MISRA-C:2004 simply prohibits non-standard escape sequences. In MISRA-C:2004 TC1 this was later changed to ban all hexadecimal and octal escape sequences (they have implementation-defined behavior unless you are careful). Apparently this rule and its supposed correction were a bit of a goof-up by the committee.

The rule has been properly fixed in the latest MISRA-C:2012, where rule 4.1 states that escape sequences shall be terminated, either with the start of a new escape sequence or with the end of the string literal, just as in your example.
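As a minimal sketch of what "terminated" means here (the array and function names are made up for illustration, and the byte values assume an ASCII-compatible execution character set): a hex escape takes every hex digit that follows it, so it has to be closed either by the end of the literal or by another escape sequence.

```c
#include <stddef.h>

/* Hex escape closed by the end of the string literal. Adjacent literals
   are concatenated only after escape sequences have been converted, so
   the 'A' that follows is an ordinary character, not a hex digit.
   Written as one literal, "\x32ABBA", the escape would swallow the
   A, B, B, A as hex digits. */
const char terminated[] = "\x32" "ABBA";

/* Hex escape closed by the start of the next escape sequence. */
const char two_escapes[] = "\x32\x33";

/* Number of characters in `terminated`, not counting the null. */
size_t terminated_length(void)
{
    return sizeof terminated - 1u;
}
```

With this split, `terminated` holds five characters plus the null terminator, which is the same pattern as the `"B" "\x88" "..."` line in the question.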

So the code you have posted does not conform to MISRA-C:2004, but it conforms fully to MISRA-C:2012. If you are using the former, I'd just raise a deviation and refer to MISRA-C:2012 rule 4.1.

Otherwise, a work-around is to simply use character literals mixed with integers, instead of string literals:

dictionary[archive_error] = {'B', 0x88u, 'a', ... , '\0'};
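The line above is shorthand, since C does not allow assigning a brace list to an existing array element. One possible way to lay it out, as a sketch only: each message becomes its own byte array, and `dictionary` holds pointers to them. The enum, the `unsigned char` element type, the shortened three-byte text and the `message_length` helper are all assumptions made for illustration; only `dictionary`, `archive_error` and the byte `0x88` come from the question.

```c
#include <stddef.h>

/* Hypothetical message identifiers; only archive_error is from the question. */
enum message_id { archive_error, message_count };

/* Shortened illustrative text: 'B', the byte 0x88 from the question, 'd'.
   Which glyph 0x88 selects depends on the target's character table. */
static const unsigned char archive_error_text[] = { 'B', 0x88u, 'd', '\0' };

/* Table of pointers to the byte arrays, indexed by message_id. */
static const unsigned char *const dictionary[message_count] = {
    archive_error_text
};

/* Counts the bytes of a message up to, but not including, the null. */
size_t message_length(enum message_id id)
{
    size_t n = 0u;
    while (dictionary[id][n] != 0u) {
        n++;
    }
    return n;
}
```

Using `unsigned char` keeps the value 0x88 in range on targets where plain `char` is signed; whether the display routines accept that type is something to check against the actual project.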

edited Apr 13 '15 at 11:17

answered Apr 13 '15 at 11:03

Lundin


I use the 2004 version, included it in edits now. I'll check this out and give feedback – Michał Szydłowski Apr 13 '15 at 12:00
Okay, your workaround seems to be correct, though quite tedious to use if I have lots of very long strings. That will introduce more mess in the code than any gain I could have from it. However, you made me raise the issue of changing the MISRA standards in the project. – Michał Szydłowski Apr 13 '15 at 12:36
@MichałSzydłowski Apart from C99 support, MISRA 2012 also got plenty of "fixes" such as this one. Main issue is usually that you'd have to upgrade your static analyser, which is expensive. At the very least you could purchase a copy of MISRA 2012 and read through it, see if you find some other things there which would make your MISRA implementation easier. – Lundin Apr 13 '15 at 14:45
Yes, I know, it's just that my current development environment (CCS) has a built-in MISRA 2004 compliance check, and integrating a new one means additional time. I will consider that though – Michał Szydłowski Apr 13 '15 at 18:51
The proposed workaround is no more portable than the forbidden solution, save for the fact that MISRA tolerates it. Given `char CAT[5] = "CAT\n";` there are 248,031,000 different byte arrays a conforming implementation with 8-bit signed `char` could produce. If the string were written as `"CAT\x0A"`, there would be only 2,000,250. If it were written as `"\x43\x41\x54\x0A"` there would be only one. The Standard guarantees that sending that array to a text file would cause the bytes to be interpreted as three letters and a newline, but since MISRA forbids using stdout, that guarantees nothing. – supercat May 02 '19 at 18:23
@supercat I have no idea what you are on about. The MISRA rules have absolutely nothing to do with portability, but with accidental, unintended escape sequences such as "\x32ABBA" . – Lundin May 03 '19 at 06:27
@Lundin: My compiler, which I think targets MISRA-2004, has a separate rule about unterminated escape sequences, which could be averted if either octal escape sequences were accepted and always written as three digits, or by writing the string as e.g. `"\x32""ABBA"`. I think the basis of the rule was that on some systems something like `printf("Foo\012");` might terminate the string with something other than the execution character set's newline character, but that ignores the fact that if the implementation's newline isn't an ASCII LF, code will more likely need an ASCII LF than a newline. – supercat May 03 '19 at 15:30
