
I work on Linux. I need to read from the console into a char16_t buffer. Currently my code looks like this:

char tempBuf[1024] = {0};
int readBytes = read(STDIN_FILENO, tempBuf, 1024);
char16_t* buf = convertToChar16(tempBuf, readBytes); 

Inside the convert function I use the mbrtoc16 standard library function to convert each character separately. Is this the only way to read from the console into a char16_t buffer? Do you know any alternative solutions?

Irbis
  • mbrtoc16 is a good way to do it. The main problem I see with your code is the assumption that reading 1024 bytes won't truncate a multi-byte character. You need to pay attention to the return values of mbrtoc16 and handle that scenario, using ungetc or something similar to push back the truncated bytes when it finds some – Scott Christensen Jul 16 '23 at 14:25
  • @Scott Christensen Could you please show an example for the truncated bytes? – Irbis Jul 16 '23 at 21:24
  • Sure, I'll add it as an answer to make things official – Scott Christensen Jul 17 '23 at 21:05

1 Answer


Multi-byte Characters

The main thing you want to be careful of when reading into a fixed-length buffer is accidentally truncating the "multi-byte characters" in your "multi-byte string".

What is a multi-byte character, you ask? In my environment they're UTF-8 characters. For example, if I run echo $LANG I get en_US.UTF-8. These are exactly what they sound like: characters that are stored across multiple bytes. Anything outside the 7-bit ASCII set is stored in 2 or more bytes that follow each other sequentially. If you read only part of a multi-byte character (truncating it), you end up with garbage on both sides of the read.

So let's see a concrete example:

Example Code

In the complete runnable file below, I purposely shorten the buffer to only 5 bytes, just enough to hold a full 4-byte UTF-8 multi-byte character and a null terminator.

#include <stdio.h>
#include <unistd.h>
#include <string.h>

#define BUF_LEN 5

int main()
{
    /* you do your read assuming some byte length */
    char tempBuf[BUF_LEN] = {0};
    ssize_t readBytes = read(STDIN_FILENO, tempBuf, BUF_LEN);
    if (readBytes <= 0)
        return 1; /* read error or end of input */

    /* If you try to read from this tempBuffer with %s you'll overrun your
     * buffer since it doesn't have a null terminator, so we'll look at it
     * character by character */
    printf("Printing bytes:\n");
    for(size_t i = 0; i < (size_t)readBytes; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* we cast to unsigned char because the bytes of a multi-byte
         * character are negative as a signed char, and %x would
         * sign-extend them (printing 0xffffffc3 instead of 0xc3) */
    }

    /* so what do we do if we identify a bad byte? we put it back into stdin */
    /* start at the end and search backward to find the most recent ascii
     * character */
    printf("\nlet's back up\n");
    char * p = &tempBuf[BUF_LEN - 1];
    while(p >= tempBuf && ((unsigned char)*p) > 127)
    {
        /* note: the standard only guarantees one byte of pushback,
         * though glibc in practice allows several */
        ungetc((unsigned char)*(p--), stdin);
    }
    printf("try again on that character\n");
    memset(tempBuf, 0, BUF_LEN); // set the buffer to zero again so what we 
                                 // read makes sense
    fgets(tempBuf, BUF_LEN, stdin);
    printf("Printing bytes again:\n");
    for(size_t i = 0; i < (size_t)readBytes; i++)
    {
        printf( "\t%zu) 0x%02x -- %c\n",
                i,
                (unsigned char)tempBuf[i], 
                (unsigned char)tempBuf[i]);
        /* same unsigned char cast as above */
    }
    printf("Multi-byte string all at once: \"%s\"", tempBuf);
    
    return 0;
}

Running an example

Taking the above code, I can construct an input that I know will break (truncate) a character on purpose, like so, and see what is going on.

scott@scott-G3:~/tmp$ g++ -o stackoverflow_example stackoverflow_example.cpp 
scott@scott-G3:~/tmp$ ./stackoverflow_example 
abcdé
Printing bytes:
    0) 0x61 -- a
    1) 0x62 -- b
    2) 0x63 -- c
    3) 0x64 -- d
    4) 0xc3 -- �

let's back up
try again on that character
Printing bytes again:
    0) 0xc3 -- �
    1) 0xa9 -- �
    2) 0x0a -- 

    3) 0x00 -- 
    4) 0x00 -- 
Multi-byte string all at once: "é

So what happened?

In the example above, I purposely positioned the UTF-8 character "é", which encodes as the two bytes 0xC3, 0xA9, so that it would get cut off by your read call. I then used ungetc to put 0xC3 back into stdin and read it again together with its partner 0xA9. Only when they're next to each other do they make any sense. You also see a 0x0a following it, which we know and love as '\n', because the read captured my Enter key as well.

  • Nice explanation. Could you please also explain why the extra UTF-8 characters start with a negative signed char? Is it just a convention? – Irbis Jul 19 '23 at 12:01
  • That is because UTF-8 retained backward compatibility with the 7-bit ASCII set, which had already claimed the positive values. It extended the character set by mapping the unused negative values to be the first part of all multi-byte UTF-8 characters – Scott Christensen Jul 19 '23 at 14:45