
I am writing a function that extracts Unicode characters from a string one at a time. The argument is a reference to a pointer to char, which the function advances to the next character before returning a value. Here is the entire function:

uint16_t get_char_and_inc(const char *&c) {
  uint16_t val = *c++;
  if ((val & 0xC0) == 0xC0)
    while ((*c & 0xC0) == 0x80)
      val = (val << 8) | *c++;
  return val;
}

As many have pointed out, this UTF-8 decoder is not technically correct: it is limited to 16-bit codes and it does not remove the encoding bits. It is sufficient for my limited graphics library for microcontrollers, though :)

The complexity of this function is irrelevant to the question, so assume it is simply this:

uint16_t get_char_and_inc(const char *&c) {
  return *c++;
}

The problem I am having is that I would like it to work for both char * and const char*, i.e.:

int main() {
  const char cc[] = "ab";
  get_char_and_inc(cc);
  printf(cc);
  
  char c[] = "ab";
  get_char_and_inc(c); // This does not compile
  printf(c);
}

Expected output:

b
b

However, the second call gives me the error:

invalid initialization of non-const reference of type 'const char*&' from an rvalue of type 'const char*'

There are several questions on Stack Overflow about this particular error message. Usually they concern passing a const char* as a char*, which is illegal. But in this case, I am going from a char* to a const char*. I feel like this should be legal, as I am simply adding a guarantee not to modify the data in the function.

Reading through other answers, it appears the compiler makes a copy of the pointer, turning it into a temporary rvalue. I understand why this may be necessary for non-trivial conversions, but here it seems like it should not be necessary at all. In fact, if I drop the "&" from the function signature, it compiles just fine, but then of course the pointer is passed by value and the program prints "ab" instead of "b".

Currently, to make this work, I have to define the function twice, once taking const char *&c and once taking char *&c. This seems inefficient to me, as the code is exactly the same. Is there any way to avoid the duplication?

marciot
  • would this help https://stackoverflow.com/questions/1863094/pass-strings-by-reference-in-c ? – walid barakat Oct 13 '21 at 22:00
  • @walid barakat: No, I am not just passing a string pointer to the function, I am trying to pass a reference to the string pointer (so that I can increment the pointer inside the function). And the code does work, the issue has to do with needing to write two functions, one that works with `const char *` and another for `char *` – marciot Oct 13 '21 at 22:07
  • `char *c = "ab";` is not legal since C++11, though some compilers *may* allow it as an extension. Perhaps you meant `char c[] = "ab";` instead? If you want a function to take multiple types, make the function be a template, or use `std::variant`, or use type-erasure techniques. – Remy Lebeau Oct 13 '21 at 22:08
  • @RemyLebeau: Alright, makes sense, but suppose "c" was `char c[3] = {'a', 'b', '\0'};` – marciot Oct 13 '21 at 22:12
  • @Yksisarvinen: Yes, I do get an error. For clarification, it is a gcc compiler for an ESP32 (1.22.0-97-gc752ad5-5.2.0/bin/xtensa-esp32-elf-g++), so maybe this is a bug in this particular compiler? I mean, if this was merely a compiler error, I wouldn't feel too bad about having to work around it by duplicating the function. – marciot Oct 13 '21 at 22:15
  • @RemyLebeau: Yes, I know there are workarounds. I guess I am curious why the compiler does not allow me to pass "char *&c" into "const char *&c" the same way as it allows me to pass "char *c" into "const char *c". Maybe the compiler just isn't smart enough to catch this corner case? – marciot Oct 13 '21 at 22:20
  • @marciot Yeah, once I replaced the invalid `char* c = "ab";` with an array, I got the same error, so I removed my comment. I'm not exactly sure why this is an rvalue; perhaps pointer decay makes it an rvalue (which kind of makes sense, since modifying a pointer to an array is a weird thing to do; just imagine the confusion of calling `c[0]` after `inc()`). I'm not sure what the actual reasoning behind that is or what the proper workaround is. – Yksisarvinen Oct 13 '21 at 22:23
  • @marciot on a side note, your `get_char_and_inc()` is not decoding UTF-8 correctly, not even close. And its return value should be `uint32_t` if you are trying to decode a Unicode codepoint. – Remy Lebeau Oct 13 '21 at 22:30
  • @Yksisarvinen: I admit it is a bit odd. The idea is that I call the function multiple times and it extracts one character code at a time. Since these code points are of varying length, the function has to increment the pointer. – marciot Oct 13 '21 at 22:35
  • @marciot: "*The idea is that I call the function multiple times and it extracts one character code at a time*" But it doesn't do that. A Unicode "character code" is 32-bits in size, not 16. Your function assumes that the UTF-8 only handles the first 64K of codepoints. – Nicol Bolas Oct 13 '21 at 22:42
  • @RemyLebeau: Just after I posted the code, I noticed that it probably ought to be uint32_t. But this code is actually used in microcontrollers and my graphics library only supports a limited number of two-byte codepoints anyway so it actually works for what I am using it for, even if incorrect in a general sense. I based the implementation on this https://en.wikipedia.org/wiki/UTF-8 . I also don't bother removing the extra bits since my character rendering code simply deals with them. So maybe I'll add the disclaimer that my code isn't really correct in a general sense. – marciot Oct 13 '21 at 22:44
  • @NicolBolas: See the comment above :) Yes, I should add the disclaimer that the code is technically incorrect, but it works for my graphics library which only supports a few code points. – marciot Oct 13 '21 at 22:47

4 Answers


char* and const char* are not the same type, and a non-const lvalue reference must bind to an object of exactly the referenced type. That is why you can't pass a char* pointer, a char[] array, or a const char[] array to a const char*& parameter: each would first have to be converted to a temporary const char*, and a temporary cannot bind to a non-const lvalue reference.
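
One way to see why the language rejects this binding is to imagine what a callee could do if it were allowed. The sketch below is illustrative only: `read_only`, `retarget` and `first_after_retarget` are hypothetical names, not part of the question's code. The rejected call is left as a comment so the rest compiles.

```cpp
const char read_only[] = "xy";   // hypothetical constant data

// Same parameter type as the question's function, but this callee
// re-points the caller's pointer at constant data.
void retarget(const char *&p) {
    p = read_only;
}

// Returns the first character seen after retargeting a const char*.
char first_after_retarget() {
    char buf[] = "ab";
    char *p = buf;
    (void)p;  // only used in the commented-out call below

    // retarget(p);  // rejected: if const char*& could bind to the char* p,
    //               // p would now point at read_only, and *p = 'z' would
    //               // write to constant data without any cast.

    const char *q = buf;  // copying into a const char* lvalue is fine
    retarget(q);          // exact type match, so the reference binds
    return *q;            // q now points at read_only
}
```

Dropping the "&" is safe for the same reason: the callee then only gets its own copy of the pointer and cannot redirect the caller's.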

In this case, to make get_char_and_inc() a single function that can handle multiple pointer types, make it a function template, e.g.:

template<typename T>
uint16_t get_char_and_inc(T* &c) {
  return *c++;
}

int main()
{
  const char *cc = "ab";
  printf("%p\n", cc);
  get_char_and_inc(cc); // deduces T = const char
  printf("%p\n", cc); // shows cc has been incremented
  
  char c[] = "ab";
  char *p = c;
  printf("%p\n", p);
  get_char_and_inc(p); // deduces T = char
  printf("%p\n", p); // shows p has been incremented

  return 0;
}

Online Demo
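
As written, the template accepts a reference to any pointer type. If only character pointers are intended (an assumption about the use case, not something the question states), a static_assert can enforce that. The following is a minimal C++11 sketch; `both_pointer_types_advance` is a hypothetical helper added only for illustration.

```cpp
#include <stdint.h>
#include <type_traits>

template <typename T>
uint16_t get_char_and_inc(T *&c) {
    // Assumption: only char* and const char* lvalues should be accepted.
    static_assert(std::is_same<typename std::remove_const<T>::type, char>::value,
                  "get_char_and_inc expects a char* or const char* lvalue");
    return *c++;
}

// Reads one character through each pointer type and reports whether
// both pointers advanced by exactly one element.
bool both_pointer_types_advance() {
    const char *cc = "ab";
    const char *cc_start = cc;
    char buf[] = "ab";
    char *p = buf;

    uint16_t first = get_char_and_inc(cc);   // deduces T = const char
    uint16_t second = get_char_and_inc(p);   // deduces T = char

    return first == 'a' && second == 'a' && cc == cc_start + 1 && p == buf + 1;
}
```

Each distinct T still produces its own instantiation, so this does not change the code-size picture; it only narrows what the template will accept.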

Remy Lebeau
  • I'll accept this as the authoritative answer. It works, although it is effectively the same as duplicating the function, just asking the compiler to do it in a more general way. I wouldn't worry so much about it except that this is code for a microcontroller and program memory is limited. I suspect the exact same assembly code is being generated for both functions, which is a shame. – marciot Oct 13 '21 at 22:58
  • @marciot The only way to avoid that really is to type-erase the input parameter, or use an ugly and potentially illegal type-cast. Otherwise, change the function's design, for instance by outputting the `uint16_t` in a reference parameter so the function is free to then `return` the incremented pointer, then have it take a non-reference `const char*` so it can accept `char*` as input. – Remy Lebeau Oct 13 '21 at 23:07
  • Having it return the incremented pointer might actually avoid the issue Yksisarvinen warned about, where the caller might be surprised to see the pointer being incremented. I will consider making the pointer the return value while making the character the reference. While I am at it, I will also consider modifying the function to return a '?' character when it encounters a three or four byte UTF-8 character, as this function does indeed return gibberish in those cases. Thank you for the suggestions! – marciot Oct 13 '21 at 23:24

If you're worried about the program size you can add a static inline overload like this:

uint16_t get_char_and_inc(const char *&c);

static inline uint16_t get_char_and_inc(char *&c) {
    const char *cc = c;
    uint16_t r = get_char_and_inc(cc);
    c = const_cast<char*>(cc);
    return r;
}

Any optimizing compiler worth the title will collapse it down to nothing.
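
For completeness, here is a self-contained sketch of how the two overloads fit together. It assumes the simplified one-line body from the question as the out-of-line implementation; `read_then_overwrite` is a hypothetical helper that only demonstrates that the advanced pointer is still a usable char*.

```cpp
#include <stdint.h>

// Single out-of-line implementation (simplified body from the question).
uint16_t get_char_and_inc(const char *&c) {
    return *c++;
}

// Thin forwarding overload for mutable buffers.
static inline uint16_t get_char_and_inc(char *&c) {
    const char *cc = c;                 // const view of the same buffer
    uint16_t r = get_char_and_inc(cc);  // picks the const char*& overload
    c = const_cast<char *>(cc);         // cc still points into the caller's mutable buffer
    return r;
}

// Hypothetical helper: reads one character, then writes through the
// advanced pointer.
uint16_t read_then_overwrite(char *buf) {
    char *p = buf;
    uint16_t first = get_char_and_inc(p);  // picks the char*& overload
    *p = '!';                              // overwrites the second character
    return first;
}
```

The const_cast is only sound here because the pointer originally came from a char*, so the data it refers to was never const.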

yuri kilochek

You could go functional and return a tuple, e.g. (demonstrating std::get and structured binding):

#include <iostream>
#include <tuple>
#include <stdlib.h>  // free
#include <string.h>  // strdup (POSIX)

std::tuple<int, char const*> get_char_and_inc(char const* c) {
  int x = static_cast<int>(*c);
  c++;
  return {x, c};
}

int main() {
  char const* cc = "ab";
  auto v1 = get_char_and_inc(cc);
  std::cout << std::get<0>(v1) << ", " <<
               std::get<1>(v1) << "\n";

  char* c = strdup("ab");
  auto [val2, next_c2] = get_char_and_inc(c);
  std::cout << val2 << ", " <<
               next_c2 << "\n";
  free (c);
  return 0;
}

See demo: https://godbolt.org/z/9EWf5zWaj - from there you can see that with -Os the object code is pretty compact (the only real bloat is for std::cout)

Den-Jason

The problem is that you are passing the pointer to the string by reference. You can do it this way, but as you found out, you then can't mix const char* and char*. One option is to create a const char*, call it pCursor, and pass that in instead. I would recommend writing your function as shown below: you pass a reference to the value and return a const char* pointer to the next character. I would also recommend using an index value rather than incrementing the pointer directly.

const char* get_char_and_inc(const char* pStr, uint16_t& value)
{
    int currentIndex = 0;

    value = pStr[currentIndex++];

    if ((value & 0xC0) == 0xC0)
    {
        while ((pStr[currentIndex] & 0xC0) == 0x80)
        {
            value = (value << 8) | pStr[currentIndex++];
        }
    }

    return &pStr[currentIndex];
}

Then your main becomes:

int main()
{
    const char cc[] = "ab";

    uint16_t value;

    const char* pCursor = get_char_and_inc(cc, value);

    printf(pCursor);

    char c[] = "ab";

    pCursor = get_char_and_inc(c, value);

    printf(pCursor);
}
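
The pointer-returning form also makes it easy to walk a whole string. The sketch below repeats the function so that it stands alone, with one deliberate difference: bytes are read as unsigned char so that the result does not depend on whether plain char is signed on the target. `count_chars` is a hypothetical helper, and like the original this only handles what the question's decoder handles.

```cpp
#include <stdint.h>

// Same shape as the function above, reading bytes as unsigned char.
const char *get_char_and_inc(const char *pStr, uint16_t &value) {
    int i = 0;
    value = static_cast<unsigned char>(pStr[i++]);
    if ((value & 0xC0) == 0xC0) {
        // Append continuation bytes (10xxxxxx) to the lead byte.
        while ((static_cast<unsigned char>(pStr[i]) & 0xC0) == 0x80) {
            value = static_cast<uint16_t>((value << 8) |
                                          static_cast<unsigned char>(pStr[i++]));
        }
    }
    return &pStr[i];
}

// Hypothetical helper: counts the characters in a NUL-terminated string
// by calling get_char_and_inc until the terminator is reached.
int count_chars(const char *s) {
    int n = 0;
    uint16_t value;
    while (*s != '\0') {
        s = get_char_and_inc(s, value);  // accepts char* and const char* alike
        ++n;
    }
    return n;
}
```

Because the parameter is a plain const char* passed by value, both a char[] buffer and a string literal can be passed without any overloads or templates.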

If you don't want to change your get_char_and_inc function, you can change your main to this:

int main()
{
    const char cc[] = "ab";

    const char* pCursor = cc;

    get_char_and_inc(pCursor);
    printf(pCursor);

    char c[] = "ab";

    pCursor = c;

    get_char_and_inc(pCursor); // compiles: pCursor is a const char* lvalue
    printf(pCursor);
}