
Premise

  • I have a blob of binary data in memory, represented as a char* (maybe read from a file, or transmitted over the network).
  • I know that it contains a UTF8-encoded text field of a certain length at a certain offset.

Question

How can I (safely and portably) get a u8string_view to represent the contents of this text field?

Motivation

The motivation for passing the field to down-stream code as a u8string_view is:

  • It very clearly communicates that the text field is UTF8-encoded, unlike string_view.
  • It avoids the cost (likely free-store allocation + copying) of returning it as u8string.

What I tried

The naive way to do this would be:

char* data = ...;
size_t field_offset = ...;
size_t field_length = ...;

char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);

However, if I understand the C++ strict-aliasing rules correctly, this is undefined behavior because it accesses the contents of the char* buffer via the char8_t* pointer returned by reinterpret_cast, and char8_t is not an aliasing type.

Is that true?

Is there a way to do this safely?

smls
  • As far as I know `char` is special here. Is gcc/clang... issuing a warning? – Bernd Aug 11 '20 at 18:47
  • @Bernd `char` is special but I don't think it applies here. A `char*` can alias anything but a `char8_t*` cannot alias a `char` as far as I know. – Guillaume Racicot Aug 11 '20 at 18:49
  • In C++23 we may have `std::start_lifetime_as`, but I'm not sure if there's anything to help that case in C++20 besides acknowledging that you're UBing to achieve that. – Guillaume Racicot Aug 11 '20 at 18:51
  • Have a look at implicit object creation, it may make your program well-defined. – geza Aug 11 '20 at 18:53
  • We know that an allocation of type `char` can be used to hold other objects via placement-new, for example. Considering that both types are trivial, I wonder if it would make the behavior defined to iterate the buffer and assign each element to itself through the target type? E.g. `for (auto i = data + field_offset; i < data + field_offset + field_length; ++i) { *reinterpret_cast<char8_t*>(i) = *i; }` If this is defined behavior then that could be a workaround to avoid the UB. Assuming the bit representation of each value is identical, a smart compiler could elide the whole loop. – cdhowie Aug 11 '20 at 19:20
  • If the reinterpret-cast-dereference-assignment operation is UB, then perhaps placement-new would not be? `new (i) char8_t{*i};` ? If defined, it would be interesting to see what compilers do with both loops. – cdhowie Aug 11 '20 at 19:21
  • If the entire blob is UTF-8 data, why not have it as a bunch of char8_t in the first place? I wouldn't worry about it too much anyway. Real software `reinterpret_cast`s data received from the network or read from files. It's very common practice and the standard is defective for not acknowledging it. – n. m. could be an AI Aug 11 '20 at 19:46

2 Answers

The strict aliasing rule is violated when you access an object through a glvalue that does not have an acceptable type.

First consider a well-defined case:

char* data = reinterpret_cast<char*>(new char8_t[10]{});
size_t field_offset = 0;
size_t field_length = 10;
char8_t* field_ptr = reinterpret_cast<char8_t*>(data + field_offset);
u8string_view field(field_ptr, field_length);
field[0] + field[1];  // reads char8_t objects that really exist

There is no UB here. You create an array of char8_t and then access elements of that array.

Now what happens if the object in the memory referenced by data was created by another program? According to the standard this is UB, because the object was not created in one of the ways the standard specifies.

But the fact that your code is not yet supported by the standard is not a problem here. This code is supported by all compilers. If it were not, nothing would work: you could not even make the simplest system call, because most communication between a program and a kernel goes through arrays of char. So as long as, inside your program, you access the memory between data+field_offset and data+field_offset+field_length only through glvalues of type char8_t, your code will work as expected.

Oliv
  • "nothing would work" -- Well, you can safely serialize/deserialize C++ objects to/from a char array with `memcpy`. Isn't that what is usually done (if copying isn't considered a problem)? – smls Aug 11 '20 at 20:21
  • "This code is supported by all compilers." -- Where can I find more information about what kind of technically strict-aliasing-unsafe code is actually safe in practice? I've read about strict-aliasing related bugs in open-source libraries, so I don't think that strict-aliasing is a non-issue altogether. GCC also wouldn't have the `-fno-strict-aliasing` switch if it was. – smls Aug 11 '20 at 20:24
  • @smls There is no strict aliasing rule violation, just an access to an object that is not created by the running program. – Oliv Aug 11 '20 at 20:27
  • @smls About your first comment: `memcpy` does not change your problem, because there must still be an original object created according to the specification. With C++20, it is possible to be standard compliant thanks to `bit_cast`. – Oliv Aug 11 '20 at 20:37
  • @smls: Nearly all, if not all non-diagnostic compilers will support such constructs when optimizations are disabled. The maintainers of clang and gcc's optimizers, however, have stated that they feel no obligation to make future compilers behave usefully in cases where all present ones do so but the Standard wouldn't require it. – supercat Aug 11 '20 at 20:39
  • @supercat So let's hope the standard will make such code standard compliant! But here it is supported by all compilers with all optimizations on, even LTO. – Oliv Aug 11 '20 at 20:41
  • @Oliv: The C++ Standard has no concept of programs being "compliant" or not. The only distinction it draws is between situations where an implementation would be required to process a program meaningfully, a few where it would be forbidden from executing a program at all, and a large number where an implementation wouldn't be required to behave meaningfully but could (and in many cases probably should, though the Standard regards quality-of-implementation issues as outside its jurisdiction) do so anyway. – supercat Aug 11 '20 at 20:44

The same problem comes up occasionally in other contexts too, for example when using shared memory.

A trick to create objects using bits in "raw" memory without allocating memory is to create a local object by memcpy, and then create a dynamic copy of that local object over the "raw" memory. Example:

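char* begin_raw = data + field_offset;
char8_t* last {};
for(std::size_t i = 0; i < field_length; i++) {
    char* current = begin_raw + i;
    char8_t local {};
    // Copy the byte's value out of the raw storage...
    std::memcpy(&local, current, sizeof local);
    // ...then create a char8_t object with that value in the same storage.
    last = new (current) char8_t(local);
}
// Assumes field_length > 0; otherwise last is still null.
char8_t* begin = last - (field_length - 1);
std::u8string_view field(begin, field_length);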

Before you object that you don't want to copy, notice that the end result causes no changes to the representation of the "raw" memory. The compiler can notice this too, and can compile the entire loop into zero instructions (in my tests GCC and Clang achieve this with -O2). All that we have done is satisfy the object lifetime rules of the language by creating dynamic objects in that memory.

eerorika
  • There is still the UB caused by pointer arithmetic. One could try to fix this by using an array of char8_t. But I don't see how it could be done without a compiler extension to the language (C array of dynamic size) or a call to alloca. – Oliv Aug 12 '20 at 05:37
  • Currently I can't see some UB with pointer arithmetic. Can you explain it? – Bernd Aug 12 '20 at 09:38
  • @Bernd Pointer arithmetic is only allowed on a pointer to an element of an array (a single object is considered an array of size one for the pointer arithmetic rules). There is no array here so `last - (field_length - 1)` is UB. – Oliv Aug 12 '20 at 13:33
  • @Bernd it's the overly strict wording of the standard which defines pointer arithmetic only within arrays. No array object was created in the example. For this same reason, any attempt to write a custom vector is impossible (technically there is array placement new, but that is practically unusable for other reasons). – eerorika Aug 12 '20 at 13:36