
I am saving a UTF-8 encoded file that contains some information, including a name for a button, from the Dart side with the following code:

file.writeAsString([
          name.length.toString(),
          name + Constants.nativeFileDelimeter,
          ids.length.toString(),
          ids.join(" "),
        ].join(" "));

// Constants.nativeFileDelimeter is "|"; it is used so that the user can enter a name containing whitespace

I read the same file with C and use FFI to pass data between C and Dart.

        FILE *file;

        file = fopen(filePath, "r");

        if (!file) {
            LOGE("Could not open %s!", filePath);
            *operationState = MediaLoadState::FAILED_TO_LOAD;
            goto cleanup;
        }

        int32_t size;

        if(fscanf(file, "%d ", &size) != 1){
            LOGE("fscanf can not assign variables %s!", filePath);
            *operationState = MediaLoadState::FAILED_TO_LOAD;
            goto cleanup;
        }

        // +1 because C strings end with '\0'
        *namePtr = new char[size + 1];

        if (size != 0){
            if(fscanf(file, "%[^|]|", *namePtr) != 1){
                LOGE("fscanf can not assign variables %s!", filePath);
                *operationState = MediaLoadState::FAILED_TO_LOAD;
                goto cleanup;
            }
        }

Dart code that reads the pointer saved by C:

  Pointer<Pointer<Utf8>> _namePtrPtr;
  String get name => Utf8.fromUtf8(_namePtrPtr.value);

My problem is that this code works without any bugs, even with Japanese and Russian characters, but when emojis are introduced things get weird. When I save a file containing emojis and try to read it with C and Dart FFI, I get strange errors thrown by Utf8.fromUtf8. For example:

Unfinished UTF-8 octet sequence (at offset 48)

Sometimes the same code works and renders the emojis, but later on the app crashes randomly. The exception seems to be different each time I read the file; sometimes I get no exception at all but a crash later. It is not consistent. I have no idea what I am doing wrong, since I expected this to work with emojis. Can anyone help me solve this issue?

cs guy

1 Answer


In Dart, String.length returns the number of UTF-16 code units. To read UTF-8 in C, you need the number of UTF-8 bytes instead. Therefore, write utf8.encode(name).length instead of name.length in the Dart code (and import dart:convert). The exceptions and crashes are probably caused by undefined behavior: when the stored size is too small, the buffer allocated in C is too small as well, and the unbounded fscanf writes past its end.
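To make the mismatch concrete, here is a minimal C++ sketch that counts both quantities for the same UTF-8 text. The helper names utf8_bytes and utf16_units are made up for illustration, and the code assumes its input is valid UTF-8:

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 encoding: what the C side needs for its buffer.
std::size_t utf8_bytes(const std::string& s) {
    return s.size();
}

// Number of UTF-16 code units for the same text: what Dart's String.length reports.
// Assumes s holds valid UTF-8.
std::size_t utf16_units(const std::string& s) {
    std::size_t units = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) == 0x80) continue;  // continuation byte, its lead byte was counted
        units += (c >= 0xF0) ? 2 : 1;      // 4-byte sequences become a surrogate pair
    }
    return units;
}
```

For a single emoji such as U+1F600 the first function gives 4 and the second gives 2, so a buffer sized from String.length is two bytes short before the terminator is even written. The two counts only agree for pure ASCII text.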

Storing the data size separately in a text format is error-prone. A better approach is this:

It seems you are using C++. There, you can just open the file as std::ifstream, create a std::string name; and use std::getline(file, name, '|'); to read the name with dynamic size. You can use *namePtr = strdup(name.c_str()) to create a plain C string out of the std::string.

lukasl
  • Thank you for the brilliant answer. I used C file operations because, as you can see in your answer, your code creates an extra copy of name to store it in the pointer. I didn't want to copy the string, in order to speed things up – cs guy Jan 12 '21 at 21:58
  • Just tested it with `utf8.encode(name).length` and there are no issues so far. It seems very stable and all the old issues are gone. Thank you so much, this bugged me for two days. Since file reading defaults to UTF-8, I thought the length was the UTF-8 length. Silly me :) – cs guy Jan 12 '21 at 22:14
  • Generally, one should prefer robustness or at least crash safety over speed if possible. And if you need speed, you generally need to benchmark. The extra UTF-8 encoding step, int to string conversion (of the length) in Dart, and the string to int conversion in C is probably more expensive than an extra string copy in C would be. Also, there are other ways for speeding this up, for example, you can pass just the c_str pointer without strdup, if you just keep the std::string in C++ until the Utf8.fromUtf8 call on Dart completes. – lukasl Jan 12 '21 at 22:44
  • Wise words, but am I wrong for thinking this convention is safe? It certainly crashed in this case, but that was because my length was wrong. What could possibly go wrong with this way of reading a file, i.e. first the size, then the characters? I really can't see it. Could you help me see it? As long as the input file is properly written, I cannot see any reason for C file I/O to fail, to be honest. I am open to any suggestions! – cs guy Jan 13 '21 at 00:48
  • Programs generally should be safe (no undefined behavior) even in case of invalid external inputs such as file contents. Otherwise, it could be seen as a vulnerability. As I explained, the current approach is probably not even faster than using just the delimiter. Or use fgets instead of fscanf if you really want to read by size (and omit the redundant delimiter). This would eliminate the undefined behavior too. But the current approach is unnecessarily error-prone and not even the fastest. – lukasl Jan 13 '21 at 01:26
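The fgets suggestion in the last comment could look roughly like the sketch below. The function name read_sized_name and the cap kMaxNameBytes are assumptions made for illustration, and the layout is assumed to be "<size> <name bytes>" without the '|' delimiter. Since fgets never writes more than the count it is given, a wrong size in the file leads to a clean failure instead of a buffer overflow:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Reads "<size> " followed by exactly <size> bytes of name from an open file.
// Returns false on malformed or truncated input.
bool read_sized_name(std::FILE* file, std::string& name) {
    constexpr long kMaxNameBytes = 4096;  // hypothetical sanity cap, not from the question
    long size = 0;
    if (std::fscanf(file, "%ld", &size) != 1) return false;
    if (size < 0 || size > kMaxNameBytes) return false;
    if (std::fgetc(file) != ' ') return false;  // exactly one separator space

    std::vector<char> buffer(static_cast<std::size_t>(size) + 1, '\0');
    if (size > 0) {
        // fgets stores at most size bytes plus the terminator.
        if (!std::fgets(buffer.data(), static_cast<int>(size) + 1, file)) return false;
        // Fewer bytes than announced: early EOF, a newline, or an embedded NUL.
        if (std::strlen(buffer.data()) != static_cast<std::size_t>(size)) return false;
    }
    name.assign(buffer.data(), static_cast<std::size_t>(size));
    return true;
}
```

One limitation of this sketch is that fgets stops at a newline, so names containing line breaks are rejected rather than read.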