
I have this text, which contains non-ASCII Unicode characters:

  NSString *fileName = @"Tên tình bạn dưới tình yêu.mp3";
  const char *cStringFile = [fileName UTF8String];

Now I need to save this string to a file so that its bytes, shown in hex, look like this:

 T  ê  n     t  ì  n  h     b    ạ   n
 54 EA 6E 20 74 EC 6E 68 20 62 1EA1 6E ...... and so on

As you can see the character 'ê' is written as EA, but 'ạ' is written as '1E A1' which is correct as per the Vietnamese character set (https://vietunicode.sourceforge.net/charset/)

To achieve this, I used the following code to write the multibyte characters to the file:

// Determine the required size for the wchar_t string
size_t input_length = strlen(cStringFile);
size_t output_length = mbstowcs(NULL, cStringFile, input_length);

// Allocate memory for the wchar_t string
wchar_t *output = (wchar_t *)malloc((output_length + 1) * sizeof(wchar_t));
if (output == NULL) {
    printf("Memory allocation failed.\n");
    return 1;
}

// Convert the C string to wchar_t string
mbstowcs(output, cStringFile, input_length);
output[output_length] = L'\0'; // Add null-termination

size_t length = wcslen(output);
// Loop through each character in the Unicode text
for (size_t i = 0; i < length; i++) {
    // Write the Unicode character to the file
    fwprintf(fd, L"%lc", output[i]);
}

// Free the allocated memory
free(output);

Now the issue is the multibyte characters are not being converted to the correct HEX value with the code above

Example 1) For this text = "Tên tình bạn dưới tình yêu.mp3"
Expected: 
T  ê  n     t  ì  n  h     b    ạ   n
54 EA 6E 20 74 EC 6E 68 20 62 1EA1 6E ...... and so on

Actual: Wrong!
T   ê   n     t   ì   n  h     b   ạ     n
54 C3AA 6E 20 74 C3AC 6E 68 20 62 E1BAA1 6E ...... and so on

Example 2) For this text = "最佳歌曲在这里.mp3"
Expected: 
最-\u6700 佳-\u4F73 歌-\u6B4C 曲-\u66F2  在-\u5728 
67 00     4F 73    6B 4C        66 F2     57 28  .....  

Actual: Wrong!
最        佳        歌        歌        曲
E6 9C     80 BD    B3 AD     8C 9B     B2 9C    

So I think it is writing 2 bytes in the case of 'ê' and 'ì' and 3 bytes in the case of 'ạ'. The code is not writing the Hex equivalent of the multibyte character.
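
A minimal sketch of that observation, assuming the input is valid UTF-8: the bytes `C3 AA` and `E1 BA A1` are the UTF-8 encodings of the code points U+00EA and U+1EA1, so getting the values in the "Expected" rows means decoding each UTF-8 sequence rather than copying its bytes. `utf8_decode` is a hypothetical helper written for illustration, not a library function, and it does not reject overlong forms or surrogate code points.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is malformed.
 * Hypothetical helper: no checks for overlong forms or surrogates. */
size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    size_t len;

    if (s[0] < 0x80) {                    /* 0xxxxxxx: ASCII, one byte */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {   /* 110xxxxx: two bytes */
        *cp = s[0] & 0x1F;
        len = 2;
    } else if ((s[0] & 0xF0) == 0xE0) {   /* 1110xxxx: three bytes */
        *cp = s[0] & 0x0F;
        len = 3;
    } else if ((s[0] & 0xF8) == 0xF0) {   /* 11110xxx: four bytes */
        *cp = s[0] & 0x07;
        len = 4;
    } else {
        return 0;                         /* stray continuation or invalid lead byte */
    }

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)        /* must be 10xxxxxx; also stops at NUL */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F); /* append 6 payload bits */
    }
    return len;
}
```

Called in a loop over `cStringFile` (cast to `const unsigned char *`), advancing by the returned length each time, this yields one code point per character, for example 0x54, 0xEA, 0x6E for "Tên".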

What could be the issue? Any help would be appreciated.

=====

I tried another approach not using wchar, checking if a character is a multibyte character and writing all bytes if true

    NSString *fileName = @"Tên tình bạn dưới tình yêu.mp3";
    const char *stringText = [fileName UTF8String];
    unsigned long len = strlen(stringText);
    setlocale(LC_ALL, "");
    for (char character = *stringText; character != '\0'; character = *++stringText)
    {
        if (!character) {
            continue;
        }
        putchar(character);
        int byteCount = numberOfBytesInChar((unsigned char)character);
        if (byteCount <= 1) {
            //putchar(character);
            fprintf(fd, "%c", character);
        } else {
           
            //putchar(character);
            for(int k = 0; k < byteCount; k++)
            {
                fprintf(fd, "%c", character);
                character = *++stringText;
            }
        }
    }

    int numberOfBytesInChar(unsigned char val) {
      if (val < 128) {
         return 1;
      } else if (val < 224) {
         return 2;
      } else if (val < 240) {
         return 3;
      } else {
        return 4;
      }
   }

Even now it is not writing the expected hex equivalent for multibyte characters.

Example 1) For this text = "Tên tình bạn dưới tình yêu.mp3"
Expected: 
T  ê  n     t  ì  n  h     b    ạ   n
54 EA 6E 20 74 EC 6E 68 20 62 1EA1 6E ...... and so on

Actual: Wrong!
T   ê   n     t   ì   n  h     b   ạ     n
54 C3AA 6E 20 74 C3AC 6E 68 20 62 E1BAA1 6E ...... and so on

Example 2) For this text = "最佳歌曲在这里.mp3"
Expected: 
最-\u6700 佳-\u4F73 歌-\u6B4C 曲-\u66F2  在-\u5728 
67 00     4F 73    6B 4C        66 F2     57 28  .....  

Actual: Wrong!
最        佳        歌        歌        曲
E6 9C     80 BD    B3 AD     8C 9B     B2 9C     

Any pointers?

KamyFC
  • Do you have a reference for your "expected" multibyte character encoding scheme (which does not look very sensible to me)? The "actual" sequence is using standard Unicode UTF-8 encoding. – Ian Abbott Jun 29 '23 at 08:33
  • Also, isn't it the input string that is encoded as UTF-8? The output in the `output` buffer should contain wide characters (probably 4 bytes per character on Mac OS X and iOS), not multibyte character sequences. – Ian Abbott Jun 29 '23 at 08:54
  • @IanAbbott - The reference can be found here (https://vietunicode.sourceforge.net/charset/) The character 'ê' is written as EA and 'ạ' is written as '1E A1' which is what the 'expected' result shows. – KamyFC Jun 29 '23 at 09:07
  • Avoid `wchar_t` and the related functions if you can; they are *obsolete*. Use `char`, and things should often be "automatic" with UTF-8. Just remember that one glyph/character can take many *char* (this is also true of UTF-16; wchar is usually UCS-2, and even then there are combining characters, etc.). – Giacomo Catenazzi Jun 29 '23 at 09:15
  • @GiacomoCatenazzi _"... they are obsolete"_: can you point to any Microsoft documentation that says they are obsolete? – Jabberwocky Jun 29 '23 at 09:33
  • @GiacomoCatenazzi Thanks. I tried an approach without using wchar. I added it to the query. It did not work. Can you check? – KamyFC Jun 29 '23 at 10:21
  • @Jabberwocky I'm on macOS. wchar is found in Not sure if it is Microsoft dependent? – KamyFC Jun 29 '23 at 10:22
  • I added couple of examples - Chinese text and Vietnamese text Also added an alternate approach of checking if a character is a multibyte character and writing all bytes if true. This too did not work. The edited question has the alternate approach. – KamyFC Jun 29 '23 at 11:21
  • @KamyFC None of the columns in that charset table contain the hex byte sequence '1E A1'. – Ian Abbott Jun 29 '23 at 11:42
  • @IanAbbott If you open the link, search for '1EA1', you will find the mapping with 'ạ' – KamyFC Jun 29 '23 at 11:45
  • @Jabberwocky: Objective-C is not Microsoft. In any case, can you find some interchange format which still recommends UTF-16? (I'm not speaking about using it internally.) OTOH I have some impression that Microsoft is also moving to UTF-8 (it would be much easier, but then, like JavaScript, there are APIs for UCS-2, for UTF-16 and for UTF-8). – Giacomo Catenazzi Jun 29 '23 at 11:53
  • @KamyFC The first column does not count. That is the actual Unicode code point U+1EA1. In UTF-8 it is represented by the bytes 'E1 BA A1'. In UTF-16LE it is represented by the bytes 'A1 1E'. In UTF-16BE it is represented by the bytes '1E A1'. In UTF-32LE it is represented by the bytes 'A1 1E 00 00'. In UTF-32BE it is represented by the bytes '00 00 1E A1'. – Ian Abbott Jun 29 '23 at 11:54
  • @KamyFC: right, with char your code cannot work. You should do the *contrary*: you want to get the Unicode code point (warning: most wchar interfaces work with code units, so never with the 4-byte UTF-8 characters). So if you start with UTF-8, you should calculate the code point, i.e. do the reverse: check the first byte. – Giacomo Catenazzi Jun 29 '23 at 11:57
  • @KamyFC You show `最` "actually" encoded as 2 bytes, but actually it is encoded (in UTF-8) as 3 bytes (hex E6 9C 80). You incorrectly show the third byte as encoding the first byte of the next character: `佳` (which is actually encoded in UTF-8 as hex E4 BD B3). – Ian Abbott Jun 29 '23 at 12:23
  • @IanAbbott Thank you. So "UTF-16BE" is what I need to look at? As that is the expected output that I need to match. Any pointers on how to write char in "UTF-16BE"? – KamyFC Jun 29 '23 at 12:28
  • @IanAbbott You are right, my 'actual table' for the expected is not right. if you look at the CJK set https://www.compart.com/en/unicode/block/U+4E00 最 is mapped to \u6700 - I'm assuming this too is "UTF-16BE"? – KamyFC Jun 29 '23 at 12:30
  • Your other approach with `for(int k = 0; k < byteCount; k++)` skips the first byte of the following character when byteCount is greater than 1. It should be something like `fprintf(fd, "%c", character);` `for (int k = 1; k < byteCount; k++)` `{` `character = *++stringText;` `fprintf(fd, "%c", character);` `}`. – Ian Abbott Jun 29 '23 at 13:30
  • @KamyFC UTF-16BE would encode ASCII characters as two bytes, e.g. `T` would be encoded as hex bytes 00 54. That is different to what you expected, where `T` would be encoded as a single hex byte 54. What software actually needs the text to be encoded in your weird (and impossible to reliably decode) format? – Ian Abbott Jun 29 '23 at 14:05
  • @IanAbbott Actually encoding ASCII characters as two bytes with a dummy 00 is exactly what the software does which we need to match! I thought 00 was a separator all this time! So they are actually encoding in UTF-16BE. That is interesting. So 'T' = 00 54 : ê = 00 EA : n = 00 6E : ạ = 1E A1 : If ASCII characters are encoded as two bytes, How many bytes will Unicode characters be encoded in, with UTF-16BE? The software that does this crazy obfuscation is called Serato - a DJ management tool. – KamyFC Jun 29 '23 at 15:02
  • @IanAbbott Thanks for the correction in the loop. I will attempt to use your suggestion. Also I will see how to encode in UTF-16BE using Objective C – KamyFC Jun 29 '23 at 15:04
  • Each UTF-16(BE or LE) code unit is 2 bytes. All characters below U+010000 (i.e. characters in the Unicode Basic Multilingual Plane) will be encoded in a single code unit (i.e. 2 bytes). Characters U+010000 onwards will be encoded in two code units (i.e. 4 bytes) with special values. The first code unit (the "high surrogate") will be in the range U+D800 to U+DBFF and the second code unit (the "low surrogate") will be in the range U+DC00 to U+DFFF. Together, the high surrogate and low surrogate code units form a "surrogate pair". – Ian Abbott Jun 29 '23 at 15:14
  • I don't know Objective C, but you can use something called an "iconv" library. – Ian Abbott Jun 29 '23 at 15:17
  • The functions you are looking for are `iconv_open()`, `iconv()`, and `iconv_close()`, which are part of the POSIX standard, but the names of the "tocode" and "fromcode" character sets are implementation defined. You will need to find out the names that your target OS uses for these character sets. – Ian Abbott Jun 29 '23 at 15:27
  • Does `[songInfo.fileNamePath dataUsingEncoding:NSUTF16BigEndianStringEncoding]` return the data you want? – Willeke Jun 29 '23 at 16:17
  • @Willeke Thanks for the suggestion. I tried *** NSData *dataBE = [songInfo.fileNamePath dataUsingEncoding:NSUTF16BigEndianStringEncoding]; NSString *objCStringPath = [[NSString alloc] initWithData:dataBE encoding:NSUTF16BigEndianStringEncoding]; const char *cStringPath = (const char *)[objCStringPath cStringUsingEncoding:NSUTF16BigEndianStringEncoding]; *** Interesting thing is I can see the path in 'objCStringPath' but cStringPath is "". cStringUsingEncoding does not seem to like NSUTF16BigEndianStringEncoding. Will check. – KamyFC Jun 29 '23 at 16:39
  • Thanks @IanAbbott, I'm checking if we can do UTF16BigEndian in Objective C, if not, I will check iconv_open(). I'm on macOS, when I run 'iconv -l' in Terminal, I see many encoding entries - and we find the ones I'm interested in "UTF-8 UTF8" and "UTF-16BE". So will I have to use the iconv library to convert every character and then write the character to the file, or write it as usual and then convert the entire file contents to "UTF-16BE"? What do you propose is the right approach? – KamyFC Jun 29 '23 at 16:43
  • If this is the data you want then you can write it to a file. You don't need the C string. from the documentation of `cStringUsingEncoding`: "UTF-16 and UTF-32 are not considered to be C string encodings, and should not be used with this method—the results of passing NSUTF16StringEncoding, NSUTF32StringEncoding, or any of their variants are undefined." – Willeke Jun 29 '23 at 17:02
  • @Willeke Thanks for the nudge towards the approach of using NSString methods. The following worked - ***** NSData *dataBE = [fileName dataUsingEncoding:NSUTF16BigEndianStringEncoding]; NSString *objCStringPath = [[NSString alloc] initWithData:dataBE encoding:NSUTF16BigEndianStringEncoding]; [objCStringPath writeToFile:@"/Users/user/Desktop/test" atomically:YES encoding:NSUTF16BigEndianStringEncoding error:nil]; **** Now I can see the UTF-16BE encoding in the file with the correct output as expected. – KamyFC Jun 30 '23 at 06:42
  • @Willeke Thank you, you can add an answer and I can mark it as answered, so you get the credit. – KamyFC Jun 30 '23 at 06:44
  • Also super thanks to @IanAbbott for helping me decipher that the encoding expected was indeed "UTF16BigEndian", which pointed me to the right direction. – KamyFC Jun 30 '23 at 06:51
  • Hey, I think you should know that, the code point is a bit different from the multibyte stored in your file. It's very different for different encoding formats, eg: UTF-8, UTF-16, UTF-32. – Neal.Marlin Jun 30 '23 at 08:42
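
The UTF-16 rules described in the comments above (one 2-byte code unit below U+010000, a surrogate pair from U+010000 onwards) can be sketched in C as follows. This is only an illustration of the encoding step; `utf16be_encode` is a hypothetical name, and the caller is assumed to already have the code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16BE into out (room for 4 bytes).
 * Returns the number of bytes written (2 or 4), or 0 if cp is not a valid
 * Unicode scalar value. Hypothetical helper for illustration. */
size_t utf16be_encode(uint32_t cp, unsigned char out[4])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                           /* out of range or lone surrogate */

    if (cp < 0x10000) {                     /* BMP: a single code unit */
        out[0] = (unsigned char)(cp >> 8);  /* big-endian: high byte first */
        out[1] = (unsigned char)(cp & 0xFF);
        return 2;
    }

    cp -= 0x10000;                          /* 20 bits remain */
    uint32_t hi = 0xD800 + (cp >> 10);      /* high surrogate: top 10 bits */
    uint32_t lo = 0xDC00 + (cp & 0x3FF);    /* low surrogate: bottom 10 bits */
    out[0] = (unsigned char)(hi >> 8);
    out[1] = (unsigned char)(hi & 0xFF);
    out[2] = (unsigned char)(lo >> 8);
    out[3] = (unsigned char)(lo & 0xFF);
    return 4;
}
```

With this, 'T' (U+0054) becomes `00 54` and 'ạ' (U+1EA1) becomes `1E A1`, matching the byte values discussed in the comments.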

1 Answer


NSString can work with encodings.

Extract the data from the string and write it to disk:

NSError *error = nil;
NSData *dataBE = [fileName dataUsingEncoding:NSUTF16BigEndianStringEncoding];
[dataBE writeToFile:@"/Users/user/Desktop/test" options:NSDataWritingAtomic error:&error];

or write the string to disk:

[fileName writeToFile:@"/Users/user/Desktop/test" atomically:YES encoding:NSUTF16BigEndianStringEncoding error:&error];
Willeke
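
For plain C code that cannot use NSString, the iconv route mentioned in the comments might look like the sketch below. It is not part of the answer above. `utf8_to_utf16be` is a hypothetical helper, and the charset names "UTF-8" and "UTF-16BE" are an assumption: they are implementation-defined, although the asker reports that `iconv -l` lists both on macOS.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16BE using iconv.
 * Returns the number of bytes written to out, or (size_t)-1 on failure.
 * Hypothetical helper; assumes the local iconv accepts the names
 * "UTF-8" and "UTF-16BE". */
size_t utf8_to_utf16be(const char *in, unsigned char *out, size_t out_cap)
{
    iconv_t cd = iconv_open("UTF-16BE", "UTF-8");  /* (tocode, fromcode) */
    if (cd == (iconv_t)-1)
        return (size_t)-1;                         /* conversion not supported */

    char *src = (char *)in;        /* iconv takes char **, input is not modified */
    size_t src_left = strlen(in);
    char *dst = (char *)out;
    size_t dst_left = out_cap;

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        return (size_t)-1;         /* invalid input or output buffer too small */
    return out_cap - dst_left;     /* bytes produced */
}
```

The converted buffer can then be written in one call, for example `fwrite(buf, 1, n, fd)`. Converting the whole string at once avoids handling characters one by one. On some platforms the program may need to be linked with `-liconv`.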