
I'd like to know if calling stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: can return NSUTF16StringEncoding, NSUTF32StringEncoding or any of their variants?

The reason I'm asking is because of this documentation note on cStringUsingEncoding::

Special Considerations

UTF-16 and UTF-32 are not considered to be C string encodings, and should not be used with this method—the results of passing NSUTF16StringEncoding, NSUTF32StringEncoding, or any of their variants are undefined.

So I understand that creating a C string with UTF-16 or UTF-32 is unsupported, but I'm not sure if attempting String Encoding Detection with stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: may return UTF-16 and UTF-32 or not.

An example scenario (adapted from SSZipArchive.m) might be:

// name is a null-terminated C string built with `fread` from stdio.h:
char *name = (char *)malloc(size_name + 1);
size_t read = fread(name, 1, size_name + 1, file);
name[size_name] = '\0';

// dataName is the data object of name
NSData *dataName = [NSData dataWithBytes:(const void *)name length:sizeof(unsigned char) * size_name];

// stringName is the string object of dataName
NSString *stringName = nil;
NSStringEncoding encoding = [NSString stringEncodingForData:dataName encodingOptions:nil convertedString:&stringName usedLossyConversion:nil];

In the above code, can encoding be NSUTF16StringEncoding, NSUTF32StringEncoding or any of their variants?


Platforms: macOS 10.10+, iOS 8.0+, watchOS 2.0+, tvOS 9.0+.

Cœur

1 Answer


Yes, if the string is encoded using one of those encodings. The notes about C strings are specific to C strings. An NSString is not a C string, and the method you're describing doesn't work on C strings; it works on arbitrary data that may be encoded in a wide variety of ways.

As an example:

#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSData *data = [@"test" dataUsingEncoding:NSUTF16StringEncoding];
        NSStringEncoding encoding = [NSString stringEncodingForData:data
                                                    encodingOptions:nil
                                                    convertedString:nil
                                                usedLossyConversion:nil];
        // NSStringEncoding is an NSUInteger, so cast to unsigned long and print with %lu.
        NSLog(@"%lu == %lu", (unsigned long)encoding,
                             (unsigned long)NSUTF16StringEncoding);
    }
    return 0;
}
// Output:   10 == 10

That said, in your specific example, if name is really what the comment says it is, "a null-terminated C string," then it could never be UTF-16, because C strings cannot be encoded in UTF-16. C strings are \0-terminated, and \0 bytes are very common in UTF-16 data (every ASCII character encoded as UTF-16 contains one). Without seeing more code, however, I would not gamble on whether that comment is accurate.
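To illustrate why UTF-16 bytes cannot be treated as a C string, here is a minimal sketch. `CStringLengthOfUTF16Bytes` is a hypothetical helper written for this illustration; it encodes a string as little-endian UTF-16 and reports how many bytes `strlen` would see before hitting the first \0.

```objective-c
#import <Foundation/Foundation.h>
#include <string.h>

// Hypothetical helper (not part of Foundation): encode `string` as
// little-endian UTF-16 and return the length that C string functions
// would report for those bytes.
static size_t CStringLengthOfUTF16Bytes(NSString *string) {
    NSData *data = [string dataUsingEncoding:NSUTF16LittleEndianStringEncoding];
    // Append one terminating NUL byte so that strlen is always well-defined.
    NSMutableData *buffer = [data mutableCopy];
    [buffer appendBytes:"\0" length:1];
    return strlen((const char *)buffer.bytes);
}
```

For @"test" the encoded bytes are 74 00 65 00 73 00 74 00, so `strlen` stops after the first byte even though the data is 8 bytes long.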

If your real question here is "given an arbitrary c-string-safe encoding, is it possible that stringEncodingForData: will return a not-c-string-safe encoding," then the answer is "yes, it could, and it's definitely not promised that it won't even if it doesn't today." If you need to prevent that, I recommend using NSStringEncodingDetectionSuggestedEncodingsKey and ...UseOnlySuggestedEncodingsKey to force it to be an encoding you can handle. (You could also use ...DisallowedEncodingsKey to prevent specific multi-byte encodings, but that wouldn't be as robust.)
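A minimal sketch of that approach follows. `DetectCStringSafeEncoding` is a hypothetical helper name, and the choice of UTF-8 plus Windows-1252 as the allowed list is only an assumption for illustration; substitute whatever encodings your code can actually handle.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): run encoding detection on
// `data`, but only among a fixed list of encodings that are safe to use
// with C strings. The list below is an assumption for illustration.
static NSStringEncoding DetectCStringSafeEncoding(NSData *data,
                                                  NSString **outString,
                                                  BOOL *outLossy) {
    NSDictionary *options = @{
        // Candidate encodings, as NSNumber-wrapped NSStringEncoding values.
        NSStringEncodingDetectionSuggestedEncodingsKey: @[
            @(NSUTF8StringEncoding),
            @(NSWindowsCP1252StringEncoding),
        ],
        // Restrict detection to the candidates listed above.
        NSStringEncodingDetectionUseOnlySuggestedEncodingsKey: @YES,
    };
    // A return value of 0 means no encoding could be determined.
    return [NSString stringEncodingForData:data
                           encodingOptions:options
                           convertedString:outString
                       usedLossyConversion:outLossy];
}
```

A caller would typically treat a return value of 0 as a failure, and may also want to inspect the lossy flag before trusting the converted string.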

Rob Napier
  • Thank you for the time spent on this answer. Regarding seeing more code, it's all on the github link from the question. Yes, it can be an arbitrary C-string because I need to protect the lib from malicious files. On the day of the question I wasn't sure if a crafty C-string could be forged to be auto-detected as UTF-16 or UTF-32, including some unpaired surrogate and subsequently make the lib crash. So two months ago I added a _guard check_ at the end of [this commit](https://github.com/ZipArchive/ZipArchive/commit/a20807cb1f4e3a6887d71c5cf63a928a2bf3828c#diff-4654b741479e58db199e13624f81952d) – Cœur Dec 26 '18 at 16:54
  • My expectation is that one could likely create filenames that would decode as invalid UTF-16, particularly if the filename could be encoded in the zip file rather than the filesystem. But if UTF-16 (or invalid UTF-16) would crash the library, I'd definitely think you'd want to use `NSStringEncodingDetectionSuggestedEncodingsKey` to prevent it at the source. That said, I'm wondering if there's any danger of a lossy conversion being used and creating collisions. Currently there's no check for a lossy conversion. – Rob Napier Dec 26 '18 at 21:41
  • @RobNapier: Why does the result not change when doing for example `NSData *data = [@"🙃" dataUsingEncoding:NSUTF32StringEncoding];` and leaving `NSLog` as `NSUTF16StringEncoding`? – l'L'l Dec 29 '18 at 01:05
  • @l'L'l, this method is highly heuristic. One of its heuristics is that it allows no more than one \0 character when decoding. 🙃 (U+1F643) in UTF-32 is fffe0000 43f60100, which can be decoded legally as the UTF-16 string "\0\uF643\1". (U+F643 is a private-use code point, but is still legal UTF-16.) That is just one \0, so UTF-16 is accepted, and because it has a higher priority than UTF-32, it is selected. On the other hand, "test" in UTF-32 is fffe0000 74000000 65000000 73000000 74000000, which would decode to "\0t\0e\0s\0t\0" in UTF-16. That has 5 \0 characters, so UTF-16 is rejected, and it correctly selects UTF-32. – Rob Napier Dec 29 '18 at 17:54
  • Many encodings are ambiguous. You can almost always decode a string of bytes in many different ways, and this method has a lot of preferences baked in (it prefers ASCII to UTF-8 for instance). This is the most complicated method I've ever tried to reverse engineer in Foundation, and I don't understand all of it yet. It does a *lot* of stuff and includes all kinds of special cases (particularly for Japanese and Korean) and "confidence" heuristics that it requires to exceed either 70% or 99% in some cases. – Rob Napier Dec 29 '18 at 18:01
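The \0-counting arithmetic from the comments above can be checked with a small sketch. It does not call the detection method; it only reinterprets the two quoted UTF-32 byte sequences as little-endian UTF-16 code units and counts the U+0000 units. `CountNULsAsUTF16LE` and the two byte arrays are hypothetical names for this illustration, and the "no more than one \0" rule itself is the commenter's reverse-engineered observation, not documented behavior.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: treat `bytes` as little-endian UTF-16 code units,
// skip a leading FF FE byte-order mark if present, and count the units
// that equal U+0000.
static NSUInteger CountNULsAsUTF16LE(const uint8_t *bytes, NSUInteger length) {
    NSUInteger start = (length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) ? 2 : 0;
    NSUInteger count = 0;
    for (NSUInteger i = start; i + 1 < length; i += 2) {
        if (bytes[i] == 0x00 && bytes[i + 1] == 0x00) {
            count++;
        }
    }
    return count;
}

// The UTF-32 byte sequences quoted in the comment (BOM fffe0000 first).
static const uint8_t kSingleScalarUTF32[] = {
    0xFF, 0xFE, 0x00, 0x00,  0x43, 0xF6, 0x01, 0x00,
};
static const uint8_t kTestUTF32[] = {
    0xFF, 0xFE, 0x00, 0x00,  0x74, 0x00, 0x00, 0x00,  0x65, 0x00, 0x00, 0x00,
    0x73, 0x00, 0x00, 0x00,  0x74, 0x00, 0x00, 0x00,
};
```

Calling `CountNULsAsUTF16LE(kSingleScalarUTF32, sizeof kSingleScalarUTF32)` yields 1, while the same call on `kTestUTF32` yields 5, matching the counts given in the comment.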