Can stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: return NSUTF16StringEncoding or NSUTF32StringEncoding?

Question

I'd like to know if calling stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: can return NSUTF16StringEncoding, NSUTF32StringEncoding or any of their variants?

The reason I'm asking is because of this documentation note on cStringUsingEncoding::

Special Considerations

UTF-16 and UTF-32 are not considered to be C string encodings, and should not be used with this method—the results of passing NSUTF16StringEncoding, NSUTF32StringEncoding, or any of their variants are undefined.

So I understand that creating a C string with UTF-16 or UTF-32 is unsupported, but I'm not sure if attempting String Encoding Detection with stringEncodingForData:encodingOptions:convertedString:usedLossyConversion: may return UTF-16 and UTF-32 or not.

An example scenario (adapted from SSZipArchive.m) might be:

// name is a null-terminated C string built with `fread` from stdio.h:
char *name = (char *)malloc(size_name + 1);
size_t read = fread(name, 1, size_name, file); // read exactly size_name bytes; the extra byte holds the terminator
name[size_name] = '\0';

// dataName is the data object of name
NSData *dataName = [NSData dataWithBytes:(const void *)name length:sizeof(unsigned char) * size_name];

// stringName is the string object of dataName
NSString *stringName = nil;
NSStringEncoding encoding = [NSString stringEncodingForData:dataName encodingOptions:nil convertedString:&stringName usedLossyConversion:nil];

In the above code, can encoding be NSUTF16StringEncoding, NSUTF32StringEncoding or any of their variants?

Platforms: macOS 10.10+, iOS 8.0+, watchOS 2.0+, tvOS 9.0+.

Rob Napier · Accepted Answer · 2018-12-26T16:33:29.293

Yes, if the string is encoded using one of those encodings. The notes about C strings are specific to C strings. An NSString is not a C string, and the method you're describing doesn't work on C strings; it works on arbitrary data that may be encoded in a wide variety of ways.

As an example:

#import <Foundation/Foundation.h>

int main(int argc, const char * argv[]) {
    @autoreleasepool {
        NSData *data = [@"test" dataUsingEncoding:NSUTF16StringEncoding];
        NSStringEncoding encoding = [NSString stringEncodingForData:data
                                                    encodingOptions:nil
                                                    convertedString:nil
                                                usedLossyConversion:nil];
        NSLog(@"%lu == %lu", (unsigned long)encoding,
                             (unsigned long)NSUTF16StringEncoding);
    }
    return 0;
}
// Output:   10 == 10

That said, in your specific example, if name is really what it says it is, "a null-terminated C string," then it could never be UTF-16, because C strings cannot be encoded in UTF-16. C strings are terminated by a 0x00 byte, and 0x00 bytes are very common in UTF-16 (every ASCII character contains one). Without seeing more code, however, I would not gamble on whether that comment is accurate.
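
To make the point about 0x00 bytes concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). The byte array is an illustrative little-endian UTF-16 encoding of "test" with a byte order mark, and `bytes_seen_as_c_string` is a hypothetical helper, not part of Foundation. It reports how many bytes a C string API would consume before stopping at the first 0x00.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "test" as UTF-16 little-endian with a byte order mark.
   The exact layout is an assumption chosen for illustration. */
const unsigned char kTestUTF16LE[] = {
    0xFF, 0xFE,   /* byte order mark */
    0x74, 0x00,   /* 't' */
    0x65, 0x00,   /* 'e' */
    0x73, 0x00,   /* 's' */
    0x74, 0x00    /* 't' */
};

/* Hypothetical helper: number of bytes that precede the first 0x00 byte,
   i.e. what strlen() would report if these bytes were treated as a C string.
   Returns `length` when no 0x00 byte is present. */
size_t bytes_seen_as_c_string(const unsigned char *bytes, size_t length)
{
    assert(bytes != NULL);
    const unsigned char *nul = memchr(bytes, 0x00, length);
    return nul ? (size_t)(nul - bytes) : length;
}
```

With these bytes, `bytes_seen_as_c_string(kTestUTF16LE, sizeof kTestUTF16LE)` stops after the byte order mark and the low byte of 't', so only 3 of the 10 bytes survive a round trip through a C string.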

If your real question here is "given an arbitrary c-string-safe encoding, is it possible that stringEncodingForData: will return a not-c-string-safe encoding," then the answer is "yes, it could, and it's definitely not promised that it won't even if it doesn't today." If you need to prevent that, I recommend using NSStringEncodingDetectionSuggestedEncodingsKey and ...UseOnlySuggestedEncodingsKey to force it to be an encoding you can handle. (You could also use ...DisallowedEncodingsKey to prevent specific multi-byte encodings, but that wouldn't be as robust.)

edited Dec 26 '18 at 16:33

answered Dec 26 '18 at 16:13

Rob Napier

Thank you for the time spent on this answer. Regarding seeing more code, it's all on the github link from the question. Yes, it can be an arbitrary C-string because I need to protect the lib from malicious files. On the day of the question I wasn't sure if a crafty C-string could be forged to be auto-detected as UTF-16 or UTF-32, including some unpaired surrogate and subsequently make the lib crash. So two months ago I added a _guard check_ at the end of [this commit](https://github.com/ZipArchive/ZipArchive/commit/a20807cb1f4e3a6887d71c5cf63a928a2bf3828c#diff-4654b741479e58db199e13624f81952d) – Cœur Dec 26 '18 at 16:54
My expectation is that one could likely create filenames that would decode as invalid UTF-16, particularly if the filename could be encoded in the zip file rather than the filesystem. But if UTF-16 (or invalid UTF-16) would crash the library, I'd definitely think you'd want to use `NSStringEncodingDetectionSuggestedEncodingsKey` to prevent it at the source. That said, I'm wondering if there's any danger of a lossy conversion being used and creating collisions. Currently there's no check for a lossy conversion. – Rob Napier Dec 26 '18 at 21:41
@RobNapier: Why does the result not change when doing for example `NSData *data = [@"🙃" dataUsingEncoding:NSUTF32StringEncoding];` and leaving `NSLog` as `NSUTF16StringEncoding`? – l'L'l Dec 29 '18 at 01:05

@l'L'l, this method is highly heuristic. One of its heuristics is that it will allow no more than one \0 character when decoding. "🙃" in UTF-32 is fffe0000 43f60100, which can be decoded legally as the UTF-16 string "\0\uF643\u0001". (U+F643 is a private-use code point, but is still legal UTF-16.) That is just one \0, so UTF-16 is accepted, and since it has a higher priority than UTF-32, it is selected. On the other hand, "test" in UTF-32 is fffe0000 74000000 65000000 73000000 74000000, which would decode to "\0t\0e\0s\0t\0" in UTF-16, which has 5 \0 characters, so UTF-16 is rejected, and it correctly selects UTF-32. – Rob Napier Dec 29 '18 at 17:54
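
The byte arithmetic in the comment above can be checked with a short sketch in plain C. It reinterprets the two quoted UTF-32 byte sequences as little-endian UTF-16 and counts the code units equal to 0x0000. `count_nul_utf16le_units` and the array names are hypothetical, and the "at most one \0" rule is the commenter's reverse-engineered observation rather than documented behavior, so this only verifies the counts, not what Foundation does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The two byte sequences quoted in the comment (UTF-32LE with a BOM). */
const unsigned char kEmojiUTF32LE[] = {
    0xFF, 0xFE, 0x00, 0x00,   /* byte order mark */
    0x43, 0xF6, 0x01, 0x00    /* U+1F643 */
};
const unsigned char kTestUTF32LE[] = {
    0xFF, 0xFE, 0x00, 0x00,   /* byte order mark */
    0x74, 0x00, 0x00, 0x00,   /* 't' */
    0x65, 0x00, 0x00, 0x00,   /* 'e' */
    0x73, 0x00, 0x00, 0x00,   /* 's' */
    0x74, 0x00, 0x00, 0x00    /* 't' */
};

/* Hypothetical helper: treat the bytes as UTF-16 little-endian, skip a
   leading FF FE byte order mark if present, and count the 16-bit code
   units whose value is 0x0000. A trailing odd byte is ignored. */
size_t count_nul_utf16le_units(const unsigned char *bytes, size_t length)
{
    assert(bytes != NULL);
    size_t offset = 0;
    if (length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE) {
        offset = 2;   /* skip the UTF-16LE byte order mark */
    }
    size_t count = 0;
    for (size_t i = offset; i + 1 < length; i += 2) {
        uint16_t unit = (uint16_t)(bytes[i] | (bytes[i + 1] << 8));
        if (unit == 0x0000) {
            count++;
        }
    }
    return count;
}
```

Under this reading, the emoji bytes yield the units 0x0000, 0xF643, 0x0001 (one NUL unit), while the "test" bytes yield five NUL units, matching the counts given in the comment.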

Many encodings are ambiguous. You can almost always decode a string of bytes in many different ways, and this method has a lot of preferences baked in (it prefers ASCII to UTF-8 for instance). This is the most complicated method I've ever tried to reverse engineer in Foundation, and I don't understand all of it yet. It does a *lot* of stuff and includes all kinds of special cases (particularly for Japanese and Korean) and "confidence" heuristics that it requires to exceed either 70% or 99% in some cases. – Rob Napier Dec 29 '18 at 18:01
