Why does converting a String to a URL in Swift 4.2 and then converting the URL back to a String using url.path change the encoding of special characters like German umlauts (ä, ö, ü), even if I use UTF-8 encoding?

I wrote some sample code to show my problem. I encoded the strings to base64 in order to show that there is a difference.

I also have a similar unsolved problem with special characters and Swift here.

Sample Code

import Foundation

let string = "/path/to/file"
let stringUmlauts = "/path/to/file/with/umlauts/testäöü"

let base64 = Data(string.utf8).base64EncodedString()
let base64Umlauts = Data(stringUmlauts.utf8).base64EncodedString()

print(base64, base64Umlauts)

let url = URL(fileURLWithPath: string)
let urlUmlauts = URL(fileURLWithPath: stringUmlauts)

let base64Url = Data(url.path.utf8).base64EncodedString()
let base64UrlUmlauts = Data(urlUmlauts.path.utf8).base64EncodedString()

print(base64Url, base64UrlUmlauts)

Output

The base64 and base64Url strings are identical, but base64Umlauts and base64UrlUmlauts differ.

"L3BhdGgvdG8vZmlsZQ==" for base64

"L3BhdGgvdG8vZmlsZQ==" for base64Url

"L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdMOkw7bDvA==" for base64Umlauts

"L3BhdGgvdG8vZmlsZS93aXRoL3VtbGF1dHMvdGVzdGHMiG/MiHXMiA==" for base64UrlUmlauts

When I put the base64Umlauts and base64UrlUmlauts strings into an online Base64 decoder, both decode to /path/to/file/with/umlauts/testäöü. The ä, ö, ü look identical in the two results, but their underlying bytes differ.

Yakuhzi

1 Answer

stringUmlauts.utf8 encodes the precomposed Unicode characters ä, ö, ü (U+00E4, U+00F6, U+00FC), one scalar each.

But urlUmlauts.path.utf8 encodes the decomposed form: the plain letters a, o, u, each followed by the combining diaeresis ¨ (U+0308).

This is why you get different base64 output: the characters look the same but are encoded as different byte sequences.

What's really interesting is that Array(stringUmlauts) and Array(urlUmlauts.path) compare equal. The difference doesn't appear until you look at the UTF-8 encoding of these otherwise equal String values.
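To see the two spellings in isolation, here is a minimal sketch that builds a single ä both ways with explicit scalar escapes, so it does not depend on how an editor or the file system stores the character. The variable names are made up for illustration.

```swift
// U+00E4 is the precomposed "ä"; "a" + U+0308 is the decomposed spelling.
let precomposed = "\u{E4}"
let decomposed = "a\u{308}"

// String and Character comparison use canonical equivalence.
print(precomposed == decomposed)                      // true
print(Array(precomposed) == Array(decomposed))        // true

// The scalar and UTF-8 views expose the underlying difference.
print(precomposed.unicodeScalars.map { $0.value })    // [228]
print(decomposed.unicodeScalars.map { $0.value })     // [97, 776]
print(Array(precomposed.utf8))                        // [195, 164]
print(Array(decomposed.utf8))                         // [97, 204, 136]
```

The byte pair 204, 136 (0xCC 0x88) is the UTF-8 encoding of U+0308, which matches the "zIg" fragments repeated in the base64UrlUmlauts output from the question.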

Since the base64 encoding is irrelevant, here's a more concise test:

import Foundation

let stringUmlauts = "/path/to/file/with/umlauts/testäöü"
let urlUmlauts = URL(fileURLWithPath: stringUmlauts)

print(stringUmlauts, urlUmlauts.path) // both print the same visible text

let rawStr = stringUmlauts
let urlStr = urlUmlauts.path

print(rawStr == urlStr) // true
print(Array(rawStr) == Array(urlStr)) // true
print(Array(rawStr.utf8) == Array(urlStr.utf8)) // false!!!

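So how is the UTF-8 encoding of two equal strings different? Swift's == tests canonical equivalence rather than byte-for-byte identity, so a precomposed ä and an a followed by a combining diaeresis count as equal, while .utf8 exposes the bytes that are actually stored.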

One solution to this is to use precomposedStringWithCanonicalMapping on the result of path, which normalizes the string to the precomposed form (Unicode Normalization Form C).

let urlStr = urlUmlauts.path.precomposedStringWithCanonicalMapping

Now you get true from:

print(Array(rawStr.utf8) == Array(urlStr.utf8)) // now true
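If you don't know which side is decomposed, one option is to normalize both strings before comparing their bytes. The following is only a sketch: hasSameNormalizedBytes is a hypothetical helper name, not a Foundation API, and the two sample strings are written with explicit escapes to stand in for the typed string and the value returned by path.

```swift
import Foundation

// Hypothetical helper (not a Foundation API): compares the UTF-8 bytes of
// two strings after converting both to precomposed (NFC) form.
func hasSameNormalizedBytes(_ lhs: String, _ rhs: String) -> Bool {
    let a = lhs.precomposedStringWithCanonicalMapping
    let b = rhs.precomposedStringWithCanonicalMapping
    return Array(a.utf8) == Array(b.utf8)
}

let typed = "test\u{E4}\u{F6}\u{FC}"            // precomposed äöü
let fromPath = "testa\u{308}o\u{308}u\u{308}"   // decomposed, like the path value above

print(Array(typed.utf8) == Array(fromPath.utf8))   // false
print(hasSameNormalizedBytes(typed, fromPath))     // true
```

Normalizing both inputs means the comparison no longer depends on which API produced each string.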
rmaddy
  • Thanks, now it works! But when I pass this `String` as an argument to a `Process`, the `NSTask` seems to convert it under the hood to the wrong encoding (see [here](https://stackoverflow.com/questions/53049738/adb-command-from-macos-application-with-special-characters)). Do you know how I can fix the problem there? – Yakuhzi Nov 01 '18 at 11:58