Handle non-ASCII filenames in XHR uploading

Question

I have pretty standard javascript/XHR drag-and-drop file upload code, and just came across an unfortunate real-world snag. I have a file on my (Win7) desktop called "TEST-é-TEST.txt". In Chrome (30.0.1599.69), it arrives at the server with filename in UTF-8, which works out fine. In Firefox (24.0), the filename seems mangled when it arrives at the server.

I didn't trust what Firebug/Chrome might be telling me about the encoding, so I examined the hex of the request packet. Everything else is the same except the non-ASCII character is indeed being encoded differently in the two browsers:

Chrome: C3 A9 (this is the expected UTF-8 for that character)
Firefox: EF BF BD (UTF-8 "replacement character"?!)

Is this a Firefox bug? I tried renaming the file, replacing the é with ó, and the Firefox hex was the same... so such a mangle really seems like a browser bug. (If Firefox were confusedly sending along ISO-8859-1, for example, without touching it, I'd see an E9 byte, and I could handle that on the server side, but it shouldn't mangle it!)

Regardless of the reason, is there something I can do on either the client or server sides to correct for this? If a replacement character is indeed being sent to the server, then it would seem unrecoverable there, so I almost certainly need to do it on the client side.

And yes, the page on which this code exists has charset=utf-8, and Firefox confirms that it perceives the page as UTF-8 under View>Character Encoding.

Furthermore, if I dump the filename to console.log, it appears fine there--I guess it's just getting mangled in/after setRequestHeader("X-File-Name",file.name).

Finally, it would seem that the value passed to setRequestHeader() should be able to have code points up to U+00FF, so U+00E9 (é) and U+00F3 (ó) shouldn't cause a problem, though higher codes could trigger a SyntaxError: http://www.w3.org/TR/XMLHttpRequest2/#the-setrequestheader-method

It's hard to say anything useful here without having an idea of exactly where XHR is supposed to be getting the filename in your case. Where is that coming from? — Boris Zbarsky, Oct 08 '13 at 20:21
On drop, e.dataTransfer.files ... I haven't tried hard-coding a filename, to eliminate that variable. — dlo, Oct 08 '13 at 20:49
I just hardcoded the filename; same behavior. xhr.setRequestHeader("X-File-Name", "TEST-é-TEST.txt"); — dlo, Oct 08 '13 at 21:01
I think this really narrows down where the issue is: Firefox's implementation of setRequestHeader(), when you pass it a value that contains non-ASCII values. How to overcome? — dlo, Oct 08 '13 at 21:03
Hmm. So looking at setRequestHeader in Firefox, it's doing what the spec says: dropping the high byte of each 16-bit unit of that JSString and putting the low byte in the header. So I'm not sure why you're seeing EF BF BD there. You should be seeing an E9 instead. What Firefox version are you using? — Boris Zbarsky, Oct 09 '13 at 01:07
Says right at top: 24.0 (I'm impressed if you were looking at FF source code... I tried that too, but it was my first time trying to peel through the layers) — dlo, Oct 09 '13 at 14:36
Well, dealing with FF source code is my day job, so... ;) In Firefox 24 we should be passing through the bytes as-is. In Firefox 23 we used to convert them to UTF-8. What are you using to examine the request packet? — Boris Zbarsky, Oct 09 '13 at 18:18
Fiddler, in hex view. btw- I tried to set up a live example at phpfiddle.org, but the server-side doesn't allow access to request headers. Look at this: http://jsfiddle.net/Ggv5V/ ... you just need to move this somewhere you can host echo.php — dlo, Oct 09 '13 at 20:21
So I just tried this: ` ` while running a python server on localhost:31339 that allows the preflight through. The wireshark capture showed the X-File-Name header contains an 0xE9 for both a Chrome 32 dev build and Firefox 24 and newer. Firefox 23 and Chrome 30 release send 0xC3 0xA9 (so it looks like Chrome has aligned with spec since then). None of the browsers send 0xEF 0xBF 0xBD over here... — Boris Zbarsky, Oct 10 '13 at 03:33
You're right: Fiddler (the packet inspector I was using) was misrepresenting the hex--for shame! I broke down the string received by the server and now do see the E9 byte, which is great. So 2 questions: (1) the dropping of the high bytes seems like a cheap hack to make the header string conform with ISO-8859-1 spec; in this case, the character was in fact not mangled, but wouldn't higher char codes be unrecoverable? (2) if I want support for the full range of UTF-8 filenames, what's the recommended approach? encodeURIcomponent() would work, right? — dlo, Oct 10 '13 at 14:23
You're right that dropping the high byte is lossy; unfortunately HTTP doesn't define a way to send general non-ASCII-ish headers. :( encodeURIComponent would work as a workaround, yes. So would `Array.prototype.map.call(new TextEncoder("UTF-8").encode("é"), c => String.fromCharCode(c)).join("")` or equivalent, but encodeURIComponent will be much more cross-browser compatible (e.g. doesn't depend on the browser implementing TextEncoder). — Boris Zbarsky, Oct 10 '13 at 17:30
Yeah, I thought about quoted-printable, but that's not built-in on the javascript side. It's an unfortunate browser behavior change, as I suspect many others will also see things break in hard-to-find ways. See my self-answer below and let me know if you have any further comments/corrections. Thanks! — dlo, Oct 10 '13 at 17:39

score 9 · Accepted Answer · edited May 23 '17 at 12:09

Thanks so much for Boris's help. Here's a summary of what I discovered through our interactions in comments:

1) The core issue is that HTTP request headers are supposed to be ISO-8859-1. Earlier versions of Chrome and Firefox both sent the string passed to setRequestHeader() as UTF-8. This changed in Firefox 24 (and apparently will be changing in Chrome soon too): Firefox now drops the high byte of each 16-bit code unit and sends only the low byte. In the example from my question (é, U+00E9) this was recoverable, because the server still receives the single byte E9, but characters with code points above U+00FF are mangled irretrievably.
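To make the lossiness concrete, here is a minimal sketch that imitates the low-byte behavior described above. The helpers lowBytes() and toHex() are hypothetical names made up for this illustration; they are not browser APIs, and the browser's real implementation may differ in detail.

```javascript
// Hypothetical helper: keep only the low byte of each UTF-16 code unit,
// imitating what the comments say Firefox 24 does to header values.
function lowBytes(value) {
  const bytes = [];
  for (let i = 0; i < value.length; i++) {
    bytes.push(value.charCodeAt(i) & 0xff);
  }
  return bytes;
}

// Hypothetical helper: format a byte array as space-separated hex.
function toHex(bytes) {
  return bytes
    .map((b) => b.toString(16).toUpperCase().padStart(2, "0"))
    .join(" ");
}

console.log(toHex(lowBytes("\u00e9"))); // é (U+00E9) -> E9, still recoverable
console.log(toHex(lowBytes("\u20ac"))); // € (U+20AC) -> AC
console.log(toHex(lowBytes("\u00ac"))); // ¬ (U+00AC) -> AC, same byte as €
```

Because € and ¬ collapse to the same byte, the server has no way to tell which one was sent.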

2) One workaround would be to encode on the client side, e.g.:

xhr.setRequestHeader('X-File-Name', encodeURIComponent(filename));

and then decode on the server side, e.g. in PHP:

$filename = rawurldecode($_SERVER['HTTP_X_FILE_NAME']);
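A quick way to check this workaround is to round-trip a few names in JavaScript. In the sketch below, encodeHeaderFilename() and isAsciiOnly() are hypothetical helper names, and decodeURIComponent() only stands in for the server-side decode (rawurldecode() in the PHP line above).

```javascript
// Hypothetical helper: percent-encode a filename so the header value
// contains only ASCII characters.
function encodeHeaderFilename(filename) {
  return encodeURIComponent(filename);
}

// Hypothetical helper: true if every character is in the ASCII range.
function isAsciiOnly(value) {
  return /^[\x00-\x7f]*$/.test(value);
}

const names = ["TEST-\u00e9-TEST.txt", "\u65e5\u672c\u8a9e.txt"];
for (const name of names) {
  const encoded = encodeHeaderFilename(name);
  // decodeURIComponent stands in for the server-side decode here.
  const decoded = decodeURIComponent(encoded);
  console.log(encoded, isAsciiOnly(encoded), decoded === name);
}
// First line printed: TEST-%C3%A9-TEST.txt true true
```

Since the encoded value is pure ASCII, dropping the high byte of each code unit leaves it unchanged, so the name survives regardless of how the browser treats non-ASCII header values.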

3) Note that this is only problematic because my ajax file upload approach is to send the raw file data in the request body, so I need to send the filename via a custom request header (as shown in many tutorials online). If I used FormData instead, I wouldn't have to worry about this. I believe if you want solid, standards-based unicode filename support, you should use FormData and not the request header approach.
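For comparison, a minimal sketch of the FormData approach might look like the following. The field name "file", the endpoint "/upload", and the function names buildUploadForm() and uploadWithFormData() are assumptions for illustration and do not come from my original code.

```javascript
// Hypothetical helper: wrap a File/Blob in a FormData object. The filename
// travels inside the multipart body instead of a custom request header.
function buildUploadForm(blob, filename, fieldName = "file") {
  const form = new FormData();
  form.append(fieldName, blob, filename);
  return form;
}

// Hypothetical helper: send the dropped file with XHR (browser only).
// "/upload" is an assumed endpoint.
function uploadWithFormData(file, url = "/upload") {
  const xhr = new XMLHttpRequest();
  xhr.open("POST", url);
  // Do not set Content-Type by hand; the browser adds the multipart
  // boundary itself when the body is a FormData object.
  xhr.send(buildUploadForm(file, file.name));
  return xhr;
}
```

In a drop handler this could be called as uploadWithFormData(e.dataTransfer.files[0]). On the PHP side the name would then typically be read from $_FILES rather than from a request header, though I have not verified that against every browser.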
