What is the correct way to use encodeURIComponent on non-UTF-8 characters and decode them accordingly?

Question

I have a JavaScript bookmarklet that uses encodeURIComponent to pass the URL of the current page to the server, which then calls urldecode to get the characters back.

The problem is that when the encoded characters are not in UTF-8 (in my case GB2312, but it could be something else), the characters the server gets back from urldecode show up as squares, which obviously isn't what they looked like before encoding.

Because it's a bookmarklet, the input could be anything, so I can't just hard-code "encode as GB2312" in the JS or "decode as GB2312" in the PHP scripts.

So, is there a correct way to use encodeURIComponent that passes the character encoding along with the content, so that the decoding side can pick the right encoding?

okm · Answer 1 · 2012-04-30T11:06:01.347

For how browsers handle encoding, especially with the GB2312 charset, check the following docs (in Chinese) first

In your case, %C8%B7%B6%A8 is actually generated from the GB2312 form of '\u786e\u5b9a'. This normally happens in (legacy?) versions of IE and FF when the user types Chinese characters directly into the location bar, or when the page content contains a non-standard link that skips IRI-to-URI encoding entirely and just renders the raw byte string, like '/tag/\xc8\xb7\xb6\xa8'. (douban.com used to do this for tags; they now use correct URI encoding in UTF-8.) I'm not entirely sure about the browser behavior because I cannot reproduce it in Chrome, so it may be worth testing in FF and IE; the part about douban is true.

Actually, the correct output of encodeURIComponent should be

> encodeURIComponent('%C8%B7%B6%A8')
  "%25C8%25B7%25B6%25A8"

Thus, on the server side, when an unquoted string contains non-ASCII bytes, you had better leave the string as it is (here, '%C8%B7%B6%A8').
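To make the behavior above concrete, here is a small sketch you can paste into a browser console or Node. It only uses the built-in encodeURIComponent and decodeURIComponent; the helper name tryDecode is made up for this illustration. It shows that encodeURIComponent always emits UTF-8 escapes, that an already-escaped GB2312 string gets its percent signs escaped, and that the GB2312 bytes cannot be decoded as UTF-8.

```javascript
// "确定" written with escapes so the snippet does not depend on file encoding.
const text = '\u786e\u5b9a';

// encodeURIComponent always produces UTF-8 escapes, regardless of page charset.
const utf8Escaped = encodeURIComponent(text); // '%E7%A1%AE%E5%AE%9A'

// An already-escaped GB2312 string only gets its '%' signs escaped.
const doubleEscaped = encodeURIComponent('%C8%B7%B6%A8'); // '%25C8%25B7%25B6%25A8'

// Hypothetical helper: returns null instead of throwing on malformed UTF-8.
function tryDecode(escaped) {
  try {
    return decodeURIComponent(escaped);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

// The GB2312 bytes are not valid UTF-8, so decoding them fails.
const decodedGb = tryDecode('%C8%B7%B6%A8'); // null

// One decode of the double-escaped form gives back the original escapes untouched.
const roundTrip = tryDecode(doubleEscaped); // '%C8%B7%B6%A8'

console.log(utf8Escaped, doubleEscaped, decodedGb, roundTrip);
```

So after a single urldecode the server holds the literal text '%C8%B7%B6%A8', which is the form suggested above to keep as it is.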

Also, on the client side you could check for values that contain %XX where XX is larger than 0x7F, and apply encodeURIComponent to them again. I'm not quite sure whether this goes against RFC 2396, though.
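One possible reading of that client-side check, as a rough sketch rather than a recommended implementation: the names HIGH_BYTE_ESCAPE, hasHighByteEscape and encodeForServer are invented here, and the extra encoding layer assumes the server knows it has to decode such values twice.

```javascript
// Matches %80 through %FF (first hex digit 8-F), in upper or lower case.
const HIGH_BYTE_ESCAPE = /%[89A-Fa-f][0-9A-Fa-f]/;

// True when the value already carries an escape for a byte above 0x7F.
function hasHighByteEscape(value) {
  return HIGH_BYTE_ESCAPE.test(value);
}

// Always encode once; add a second layer only for values with high-byte escapes.
function encodeForServer(value) {
  const once = encodeURIComponent(value);
  return hasHighByteEscape(value) ? encodeURIComponent(once) : once;
}

const legacyUrl = '/tag/%C8%B7%B6%A8';   // GB2312 escapes from a legacy link
const plainUrl = '/tag/hello%20world';   // only ASCII escapes

const legacyEncoded = encodeForServer(legacyUrl);
const plainEncoded = encodeForServer(plainUrl);

console.log(legacyEncoded); // '%252Ftag%252F%2525C8%2525B7%2525B6%2525A8'
console.log(plainEncoded);  // '%2Ftag%2Fhello%2520world'
```

Note that the check cannot tell GB2312 escapes from UTF-8 ones: a value such as '%E7%A1%AE' also matches, so this only protects the escapes from being decoded too early, it does not identify the charset.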

(Writing in English is tiring, but when in Rome, do as the Romans do.)

nice source, I'll check them out:) – lazycai, May 02 '12 at 07:13

cychoi · Answer 2 · 2014-11-18T03:24:59.457

Use escape() and then translate the characters to numeric character references before sending them to the server.

From MDN escape() reference:

The hexadecimal form for characters, whose code unit value is 0xFF or less, is a two-digit escape sequence: %xx. For characters with a greater code unit, the four-digit format %uxxxx is used.

Thus, it's easy to translate the output of escape() to numeric character reference by using a simple replace() statement:

escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, '&#x$1;');

Or, if your server-side language only supports decimal entities, use:

escape(input_value).replace(/%u([0-9a-fA-F]{4})/g, function(m0, m1) {
    return '&#' + parseInt(m1, 16) + ';';
});
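For a quick sanity check, the two replacements can be wrapped in small functions and run in a console or in Node. The function names toHexEntities and toDecimalEntities are just for this illustration, and keep in mind that escape() is a legacy function, so treat this as a sketch of the technique rather than a recommendation.

```javascript
// Hexadecimal numeric character references: %uXXXX -> &#xXXXX;
function toHexEntities(input) {
  return escape(input).replace(/%u([0-9a-fA-F]{4})/g, '&#x$1;');
}

// Decimal numeric character references: %uXXXX -> &#NNNNN;
function toDecimalEntities(input) {
  return escape(input).replace(/%u([0-9a-fA-F]{4})/g, function (match, hex) {
    return '&#' + parseInt(hex, 16) + ';';
  });
}

// "确定" written with escapes so the snippet does not depend on file encoding.
const hexForm = toHexEntities('\u786e\u5b9a');     // '&#x786E;&#x5B9A;'
const decForm = toDecimalEntities('\u786e\u5b9a'); // '&#30830;&#23450;'

// Code units of 0xFF or less keep the two-digit %xx form and are not converted.
const latinForm = toDecimalEntities('a b\u00e9');  // 'a%20b%E9'

console.log(hexForm, decForm, latinForm);
```

The last line is worth noticing: characters in the 0x80 to 0xFF range come out as a single %xx escape (not UTF-8), so the server side still has to decide how to interpret those.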

Example code in PHP

client.html (file encoding: GB2312):

<html>
  <head>
    <meta charset="gb2312">
    <script>
    function processForm(form) {
        console.log('BEFORE:', form.test.value);
        form.test.value = escape(form.test.value).replace(/%u(\w{4})/g, function(m0, m1) {
            return '&#' + parseInt(m1, 16) + ';';
        });
        console.log('AFTER:', form.test.value);
        return true;
    }
    </script>
  </head>
  <body>
    <form method="post" action="server.php" onsubmit="return processForm(this);">
      <input type="text" name="test" value="确定">
      <input type="submit">
    </form>
  </body>
</html>

server.php:

<?php
echo '<script>console.log("', 
     $_REQUEST['test'], ' --> ', 
     mb_decode_numericentity($_REQUEST['test'], array(0x80, 0xffff, 0, 0xffff), 'UTF-8'),
     '");</script>';
?>
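If you don't have PHP at hand, the decoding step can be approximated in JavaScript to see what the server is expected to produce. This is only a rough stand-in for the mb_decode_numericentity call above, not a port of it: the function name decodeDecimalEntities is made up, it handles decimal references only, and it mirrors the 0x80 to 0xffff range from the conversion map.

```javascript
// Convert decimal references in the range 0x80..0xFFFF back to characters;
// anything outside that range is left untouched, like the map in server.php.
function decodeDecimalEntities(input) {
  return input.replace(/&#(\d+);/g, function (match, digits) {
    const codePoint = parseInt(digits, 10);
    if (codePoint >= 0x80 && codePoint <= 0xffff) {
      return String.fromCharCode(codePoint);
    }
    return match;
  });
}

const decoded = decodeDecimalEntities('&#30830;&#23450;'); // '\u786e\u5b9a', i.e. "确定"
const untouched = decodeDecimalEntities('&#65;');           // '&#65;' (below 0x80)

console.log(decoded, untouched);
```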
