
I thought values entered in forms are properly encoded by browsers.

But this simple test file "test_get_vs_encodeuri.html" shows it's not true:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
   <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
   <title></title>
</head><body>

<form id="test" action="test_get_vs_encodeuri.html" method="GET" onsubmit="alert(encodeURIComponent(this.one.value));">
   <input name="one" type="text" value="Euro-€">
   <input type="submit" value="SUBMIT">
</form>

</body></html>

When I hit the submit button:

encodeURIComponent encodes the input value as "Euro-%E2%82%AC"

while the browser writes only "Euro-%80" into the GET query.

  1. Could someone explain?

  2. How do I encode everything the same way the browser's FORM does (windows-1252) using JavaScript? (The escape function does not work, and encodeURIComponent does not work either.)

Or is encodeURIComponent doing unnecessary conversions?

Marco Demaio

2 Answers


This is a character encoding issue. Your document uses the charset Windows-1252, where the € is at position 128 and is encoded as the single byte 0x80. But encodeURIComponent always encodes to UTF-8, whatever the document's charset: in Unicode the € is at position 8364 (U+20AC), which UTF-8 encodes as the three bytes 0xE2 0x82 0xAC.

A solution would be to use UTF-8 for your document as well. Alternatively, write a mapping that converts each character to its Windows-1252 byte before percent-encoding it.
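
The mapping approach could look roughly like the sketch below. `encodeCp1252Component` and `CP1252_HIGH` are hypothetical names, not built-ins. ASCII characters are passed through `encodeURIComponent`, so a space becomes `%20` rather than the `+` a form submission uses, and throwing on characters that Windows-1252 cannot represent is an assumption of this sketch, not what browsers do.

```javascript
// Hypothetical helper (not a built-in): percent-encode a string using
// Windows-1252 bytes instead of UTF-8.

// Unicode code point -> Windows-1252 byte for the 0x80-0x9F range,
// the only range where Windows-1252 differs from ISO-8859-1.
var CP1252_HIGH = {
  0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
  0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
  0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
  0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
  0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
  0x017E: 0x9E, 0x0178: 0x9F
};

function encodeCp1252Component(str) {
  var out = "";
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (code < 0x80) {
      // ASCII: same bytes in both encodings, so reuse the built-in rules.
      out += encodeURIComponent(str.charAt(i));
      continue;
    }
    // 0xA0-0xFF map to themselves; 0x80-0x9F need the lookup table.
    var byte = (code >= 0xA0 && code <= 0xFF) ? code : CP1252_HIGH[code];
    if (byte === undefined) {
      // Assumption of this sketch: fail loudly on unmappable characters.
      throw new Error("Not representable in Windows-1252: U+" +
                      code.toString(16).toUpperCase());
    }
    out += "%" + byte.toString(16).toUpperCase();
  }
  return out;
}

// encodeCp1252Component("Euro-\u20AC") returns "Euro-%80"
```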

Tomalak
Gumbo
  • @Gumbo: thanks, I understand now. But this makes me think of another question that I already asked: what is this damn encodeURIComponent useful for? I mean the value encoded by the FORM cannot be wrong even if I use cp1252, so why should I use this damn encodeURIComponent to encode a URI? Can't I just use a simple JS escape function that returns values identical to the ones encoded by the FORM? I know it might not be nice, but in the end I prefer to encode things exactly like a browser's FORM would do. http://stackoverflow.com/questions/2238515/encodeuricomponent-is-really-useful – Marco Demaio Apr 11 '10 at 10:24
  • @Marco Demaio: `escape` has a different format: `escape("€")==="%u20AC"`. And as for the purpose of `encodeURIComponent`: Imagine you want to build a URI that contains a `&` in a value (like `bar&baz`). `"…?foo=bar&baz"` would yield two arguments (*foo* and *baz*) because `&` is a special character. But `"…?foo="+encodeURIComponent("bar&baz")` will do it. – Gumbo Apr 11 '10 at 10:50
  • Sorry, I didn't explain properly, and I talked rubbish. I know I have to encode '&' characters in a GET component, but how do I encode everything the same way the FORM does with cp1252 using JS? Using escape is not the way, but using encodeURIComponent is not the way either because € is encoded differently. Is there any function in JS to do that? Sorry, I also updated the question. – Marco Demaio Apr 11 '10 at 11:06
  • @Marco Demaio: As `encodeURIComponent` always produces UTF-8 escape sequences, you will need to write your own encoding function. – Gumbo Apr 11 '10 at 11:51
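
The `&` case from the comments above can be checked with a small sketch. `parseQuery` is a hypothetical, deliberately naive parser written only for this illustration (it ignores `+` handling and values containing `=`):

```javascript
// Hypothetical, naive query-string parser used only for illustration.
function parseQuery(query) {
  var result = {};
  var pairs = query.replace(/^\?/, "").split("&");
  for (var i = 0; i < pairs.length; i++) {
    var parts = pairs[i].split("=");
    result[decodeURIComponent(parts[0])] = decodeURIComponent(parts[1] || "");
  }
  return result;
}

var value = "bar&baz";
var raw = "?foo=" + value;                          // "?foo=bar&baz"
var encoded = "?foo=" + encodeURIComponent(value);  // "?foo=bar%26baz"

// The unencoded "&" splits the value into two arguments: foo and baz.
var rawArgs = parseQuery(raw);          // { foo: "bar", baz: "" }
// The encoded form keeps the whole value in foo.
var encodedArgs = parseQuery(encoded);  // { foo: "bar&baz" }
```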

I think the root of the problem is character encodings. If I mess around with charset in the meta tag and save the file with different encodings I can get the page to render in the browser like this:

[Image: content encoding issue (source: boogdesign.com)]

That € looks a lot like what you're getting from encodeURIComponent. However, I could find no combination of encodings that made any difference to what encodeURIComponent returned. I can change what the GET query contains, though. This is your original page; submitting it gives a URL like:

test-get-vs-encodeuri.html?one=Euro-%80

This is a UTF-8 version of the page; submitting it gives a URL that looks like this (in Firefox):

http://www.boogdesign.com/examples/encode/test-get-vs-encodeuri-utf8.html?one=Euro-€

But if I copy and paste it I get:

http://www.boogdesign.com/examples/encode/test-get-vs-encodeuri-utf8.html?one=Euro-%E2%82%AC

So it looks like if the page is UTF-8 then the GET and encodeURIComponent match.

Glorfindel
robertc
  • encodeURIComponent always assumes UTF-8. From http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf : 15.1.3.4 encodeURIComponent (uriComponent) The encodeURIComponent function computes a new version of a URI in which each instance of certain characters is replaced by one, two or three escape sequences representing the UTF-8 encoding of the character. – Mike Samuel Sep 28 '10 at 00:34
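
The "one, two or three escape sequences" wording can be observed directly. A minimal sketch, where `countEscapes` is a hypothetical helper name; characters outside the Basic Multilingual Plane (stored as a surrogate pair in JavaScript) produce four sequences, a case the quoted edition of the spec does not mention:

```javascript
// Hypothetical helper: count the %XX escape sequences that
// encodeURIComponent emits for a string.
function countEscapes(str) {
  var matches = encodeURIComponent(str).match(/%[0-9A-F]{2}/g);
  return matches ? matches.length : 0;
}

var ampersand = encodeURIComponent("&");        // "%26"        (1 UTF-8 byte)
var eAcute    = encodeURIComponent("\u00E9");   // "%C3%A9"     (2 UTF-8 bytes)
var euro      = encodeURIComponent("\u20AC");   // "%E2%82%AC"  (3 UTF-8 bytes)

// Unreserved characters such as letters are not escaped at all.
var plain = countEscapes("a");                  // 0
```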