Can we specify the character encoding of parameters in a POST request with an application/x-www-form-urlencoded content type in an API (e.g. a RESTful web service), and if so, how?

The parameters will be encoded according to the algorithm specified here: URL-encoded form data

Before strings can be percent-encoded (which operates on bytes), they need to be represented as a sequence of bytes in a particular character encoding.
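To illustrate why the character encoding matters, here is a minimal sketch using Python's standard `urllib.parse.urlencode`. The parameter name and value are made up for the example. The same string yields different percent-encoded output depending on the encoding applied first, and nothing in the output says which one was used.

```python
from urllib.parse import urlencode

# Hypothetical parameter, for illustration only.
params = {"name": "café"}

# Same parameter, two character encodings, two different byte sequences.
as_utf8 = urlencode(params, encoding="utf-8")
as_latin1 = urlencode(params, encoding="iso-8859-1")

print(as_utf8)    # name=caf%C3%A9
print(as_latin1)  # name=caf%E9
```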

For Forms, this character encoding can be determined by the Form attributes sent from the server, for example through a hidden _charset_ entry in the form data set or an accept-charset attribute.
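As a rough sketch of how a receiver could honor a `_charset_` entry, assuming the body arrives as an ASCII string: the helper name `decode_form` and the sample body are hypothetical, and a real server framework may handle this differently (or not at all).

```python
from urllib.parse import parse_qs

def decode_form(body: str, default: str = "utf-8") -> dict:
    """Decode a form body, honoring a `_charset_` field if one is present."""
    # First pass: ISO-8859-1 maps every byte to a character, so decoding cannot
    # fail, and the (ASCII) `_charset_` value can be read safely.
    probe = parse_qs(body, encoding="iso-8859-1")
    charset = probe.get("_charset_", [default])[0]
    # Second pass: decode the percent-encoded bytes with the declared encoding.
    # An unknown encoding name raises LookupError; invalid bytes raise
    # UnicodeDecodeError because errors="strict".
    return parse_qs(body, encoding=charset, errors="strict")

fields = decode_form("_charset_=iso-8859-1&name=caf%E9")
print(fields["name"])  # ['café']
```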

However, since an API request doesn't have a corresponding Form, we cannot deduce the character encoding which is accepted/desired by the server.

It seems the only reasonable encoding is UTF-8. This is the default encoding when no character encoding can be determined from the Form.

(Related question, but not a duplicate)
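For the client side, a minimal sketch of an API request that commits to UTF-8 could look like this. The URL and parameters are hypothetical, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and parameters, for illustration only.
url = "https://api.example.com/items"
params = {"title": "Ünïcode", "q": "a b"}

# urlencode() percent-encodes the UTF-8 bytes of each string,
# so the resulting body is pure ASCII.
body = urlencode(params, encoding="utf-8").encode("ascii")

request = Request(
    url,
    data=body,
    method="POST",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(body)  # b'title=%C3%9Cn%C3%AFcode&q=a+b'
```

Whether the server actually decodes the body as UTF-8 is still an assumption on the client's part; the body itself carries no marker for it.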

CouchDeveloper
  • possible duplicate of [Please help me trace how charsets are handled every step of the way](http://stackoverflow.com/questions/1542107/please-help-me-trace-how-charsets-are-handled-every-step-of-the-way) – Paul Sweatte Oct 18 '13 at 16:54
  • @PaulSweatte Sorry, this is not an answer to the above problem, which is specifically about the *character encoding* that was applied to the parameter strings in order to create a byte stream which can be *percent-encoded* for use in an entity body with content type `application/x-www-form-urlencoded`. The problem is that the information about which character encoding was used is lost in the final encoded byte stream, and there is no way to tell the receiver what it was. – CouchDeveloper Oct 18 '13 at 17:45
  • Can't you just use a custom request header to tell the receiver what it was? – Paul Sweatte Oct 18 '13 at 18:20
  • This would require custom server-side code which checks the custom header. Given that the `application/x-www-form-urlencoded` content type is widespread and ubiquitous, it must work somehow ;) Well, I suspect servers will likely just assume UTF-8, which is reasonable. – CouchDeveloper Oct 18 '13 at 19:00
  • One other approach would be to add a charset parameter to the content type, e.g.: `Content-Type: application/x-www-form-urlencoded; charset=utf-8` -- however, it is explicitly stated that the application/x-www-form-urlencoded Content-Type does not have any parameters, and parameters will be ignored. – CouchDeveloper Oct 18 '13 at 19:03
  • Another alternative would be a [multipart document](http://developer.yahoo.com/performance/rules.html#multipart). Both the unencoded data and the encoded data can be sent in [one request](http://msdn.microsoft.com/en-us/library/exchange/ms988645). – Paul Sweatte Oct 19 '13 at 00:07

2 Answers


Use one of the following solutions:

Paul Sweatte
  • An HTTP message with a "multipart/form-data" body, where each parameter is represented as a separate part, does indeed let us specify the content type and (possibly) the encoding of each MIME entity (that is, each parameter value). While the answer is useful, I'm still searching for answers like "You can't because ...", or "Don't worry if you use UTF-8 character encoding ...", or "That depends on the server ..." or "Usually, servers will assume UTF-8, but ..." ;) – CouchDeveloper Nov 06 '13 at 09:51

Unfortunately, browsers are still VERY stupid when sending data. This was an issue when JavaEE 5 was relevant, and still is today -- you can check your browser's form submission data and you will see that it does not contain ANY charset encoding information!

Read https://docs.oracle.com/cd/E19316-01/819-3669/bnayd/index.html.

For that reason, the server-side code decoding the form data must magically know the encoding in most cases. The simple solution is to specify accept-charset=iso-8859-1 on the form if one does not need UTF-8. Otherwise, specify accept-charset=utf-8 and make sure the server-side code decoding the form data assumes UTF-8 by default. I wonder how hard it can be to add a charset encoding parameter to the browser's request and make it mandatory to specify it. This is probably the dumbest thing I have ever seen.
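A server that "assumes UTF-8 by default" might be sketched like this in Python. The function name `parse_form_utf8` is hypothetical and not part of any framework. It decodes strictly, so a body that was encoded with something else is rejected rather than silently garbled.

```python
from urllib.parse import parse_qs

def parse_form_utf8(body: str) -> dict:
    """Decode a form body as UTF-8 and refuse anything that is not valid UTF-8."""
    try:
        return parse_qs(body, keep_blank_values=True,
                        encoding="utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # The percent-decoded bytes are not valid UTF-8, so the client
        # most likely used a different encoding (e.g. ISO-8859-1).
        raise ValueError("form data is not valid UTF-8") from exc

print(parse_form_utf8("name=caf%C3%A9"))  # {'name': ['café']}
```

Passing a body such as `name=caf%E9` (ISO-8859-1 bytes) raises `ValueError` here instead of producing a replacement character.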

See also HttpServletRequest - setCharacterEncoding seems to do nothing

user1050755