Encoding ignored while reading InputStream

Question

I'm having some encoding problems in a Java application that makes HTTP requests to an IIS server.

Iterating over the headers of the URLConnection object I can see the following (relevant) headers:

Transfer-Encoding: [chunked]
Content-Encoding: [utf-8]
Content-Type: [text/html; charset=utf-8]

The URLConnection.getContentEncoding() method returns utf-8 as the document encoding.

This is how my HTTP request, and stream read is being made:

    OutputStreamWriter sw = null;
    BufferedReader br = null;
    char[] buffer = null;
    URL url;
    url = new URL(this.URL);
    URLConnection connection = url.openConnection();
    connection.setDoOutput(true);
    sw = new OutputStreamWriter(connection.getOutputStream());
    sw.write(postData);
    sw.flush();
    br = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF8"));
    StringBuilder totalResponse = new StringBuilder();
    String line;

    while ((line = br.readLine()) != null) {
        totalResponse.append(line);
    }
    buffer = totalResponse.toString().toCharArray();
    if (sw != null)
        sw.close();

    if (br != null)
        br.close();

    return buffer;

However, the string "ÃÃÃção" sent by the server is received by the client as "��o".

What am I doing wrong?

Thanks @Tirath for the reply. I've changed UTF8 to UTF-8 as an argument for the InputStreamReader constructor, but the result was the same. — guanabara, Oct 02 '14 at 09:47
Are you sure your content is **actually** UTF-8 encoded? Headers can lie. Also, have you tried debugging `totalResponse.toString()`? If that equals `"ÃÃÃção"`, then your issue may be further down the line, when operating on the `char[]`... — Mena, Oct 02 '14 at 09:50
Thanks @Mena, how can I **actually** verify the content encoding? Using `byte[] foo = String.valueOf(totalResponse.toString()).getBytes(); System.out.println(new String(foo, "utf-8"));` gives the exact same result. — guanabara, Oct 02 '14 at 09:56
May not be related, but you should also set an explicit encoding when you create the `OutputStreamWriter` - at the moment you're sending the post data in whatever is the default encoding on your platform, which may not be what the server expects. — Ian Roberts, Oct 02 '14 at 09:57
@guanabara there is no certain way to infer encoding as far as I know, this is typically something known in advance. If the content comes from your `OutputStream`, then you should follow Ian Roberts' advice. Worst case scenario you might be in for some good old trial and error. Although most common encodings are UTF-8 and ISO Latin 1. — Mena, Oct 02 '14 at 10:00
@IanRoberts, @Mena the result is the same even when setting the `OutputStreamWriter` charset name as "UTF-8". — guanabara, Oct 02 '14 at 10:06
@guanabara just making sure here. It's `"UTF-8"`. Not `"UTF8"`. Not `"utf-8"`. — Mena, Oct 02 '14 at 10:12
To be honest I'm not sure I'd put any trust in a server that claimed `Content-Encoding: utf-8` - the `Content-Encoding` header has nothing to do with character sets, it's for things like on-the-fly compression and if it's present at all then it should be something like `Content-Encoding: gzip` — Ian Roberts, Oct 02 '14 at 10:18
@IanRoberts, setting `Content-Encoding` to utf8 was my mistake. I misunderstood the purpose of the header and have already removed it. @Mena, both the server side and the client side are now using 'UTF-8' as the charset, and no other variation (utf8, UTF8, etc.). — guanabara, Oct 02 '14 at 10:27
@guanabara actually it's slightly more complicated, as character encodings for HTML are not defined **exactly** the same way as Java. See specifications [here](http://www.w3schools.com/charsets/default.asp) and [here](http://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html). It's also worth noting that URL encoding of extended characters can be very useful when dealing with IE (through JavaScript's `encodeURIComponent`, etc.). — Mena, Oct 02 '14 at 10:39

score 1 · Answer 1 · answered Oct 03 '14 at 05:28

Based on your comments, you are trying to receive a FIX message from an IIS server, and FIX uses ASCII. Only a small subset of tags supports other encodings, and those tags have to be treated in a special manner (the non-ASCII tags in the standard FIX spec are 349, 351, 353, 355, 357, 359, 361, 363, 365). If such tags are present, you will get a tag 347 whose value specifies the encoding (for example UTF-8), and each encoded tag will be preceded by a tag giving the length of the upcoming encoded value (for tag 349, you will always get 348 first, with an integer value).

In your case, it looks like the server is sending a custom tag 10411 (the 10xxx range) in some other encoding. By convention, the preceding tag 10410 should give you the length of the value in 10411, but it contains "0000" instead, which may have some other meaning.

Note that although FIX messages are very readable, they should still be treated as binary data. Tags and values are mostly ASCII characters, but the delimiter (SOH) is 0x01 and, as mentioned above, certain tags may use another encoding. The IIS service should really return the data as application/octet-stream so it can be received properly. Attempting to return it as text/html is asking for trouble :).
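
To make the "treat it as binary" advice concrete, here is a minimal sketch that reads the response as raw bytes, splits it at each SOH byte, and only then decodes individual values. The class `RawFixReader` and its method names are hypothetical helpers made up for illustration, and which charset a given custom tag uses is an assumption the caller has to supply. The sketch also ignores length-prefixed fields, whose values could themselves contain an SOH byte.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of any FIX library.
class RawFixReader {
    static final byte SOH = 0x01;

    /** Reads the whole stream as bytes; no charset is applied at this stage. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Splits the message into raw "tag=value" fields at each SOH byte. */
    static List<byte[]> splitFields(byte[] message) {
        List<byte[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < message.length; i++) {
            if (message[i] == SOH) {
                if (i > start) {
                    fields.add(Arrays.copyOfRange(message, start, i));
                }
                start = i + 1;
            }
        }
        if (start < message.length) {
            fields.add(Arrays.copyOfRange(message, start, message.length));
        }
        return fields;
    }

    /** Returns the index of the first '=' in the field, or -1 if there is none. */
    static int indexOfEquals(byte[] field) {
        for (int i = 0; i < field.length; i++) {
            if (field[i] == '=') {
                return i;
            }
        }
        return -1;
    }

    /** The tag number before '=' is plain ASCII. */
    static String tagOf(byte[] field) {
        int eq = indexOfEquals(field);
        int len = eq < 0 ? field.length : eq;
        return new String(field, 0, len, StandardCharsets.US_ASCII);
    }

    /** Decodes the value after '=' with a charset chosen by the caller (an assumption). */
    static String valueOf(byte[] field, Charset valueCharset) {
        int eq = indexOfEquals(field);
        if (eq < 0) {
            return "";
        }
        return new String(field, eq + 1, field.length - eq - 1, valueCharset);
    }
}
```

With this shape, `readAll(connection.getInputStream())` replaces the `InputStreamReader`, and the raw bytes of a suspicious field can be inspected before any charset is applied to them.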

You are correct. This is a custom message protocol, based on FIX. Setting the `Content-Type` to `application/octet-stream` gives the same result (��o for ÃÃÃção). Thanks for your reply. — guanabara, Oct 03 '14 at 07:53

score 0 · Answer 2 · answered Oct 02 '14 at 11:32

If the server really sends a Content-Encoding of "UTF-8" then it is very confused. See http://svn.tools.ietf.org/svn/wg/httpbis/specs/rfc7231.html#header.content-encoding
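
For contrast: `Content-Encoding` names content codings such as gzip, while the character set travels as the `charset` parameter of `Content-Type`. A minimal sketch of picking that parameter out of a header value follows. `ContentTypeCharset` and `charsetOf` are hypothetical names, the parsing is deliberately naive (it does not handle every quoting rule), and the fallback charset is whatever the caller assumes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; simplified header parsing.
class ContentTypeCharset {

    /** Returns the charset named in a Content-Type value, or the fallback. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Case-insensitive match on the "charset=" prefix.
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8", StandardCharsets.ISO_8859_1));
    }
}
```

On the client this could be fed with `connection.getContentType()`. `getContentEncoding()` is the wrong place to look for a charset.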

Julian Reschke

the Content-Encoding tag was my mistake. It's not included now. – guanabara Oct 02 '14 at 11:49

score 0 · Answer 3 · answered Oct 02 '14 at 12:11

For good order, a couple of corrections.

    URLConnection connection = url.openConnection();
    connection.setDoOutput(true);
    connection.connect();
    try (Writer sw = new OutputStreamWriter(connection.getOutputStream(),
                StandardCharsets.UTF_8)) {
        sw.write(postData);
        sw.flush();

        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(connection.getInputStream(),
                StandardCharsets.UTF_8))) {
            StringBuilder totalResponse = new StringBuilder();
            String line;
            while ((line = br.readLine()) != null) {
                totalResponse.append(line).append("\r\n");
            }
            return totalResponse.toString().toCharArray();
        } // Close br.
    } // Close sw.

Maybe:

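// Accept-Charset is a request header, so it is set on the connection
// (before connect() or getOutputStream()) rather than inside postData.
connection.setRequestProperty("Accept-Charset", "utf-8");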

At this point totalResponse.toString() should contain the response read correctly.

But when the text is displayed, the String/char data is converted to bytes again, and that is where the encoding can fail. For instance, System.out.println will not do, as it probably uses the Windows default encoding.
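
A minimal sketch of taking the platform default out of the picture when printing: wrap the target stream in a `PrintStream` with an explicit encoding. `Utf8Output` and `utf8Stream` are hypothetical names. Whether the console then renders the bytes correctly still depends on the console's own code page, which this code cannot control.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper for illustration.
class Utf8Output {

    /** Wraps a stream so that printed text is always encoded as UTF-8. */
    static PrintStream utf8Stream(OutputStream target) {
        try {
            // autoflush = true, explicit encoding instead of the platform default
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is one of the charsets every JVM must support.
            throw new IllegalStateException("UTF-8 not available", e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Unicode escapes keep this source file independent of its own encoding.
        out.println("\u00C3\u00C3\u00C3\u00E7\u00E3o");
    }
}
```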

You can test the String by dumping its bytes:

String s = totalResponse.toString();
// java.util.logging has Level.INFO; there is no Level.INFORMATION.
Logger.getLogger(getClass().getName()).log(Level.INFO, "{0}",
    Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

In some rare cases the font will not contain the special characters.
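
When reading such a byte dump, the triple -17, -65, -67 (EF BF BD) is the UTF-8 encoding of U+FFFD, the Unicode replacement character, which a decoder substitutes for input it cannot decode. The sketch below shows one way this can arise, under the assumption (not confirmed anywhere in this thread) that the bytes on the wire are ISO-8859-1 while the client decodes them as UTF-8. `ReplacementCharDemo` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration.
class ReplacementCharDemo {

    /** Decodes bytes as UTF-8; malformed input becomes U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /** Counts occurrences of the replacement character U+FFFD. */
    static int countReplacementChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Unicode escapes for the sample string, so the source encoding does not matter.
        String original = "\u00C3\u00C3\u00C3\u00E7\u00E3o";

        // Assumed scenario: sender encodes as ISO-8859-1, receiver decodes as UTF-8.
        byte[] latin1 = original.getBytes(StandardCharsets.ISO_8859_1);
        String mismatched = decodeAsUtf8(latin1);
        System.out.println("replacement chars (mismatch): " + countReplacementChars(mismatched));

        // Control: a UTF-8 round trip loses nothing.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + original.equals(decodeAsUtf8(utf8)));
    }
}
```

Once U+FFFD is in the String, the original bytes are gone, so the check has to happen on the raw bytes before decoding, for example with a monitoring proxy or by reading the `InputStream` directly.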

Joop Eggen

thanks for the reply. The result was the same as before. For the following string: `INFO: 8=FIX.4.29=3335=DRCFG10410=000010411=��o10=000 \0` i get the following bytes: `INFO: [56, 61, 70, 73, 88, 46, 52, 46, 50, 1, 57, 61, 51, 51, 1, 51, 53, 61, 68, 82, 67, 70, 71, 1, 49, 48, 52, 49, 48, 61, 48, 48, 48, 48, 1, 49, 48, 52, 49, 49, 61, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, -17, -65, -67, 111, 1, 49, 48, 61, 48, 48, 48, 1, 13, 10, 92, 48]` – guanabara Oct 02 '14 at 13:14
If you look at the bytes after `411=` (52, 49, 49, 61, that is '4', '1', '1', '='), you see a repetition of five **identical** multi-byte sequences: -17, -65, -67, i.e. EF BF BD, the UTF-8 encoding of `U+FFFD`, the Unicode **replacement character**. Since UTF-8 can represent every Unicode character, the replacement must have happened in an earlier conversion, say from UTF-8 to a more limited encoding. Definitely on the IIS side, unless the data stems from a round-trip from the client. – Joop Eggen Oct 02 '14 at 13:31
So if I've understood you correctly, you're telling me IIS is sending the data in another charset, and it is being converted in the process? – guanabara Oct 02 '14 at 13:40
The IIS at some point converts wrongly to non-UTF-8 (introducing replacement chars), and finally delivers in UTF-8. As sanity check, maybe query the same thing in a browser. – Joop Eggen Oct 02 '14 at 14:07
Thanks again! This service is consumed by several components (Objective-C, C#, JS) and the only one with encoding problems is this one in Java. On the server side, I can see the message being sent correctly. Any more thoughts on what is happening? – guanabara Oct 02 '14 at 14:39
One thought would be switching to Apache HttpClient. Or intercepting the communication using a monitoring proxy. What about the postData? Did you try it with UTF-8? – Joop Eggen Oct 02 '14 at 15:01
thanks. I'll try the Apache HttpClient and post the results soon. – guanabara Oct 03 '14 at 07:55
Are you using a Tomcat server or something similar? Tomcat has a setting in its server.xml file for specifying UTF-8 explicitly. Please let me know. – Sushant Tambare Oct 07 '14 at 11:11

score 0 · Answer 4 · answered Oct 07 '14 at 06:22

Can you try putting the stream in a request attribute and then printing it out on the client side? A request attribute will be received as is, without any encoding issues.

user3271891
