Receiving encoded response to HttpsURLConnection GET request

Question

I am working on an Android app that connects to a web page using the Java class HttpsURLConnection and parses the HTML response using JSoup. The problem is that the HTML response from the website appears to be encoded. Any ideas on what I can do to get the actual HTML?

Here is my code for contacting the website:

private String GetPageContent(String url) throws Exception {

        URL obj = new URL(url);
        conn = (HttpsURLConnection) obj.openConnection();

        // default is GET
        conn.setRequestMethod("GET");

        conn.setUseCaches(false);

        // act like a browser
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.8,en-GB;q=0.6");
        conn.setRequestProperty("Accept-Encoding" , "gzip, deflate, sdch");
        conn.setRequestProperty("Connection" , "keep-alive");

        if (cookies != null) {
            for (String cookie : this.cookies) {
                // keep only the name=value pair; a split limit of 1 would return the whole string
                conn.addRequestProperty("Cookie", cookie.split(";", 2)[0]);
            }
        }
        int responseCode = conn.getResponseCode();
        Log.v(TAG,"\nSending 'GET' request to URL : " + url);
        Log.v(TAG,"Response Code : " + responseCode);

        BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream()));
        String inputLine;
        StringBuffer response = new StringBuffer();

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();

        // Get the response cookies
        setCookies(conn.getHeaderFields().get("Set-Cookie"));

        return response.toString();

    }

And a snippet of the response:

��������������]�r�6��۞�w@ՙ�NDQ�ﱥ|�siv�Kkw�m&�HH�M,  Z��ff_c_o�d�@���9�l�6����� �_=w|����/A{��!W� LZ��������f]�=wc߽�2,˨�|�8x��~�}�x1�$Ib�Uq�7�j�X|;��K

EDIT: The HTML was compressed with gzip, as shown in the response headers.

The solution to this issue was to use the GZIPInputStream class as shown below:

BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(conn.getInputStream())));

The input stream may be using a different character set than the default one. You might want to try some of the character sets described in the [documentation](http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)) – TmKVU, Jul 03 '15 at 11:29
I just changed the InputStream to UTF-8, which seems to be what the website is using, as seen in its response headers here: http://pasteboard.co/1Gi6ZjY8.png. But to no avail. I am still getting the same encoded response. – TheFlanman91, Jul 03 '15 at 11:44

score 1 · Accepted Answer · answered Jul 03 '15 at 11:51

Based on the response headers, we can conclude that the content is gzip-encoded. Luckily, there is an easy way to decode a gzip-encoded stream: wrap it in a GZIPInputStream.
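As a minimal sketch of that idea, the body stream can be wrapped only when the response actually reports gzip, so an uncompressed response is not broken by an unconditional GZIPInputStream. The class name ResponseDecoder and the helpers decode and readAll are hypothetical names used for illustration, not part of the original code, and the sketch assumes the body is UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class ResponseDecoder {

    // Hypothetical helper: wraps the raw body stream when the Content-Encoding
    // header value is "gzip"; any other value (or null) is passed through as-is.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Hypothetical helper: reads the whole stream and decodes it as UTF-8
    // (assumption: the page is served as UTF-8).
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Simulate a gzip-compressed response body without touching the network.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("<html>ok</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream body = decode(new ByteArrayInputStream(compressed.toByteArray()), "gzip");
        System.out.println(readAll(body));
    }
}
```

With a real connection, the call would look like decode(conn.getInputStream(), conn.getContentEncoding()), since URLConnection exposes the Content-Encoding header through getContentEncoding().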

TmKVU


score 0 · Answer 2 · answered Jul 03 '15 at 11:41

I don't know which URL you are trying to access, but have you tried setting the charset?

BufferedReader in = new BufferedReader(new InputStreamReader(
            conn.getInputStream(), "UTF8"));

dstibbe


Hi, yes, I just tried it and it didn't work. I'm not sure if I'm correct here, but should it be "UTF-8" instead of "UTF8", or does that make no difference? Is there anything else I should have included in the question that would make it easier to assess? – TheFlanman91 Jul 03 '15 at 11:50
@TheFlanman91 The URL you are trying to get would be helpful. – dstibbe Jul 03 '15 at 12:04
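
On the "UTF-8" versus "UTF8" question raised in the comments: a quick check suggests that, at least on a standard JDK, both names resolve to the same charset, so the spelling is unlikely to explain the unreadable output. The class name CharsetAliasCheck and the method sameCharset are hypothetical; behaviour on a particular Android runtime is assumed to match, not verified here.

```java
import java.nio.charset.Charset;

class CharsetAliasCheck {

    // Hypothetical helper: true when both names resolve to the same Charset.
    // Charset.forName throws UnsupportedCharsetException for unknown names.
    static boolean sameCharset(String first, String second) {
        return Charset.forName(first).equals(Charset.forName(second));
    }

    public static void main(String[] args) {
        // Canonical name that the "UTF8" alias resolves to
        System.out.println(Charset.forName("UTF8").name());
        // Whether the alias and the canonical name are the same charset
        System.out.println(sameCharset("UTF8", "UTF-8"));
    }
}
```

Either spelling can therefore be passed to InputStreamReader, but no charset choice makes compressed bytes readable. The stream has to be decompressed first.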
