
I am working on an Android app that connects to a web page using the Java class HttpsURLConnection and parses the HTML response with JSoup. The problem is that the HTML response from the website appears to be encoded. Any ideas on how I can get the actual HTML?

Here is my code for contacting the website:

private String GetPageContent(String url) throws Exception {

        URL obj = new URL(url);
        conn = (HttpsURLConnection) obj.openConnection();

        // default is GET
        conn.setRequestMethod("GET");

        conn.setUseCaches(false);

        // act like a browser
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Accept",
                "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        conn.setRequestProperty("Accept-Language", "en-US,en;q=0.8,en-GB;q=0.6");
        conn.setRequestProperty("Accept-Encoding" , "gzip, deflate, sdch");
        conn.setRequestProperty("Connection" , "keep-alive");

        if (cookies != null) {
            for (String cookie : this.cookies) {
                // A limit of 2 keeps only the name=value pair; a limit of 1 returns the whole string unsplit
                conn.addRequestProperty("Cookie", cookie.split(";", 2)[0]);
            }
        }
        int responseCode = conn.getResponseCode();
        Log.v(TAG,"\nSending 'GET' request to URL : " + url);
        Log.v(TAG,"Response Code : " + responseCode);

        BufferedReader in = new BufferedReader(new InputStreamReader(
                conn.getInputStream()));
        String inputLine;
        StringBuffer response = new StringBuffer();

        while ((inputLine = in.readLine()) != null) {
            response.append(inputLine);
        }
        in.close();

        // Get the response cookies
        setCookies(conn.getHeaderFields().get("Set-Cookie"));

        return response.toString();

    }

And a snippet of the response:

��������������]�r�6��۞�w@ՙ�NDQ�ﱥ|�siv�Kkw�m&�HH�M,  Z��ff_c_o�d�@���9�l�6����� �_=w|����/A{��!W� LZ��������f]�=wc߽�2,˨�|�8x��~�}�x1�$Ib�Uq�7�j�X|;��K

EDIT: The HTML was encoded with GZIP, as shown in the request headers here.

The solution to this issue was to use the GZIPInputStream class as shown below:

BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(conn.getInputStream())));
Vadim Kotov
TheFlanman91
  • It is possible that the input stream uses a different character set than the default one. You might want to try some of the character sets as explained in the [documentation](http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.Charset)) – TmKVU Jul 03 '15 at 11:29
  • I just changed the InputStream to UTF-8, which seems to be what the website is using, as seen in its response headers here: http://pasteboard.co/1Gi6ZjY8.png. But to no avail; I am still getting the same encoded response – TheFlanman91 Jul 03 '15 at 11:44

2 Answers


Based on the headers returned with the response, we can conclude that the content is compressed with gzip. Luckily, there is an easy way to decode a gzip-encoded stream: wrap it in the GZIPInputStream class.
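As a minimal sketch of that idea, the helper below wraps the stream in a GZIPInputStream only when the Content-Encoding value says gzip, and reads the text with an explicit charset. The class name ResponseDecoder and its method names are hypothetical and not part of the original code; the charset is assumed to be known by the caller (UTF-8 in the question's case).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;

// Hypothetical helper for illustration; not part of the original code.
final class ResponseDecoder {

    // Returns a stream that yields the uncompressed bytes.
    // Only gzip is handled here; any other value is passed through untouched.
    static InputStream decompress(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Reads the whole body as text, decompressing first if needed.
    // Each line is terminated with '\n' so line breaks are not lost.
    static String readBody(InputStream raw, String contentEncoding, Charset charset) throws IOException {
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(decompress(raw, contentEncoding), charset))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }
}
```

With the connection from the question, this could be called as ResponseDecoder.readBody(conn.getInputStream(), conn.getContentEncoding(), StandardCharsets.UTF_8). Since the sketch only understands gzip, it would make sense to advertise only "gzip" in the Accept-Encoding request header. On Android, it may also be enough to not set Accept-Encoding at all, because the platform's HttpURLConnection is documented to request and decompress gzip transparently unless that header is set by hand.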

TmKVU

I don't know which URL you are trying to access, but have you tried setting the charset?

BufferedReader in = new BufferedReader(new InputStreamReader(
            conn.getInputStream(), "UTF8"));
dstibbe
  • Hi, yes I just tried it and it didn't work. I'm not sure if I'm fully correct here but should it be "UTF-8" instead of "UTF8" or does that make a difference? Is there anything else I should have included in the question that would make it easier to assess? – TheFlanman91 Jul 03 '15 at 11:50
  • @TheFlanman91 the URL you are trying to get would be helpful. – dstibbe Jul 03 '15 at 12:04