Java - Read page source from url returns unknown characters

Question

I am using the code below to read a page's source from a URL (https://www.amazon.com) with the "UTF-8" charset in NetBeans, but it returns unknown characters (shown in the attached image). I have no idea what the problem is and would be grateful for help modifying the code so that it works properly. Thanks.

public static String getURLSource(String url) throws IOException
{
    URL urlObject = new URL(url);
    URLConnection urlConnection = urlObject.openConnection();
    urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

    return toString(urlConnection.getInputStream());
}

private static String toString(InputStream inputStream) throws IOException
{
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
    {
        String inputLine;
        StringBuilder stringBuilder = new StringBuilder();
        while ((inputLine = bufferedReader.readLine()) != null)
        {
            stringBuilder.append(inputLine);
        }

        return stringBuilder.toString();
    }
}

I believe you're seeing the encrypted payload. You need to use some sort of an HTTPS client to handle the exchange of keys, the validation of the server's cert and - most importantly - the decoding of the stream. — David, Feb 04 '19 at 15:15
@skomisa thanks for the answer. I have uncommented that line in my code. — Mr. Nobody, Feb 05 '19 at 03:54
@skomisa thanks for the answer. It was just a typo and I have uncommented that line in my code (the problem is not related to that!). — Mr. Nobody, Feb 05 '19 at 04:01
@Mr.Nobody I also tried reading Amazon's home page [using JSoup](https://jsoup.org/) with limited success. There are some [JSoup examples here which use Amazon's home page](https://able.bio/DavidLandup/introduction-to-web-scraping-with-java-jsoup--641yfyl). It seems that Amazon deliberately do not make it easy to scrape their pages. — skomisa, Feb 05 '19 at 04:22
@Mr.Nobody Yes you definitely receive zipped data. See my answer, it gives you the code how to unzip it. I ran it and got the clear text of the amazon page — Michael Gantman, Feb 21 '23 at 11:53

score 1 · Answer 1 · answered Feb 04 '19 at 15:17

Use HttpsURLConnection instead of URLConnection. See a similar question.

David


I have already examined it with 'HttpsUrlConnection', but nothing changed! – Mr. Nobody Feb 04 '19 at 17:21
@skomisa Sorry, my bad! – David Feb 05 '19 at 09:21

score 0 · Answer 2 · answered Feb 21 '23 at 10:53

You just need to unzip your content. Here is the code that worked for me:

HttpClient httpClient = new HttpClient();
try {
    httpClient.setConnectionUrl("https://www.amazon.com");
    ByteBuffer buff = httpClient.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11")
            .sendHttpRequestForBinaryResponse(HttpClient.HttpMethod.GET);
    try (
            ByteArrayInputStream bais = new ByteArrayInputStream(buff.array());
            GZIPInputStream gzis = new GZIPInputStream(bais);
            InputStreamReader isr = new InputStreamReader(gzis);
            BufferedReader br = new BufferedReader(isr)
    ) {
        br.lines().forEach(line -> System.out.println(line));
    }
} catch (Exception e) {
    System.out.println(httpClient.getLastResponseCode() + " "
            + httpClient.getLastResponseMessage() + TextUtils.getStacktrace(e, false));
}

A few clarifications: in this example I use a third-party HTTP client class, HttpClient (and also the class TextUtils). Both come from the open-source MgntUtils library, written and maintained by me. But you don't have to use it. The main point is to read the data from the InputStream as binary (as a byte array or ByteBuffer) and then unzip it with GZIPInputStream, as in my example.
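For readers who prefer to stay with the JDK classes from the question, here is a minimal sketch of the same idea without MgntUtils. It assumes the server labels a compressed body with a Content-Encoding: gzip header and that the page is UTF-8. The class name PageSource and the helper decode are hypothetical names chosen for illustration, and the shortened User-Agent value is only a placeholder.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, not part of any library.
public class PageSource {

    // Fetches a URL and returns its body as text, unzipping it when the
    // response header says the body is gzip-compressed.
    public static String getURLSource(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        // Advertise gzip only, so the server has no reason to pick an
        // encoding this sketch cannot decode.
        connection.setRequestProperty("Accept-Encoding", "gzip");
        try (InputStream body = connection.getInputStream()) {
            return decode(body, connection.getContentEncoding(), StandardCharsets.UTF_8);
        }
    }

    // Wraps the raw stream in a GZIPInputStream if contentEncoding is "gzip"
    // (case-insensitive), then reads it as text in the given charset.
    // Line breaks are normalised to "\n".
    public static String decode(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.lines().collect(Collectors.joining("\n"));
        }
    }
}
```

Calling PageSource.getURLSource("https://www.amazon.com") would then take the place of getURLSource in the question. Checking the header instead of always unzipping keeps the method usable for servers that reply with an uncompressed body.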

If you do want to use the MgntUtils library, you can get it as a Maven artifact or from GitHub (including source code and Javadoc). The Javadoc is also available online.
