
I am using the code below to read the page source of a URL (https://www.amazon.com) with the "UTF-8" charset in NetBeans, but it returns unreadable characters (see the attached image). I have no idea what the problem is and would be grateful for help modifying the code so that it works properly. Thanks.

[Image: screenshot of the unreadable characters in the output]

public static String getURLSource(String url) throws IOException
{
    URL urlObject = new URL(url);
    URLConnection urlConnection = urlObject.openConnection();
    urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

    return toString(urlConnection.getInputStream());
}

private static String toString(InputStream inputStream) throws IOException
{
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8")))
    {
        String inputLine;
        StringBuilder stringBuilder = new StringBuilder();
        while ((inputLine = bufferedReader.readLine()) != null)
        {
            stringBuilder.append(inputLine);
        }

        return stringBuilder.toString();
    }
}
MC Emperor
Mr. Nobody
  • I believe you're seeing the encrypted payload. You need to use some sort of an HTTPS client to handle the exchange of keys, the validation of the server's cert and - most importantly - the decoding of the stream. – David Feb 04 '19 at 15:15
  • @skomisa thanks for the answer. I have uncommented that line in my code. – Mr. Nobody Feb 05 '19 at 03:54
  • @skomisa thanks for the answer. It was just a typo and I have uncommented that line in my code (the problem is not related to that!). – Mr. Nobody Feb 05 '19 at 04:01
  • @Mr.Nobody I also tried reading Amazon's home page [using JSoup](https://jsoup.org/) with limited success. There are some [JSoup examples here which use Amazon's home page](https://able.bio/DavidLandup/introduction-to-web-scraping-with-java-jsoup--641yfyl). It seems that Amazon deliberately do not make it easy to scrape their pages. – skomisa Feb 05 '19 at 04:22
  • Thanks again @skomisa , I will try using 'Jsoup'. – Mr. Nobody Feb 05 '19 at 05:58
  • could it be you are receiving gzipped data? – Wolfgang Feb 05 '19 at 08:48
  • @Wolfgang Thanks for comment. I don't know! – Mr. Nobody Feb 05 '19 at 09:44
  • @Mr.Nobody Yes you definitely receive zipped data. See my answer, it gives you the code how to unzip it. I ran it and got the clear text of the amazon page – Michael Gantman Feb 21 '23 at 11:53

2 Answers


Use HttpsURLConnection instead of URLConnection. See a similar question.

David

You just need to unzip (gunzip) your content. Here is the code that worked for me:

HttpClient httpClient = new HttpClient();
try {
    httpClient.setConnectionUrl("https://www.amazon.com");
    ByteBuffer buff = httpClient.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11")
            .sendHttpRequestForBinaryResponse(HttpClient.HttpMethod.GET);
    try (
            ByteArrayInputStream bais = new ByteArrayInputStream(buff.array());
            GZIPInputStream gzis = new GZIPInputStream(bais);
            // Decode explicitly as UTF-8 instead of relying on the platform default charset
            InputStreamReader isr = new InputStreamReader(gzis, "UTF-8");
            BufferedReader br = new BufferedReader(isr)
    ) {
        br.lines().forEach(line -> System.out.println(line));
    }
} catch (Exception e) {
    System.out.println(httpClient.getLastResponseCode() + " "
            + httpClient.getLastResponseMessage() + TextUtils.getStacktrace(e, false));
}

A few clarifications: in this example I use a third-party HTTP client class, HttpClient (and also the class TextUtils). Both come from the open-source MgntUtils library, written and maintained by me. But you don't have to use it. The main point is to read the data from the InputStream as binary (as a byte array or ByteBuffer) and then unzip it with GZIPInputStream, as in my example.
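For anyone who prefers to stay with the plain JDK, the same idea can be sketched roughly as below. This is only a minimal sketch under a few assumptions: the server marks a compressed body with a `Content-Encoding: gzip` response header, the page is UTF-8, and the class and helper names (`GzipPageReader`, `decodeBody`, `readAll`) are made up for illustration. `getURLSource` mirrors the method from the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical class name, used only for this illustration.
public class GzipPageReader {

    // Wrap the raw stream in a GZIPInputStream only when the response
    // header says the body is gzip-compressed (assumption: the server
    // reports this via Content-Encoding).
    public static InputStream decodeBody(InputStream raw, String contentEncoding) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding)) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Read the whole stream as UTF-8 text, keeping line breaks.
    public static String readAll(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }

    // Same shape as the method in the question, with the decompression step added.
    public static String getURLSource(String url) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        connection.setRequestProperty("Accept-Encoding", "gzip");
        return readAll(decodeBody(connection.getInputStream(),
                connection.getContentEncoding()));
    }

    public static void main(String[] args) throws IOException {
        // Offline check: gzip a small string in memory and read it back.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream compressed = new ByteArrayInputStream(buffer.toByteArray());
        System.out.println(readAll(decodeBody(compressed, "gzip")));
    }
}
```

Checking the header before unwrapping matters because GZIPInputStream throws an exception when the body is not actually gzipped. Whether a given site returns a usable page to such a request is a separate question.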

If you do want to use the MgntUtils library, you can get it as a Maven artifact or from GitHub (including source code and Javadoc). The Javadoc is also available online.

Michael Gantman