
I want to scrape the HTML from the URL listed below, but I get this error:

Aug 14, 2016 6:40:36 PM booksscraper.BooksScraper main
SEVERE: null
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=504, URL=http://www.bkstr.com/webapp/wcs/stores/servlet/CourseMaterialsResultsView?catalogId=10001&categoryId=9604&storeId=10293&langId=-1&programId=636&termId=100043741&divisionDisplayName=%20&departmentDisplayName=ACCG&courseDisplayName=16971&sectionDisplayName=P15%20DAVIS&demoKey=d&purpose=browse
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
    at booksscraper.BooksScraper.main(BooksScraper.java:52)

I have set the timeout to 0 (no limit), but that did not help. The page's HTML is very large, about 14,833 lines. Could that be the cause of the problem?

String url = "http://www.bkstr.com/webapp/wcs/stores/servlet/CourseMaterialsResultsView?catalogId=10001&categoryId=9604&storeId=10293&langId=-1&programId=636&termId=100043741&divisionDisplayName=%20&departmentDisplayName=ACCG&courseDisplayName=16971&sectionDisplayName=P15%20DAVIS&demoKey=d&purpose=browse";

Document doc = Jsoup.connect(url)
                .maxBodySize(0)
                .timeout(0)
                .get();

System.out.println(doc);
Rokin Maharjan

2 Answers


This is not a problem with the Jsoup API or with your code. The error occurs because the server behind that URL is not responding in time, so the request fails with a "Gateway Timeout" error (the proxy server did not receive a timely response from the upstream server).

Exception message from your program:-

HTTP error fetching URL. Status=504

HTTP Status code : 504

504 Gateway Timeout

The server, while acting as a gateway or proxy, did not receive a timely response from the upstream server specified by the URI (e.g. HTTP, FTP, LDAP) or some other auxiliary server (e.g. DNS) it needed to access in attempting to complete the request.

  Note: Note to implementors: some deployed proxies are known to
  return 400 or 500 when DNS lookups time out.
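
Because a 504 comes from the server side, one thing a client can try is to repeat the request after a pause. Below is a minimal sketch of that idea, assuming the failure is transient, which the 504 alone does not guarantee. The names `RetrySketch` and `withRetry` are hypothetical, and the fetch is simulated with a plain `IOException` so that the example runs without any network access or Jsoup dependency.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.function.Predicate;

class RetrySketch {

    /**
     * Runs the action up to maxAttempts times. Retries only when isTransient
     * accepts the thrown exception, doubling the pause after each failure.
     */
    static <T> T withRetry(Callable<T> action, Predicate<Exception> isTransient,
                           int maxAttempts, long initialDelayMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delay = initialDelayMillis;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                // Give up on the last attempt, or on errors not worth retrying.
                if (attempt >= maxAttempts || !isTransient.test(e)) {
                    throw e;
                }
                Thread.sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch (an assumption for the demo): fails twice, then succeeds.
        String page = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated 504");
            }
            return "<html>...</html>";
        }, e -> e instanceof IOException, 5, 100L);
        System.out.println("Fetched after " + calls[0] + " attempts: " + page);
    }
}
```

With Jsoup, the action could be `() -> Jsoup.connect(url).get()` and the predicate could check `e instanceof HttpStatusException && ((HttpStatusException) e).getStatusCode() == 504`. If the server rejects the request for another reason, retrying will not help.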
notionquest
  • Thank you for the answer, notionquest. However, the Gateway Timeout only occurs when the URL is entered directly. If we reach the page by navigating from [this URL](http://www.bkstr.com/sheridandavisstore/shop/textbooks-and-course-materials?cm_sp=GlobalJuly122016BTS-_-ShipTextbooks-_-943), no gateway timeout occurs. Why is that? – Rokin Maharjan Aug 17 '16 at 13:14

I managed to connect to the website by setting the user agent to Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36. However, the server took about 4 minutes to respond.

Rokin Maharjan