
Essentially, like a bulletproof tank, I want my program to absorb 404 errors and keep on rolling, crushing the interwebs and leaving corpses dead and bloodied in its wake, or, whatever.

I keep getting this error:

    Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:537)
        at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:493)
        at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:205)
        at org.jsoup.helper.HttpConnection.get(HttpConnection.java:194)
        at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:29)
        at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
        at Q.Wikipedia_Disambig_Fetcher.all_possibilities(Wikipedia_Disambig_Fetcher.java:38)
        at Q.Wikidata_Q_Reader.getQ(Wikidata_Q_Reader.java:54)
        at Q.Runner.main(Runner.java:35)

But I can't understand why because I am checking to see if I have a valid URL before I navigate to it. What about my checking procedure is incorrect?

I tried to examine the other Stack Overflow questions on this subject, but they're not very authoritative, and I implemented many of the solutions from this one and this one; so far nothing has worked.

I'm using the Apache Commons URL validator. This is the code I've been using most recently:

    //get its normal wiki disambiguation page
    String URL_check = "https://en.wikipedia.org/wiki/" + associated_alias;

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( URL_check ) )
    {
        Document docx = Jsoup.connect( URL_check ).get();
        //this can handle the less structured ones.

and

    //check the validity of the URL
    String URL_czech = "https://www.wikidata.org/wiki/Special:ItemByTitle?site=en&page=" + associated_alias + "&submit=Search";

    UrlValidator urlValidator = new UrlValidator();

    if ( urlValidator.isValid( URL_czech ) ) 
    {
        URL wikidata_page = new URL( URL_czech );
        URLConnection wiki_connection = wikidata_page.openConnection();
        BufferedReader wiki_data_pagecontent = new BufferedReader(
                                                   new InputStreamReader(
                                                        wiki_connection.getInputStream()));
– smatthewenglish
  • Your URL is valid; the response is a 404 Not Found. Read about status codes, specifically 404, at http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html#sec10.4.5 – ug_ Apr 23 '15 at 07:34
  • Is there a fast way to make my program ignore those things so it doesn't keep crashing? – smatthewenglish Apr 23 '15 at 07:39
  • `try`/`catch` is for handling errors in Java (including possibly ignoring them). But there's probably something more going on here that you'll have to dig into. – Andrew Janke Apr 23 '15 at 07:49

2 Answers


The Status=404 error means there's no page at that location. Just because a URL is valid doesn't mean there's anything there; a validator can't tell you that. The only way to find out is to fetch the URL and see whether you get an error, which is exactly what your code is doing.
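For example, a minimal sketch of absorbing the error with `try`/`catch` (the variable name `URL_check` is borrowed from the question's code; this assumes a Jsoup version that throws `HttpStatusException` on non-2xx responses, as the stack trace shows):

    import org.jsoup.HttpStatusException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;

    // ...
    try {
        Document docx = Jsoup.connect(URL_check).get();
        // process the page as usual
    } catch (HttpStatusException e) {
        // syntactically valid URL, but the server answered 4xx/5xx:
        // log it and keep rolling
        System.err.println("Skipping " + e.getUrl() + " (status " + e.getStatusCode() + ")");
    } catch (IOException e) {
        // network-level failure (timeout, DNS, etc.)
        System.err.println("Skipping " + URL_check + ": " + e.getMessage());
    }

Note that `HttpStatusException` extends `IOException`, so it must be caught first.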

– Andrew Janke
  • But my program crashes every time. I've seen that I can fetch the `HEAD` or something, but there's no good example of how to actually implement that (see the sketch after these comments). – smatthewenglish Apr 23 '15 at 07:38
  • Do a SO or Google search for "jsoup 404". Jsoup's own documentation doesn't seem very extensive. – Andrew Janke Apr 23 '15 at 07:48
  • @S.Matthew_English Do a Google search for **HTTP** 404. It's not a Java thing, no reason why Java should document it, or Jsoup either. – user207421 Apr 23 '15 at 07:54
  • No, search for "jsoup 404". He is looking to find out specifically how you can handle 404s with this Java client library he's using. "HTTP 404" is good background information, but why you might be getting it with a particular client library and how to fix it involves library-specific behavior and configuration like user agent strings, redirect following behavior, and so on. (Spoiler: I'm pretty sure this is library-specific behavior he's running in to, and there are existing Jsoup-specific SO questions that address it.) – Andrew Janke Apr 23 '15 at 08:06
  • Oh, one intermediate step, which he's probably already done: copy the URL and try going to it in a browser and with another library or tool. If it's broken everywhere, the URL is probably at fault. If it breaks only in this client, that suggests something more specific to the agent you're using, and you'll want to look for tool-specific help. – Andrew Janke Apr 23 '15 at 08:10
  • Hi, I just went to go eat, but I'll get into that right away; thank you for the suggestions. I went there in a browser and it works. I was thinking that it would/should just go there, realize that the pattern it's seeking doesn't live on that page, and then just continue on, but that's not the case. – smatthewenglish Apr 23 '15 at 08:39
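A minimal sketch of the `HEAD` pre-check mentioned in the comments above, using plain `HttpURLConnection` (the helper name `exists` is made up for illustration):

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // true if the server reports a 2xx status, without downloading the body
    static boolean exists(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("HEAD");   // request headers only
        int status = conn.getResponseCode();
        conn.disconnect();
        return status >= 200 && status < 300;
    }

This costs an extra round trip per URL, so catching the exception on the actual fetch is usually the cheaper option.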

The URLConnection throws an exception when the status code of the webpage you're downloading is an error code (such as 404) rather than a 2xx success code (such as 200 or 201). Instead of passing Jsoup a URL or a String to parse, consider passing it an input stream containing the page data.

Using the HttpURLConnection class, we can try to download the webpage with getInputStream() inside a try/catch block, and if that fails, fall back to downloading it via getErrorStream().

Consider this bit of code, which will download your wiki page even if the server returns a 404:

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.io.IOException;
    import java.io.InputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // ...
    String URL_czech = "https://en.wikipedia.org/wiki/Hudson+Township+%28disambiguation%29";

    URL wikidata_page = new URL(URL_czech);
    HttpURLConnection wiki_connection = (HttpURLConnection) wikidata_page.openConnection();
    InputStream wikiInputStream = null;

    try {
        // try to connect and use the normal input stream
        wiki_connection.connect();
        wikiInputStream = wiki_connection.getInputStream();
    } catch (IOException e) {
        // failed (e.g. a 404): fall back to the error stream,
        // which carries the body of the error page
        wikiInputStream = wiki_connection.getErrorStream();
    }
    // parse the stream with Jsoup; null charset means detect it from the page
    Document doc = Jsoup.parse(wikiInputStream, null,
            wikidata_page.getProtocol() + "://" + wikidata_page.getHost() + "/");
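Note that getErrorStream() can itself return null if the connection failed before any response arrived, so a null check before parsing is prudent. As a one-line alternative (assuming your Jsoup version supports it), Jsoup's own Connection can be told to tolerate error statuses:

    Document doc = Jsoup.connect(URL_czech).ignoreHttpErrors(true).get();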
– ug_