
I'm trying to get some video game data from Metacritic and I keep on getting a 404 error on this webpage:

http://www.metacritic.com/game/playstation-2/ico

The connect command is very basic:

Jsoup.connect(url).userAgent("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36").timeout(0).get();

Out of the hundreds of similar video game webpages on Metacritic I've tried connecting to, that's the only one that returns the 404 every time. Any idea why?

heisenbergman

2 Answers


The server is returning a 404.

$ curl -I http://www.metacritic.com/game/playstation-2/ico
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=UTF-8
Server: Apache
X-Varnish: 868026494
Date: Tue, 10 Sep 2013 15:26:21 GMT
Connection: keep-alive

The fact that the response body doesn't look like a 404 page doesn't affect Jsoup; it only looks at the status code the server sends in the HTTP response.

Welcome to the craptastic "how does anything work?!" world of the internets. :) Interestingly, curl -I http://www.metacritic.com/game/playstation-2/SDKFJSDF returns an HTTP-header code of 200 OK yet displays a page whose content says 404. Did I mention the internets is full of crap?

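You can ignore these errors by calling ignoreHttpErrors(true) on the connection before get(), e.g. Jsoup.connect(url).ignoreHttpErrors(true).get(). Jsoup then parses the response body regardless of the status code instead of throwing an exception.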

yshavit

I realize this is pretty late for your question, but I ran into this today and finally worked out where Metacritic went wrong. It looks like they have an Apache configuration that returns a 404 whenever an .ico file (or most other static assets, such as images, scripts, and stylesheets) is requested. They likely have something like this set up:

RewriteRule (js|ico|gif|jpg|png|css|xml)$ - [R=404,L,NC]

And they're missing an escaped period (\.) before those extensions. Thus, any URL that ends in one of those strings, even when it is just the end of a game name (as with "ico"), returns a 404 along with the page content. Proof:

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojpg'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojpgz'
HTTP/1.1 200 OK

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooxml'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foocss'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/foojs'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooico'
HTTP/1.1 404 Not Found

$ curl -I -H 'User-Agent: Mozilla...' 'http://www.metacritic.com/game/pc/fooicoo'
HTTP/1.1 200 OK

Which I find kind of amusing :) Anyway, mystery solved.
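
The suspected pattern bug can also be reproduced locally without touching the site. The following is a minimal sketch, assuming the guessed rule above is roughly what the server uses; the class name SuffixRuleDemo, both patterns, and the sample paths are illustrative assumptions, not Metacritic's actual configuration. It compares the pattern without the dot against one with an escaped dot, using java.util.regex as a stand-in for Apache's regex engine.

```java
import java.util.regex.Pattern;

class SuffixRuleDemo {
    // Hypothetical reconstruction of the suspected rule: no dot before the extensions.
    static final Pattern BUGGY =
            Pattern.compile("(js|ico|gif|jpg|png|css|xml)$", Pattern.CASE_INSENSITIVE);

    // Presumably intended rule: a literal dot must precede the extension.
    static final Pattern FIXED =
            Pattern.compile("\\.(js|ico|gif|jpg|png|css|xml)$", Pattern.CASE_INSENSITIVE);

    // find() searches anywhere in the path, like an unanchored RewriteRule pattern.
    static boolean blockedByBuggy(String path) {
        return BUGGY.matcher(path).find();
    }

    static boolean blockedByFixed(String path) {
        return FIXED.matcher(path).find();
    }

    public static void main(String[] args) {
        String[] paths = {
            "/game/playstation-2/ico", // game name that happens to end in "ico"
            "/game/pc/foojpg",
            "/game/pc/foojpgz",
            "/game/pc/fooicoo",
            "/favicon.ico"             // a real asset path, for comparison
        };
        for (String path : paths) {
            System.out.printf("%-26s buggy=%-5b fixed=%b%n",
                    path, blockedByBuggy(path), blockedByFixed(path));
        }
    }
}
```

With the dot in place, only paths that really end in a file extension (such as /favicon.ico) match, which is consistent with the curl results above: the pattern without the dot also catches game slugs like "ico" and "foojpg", but not "foojpgz" or "fooicoo".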

Brett Woodward