
I'm trying to fetch a specific page with Mechanize:

require 'mechanize'

agent = Mechanize.new
p agent.get("http://formitas.si")

but I get this:

`fetch': 500 => Net::HTTPInternalServerError for http://formitas.si/ -- unhandled response (Mechanize::ResponseCodeError)

while the page opens fine in a browser. Why?

davidhq
  • I used `wget` and got this – squiguy Mar 12 '14 at 16:31:

        --2014-03-12 09:30:43-- http://formitas.si/
        Resolving formitas.si (formitas.si)... 212.44.99.132
        Connecting to formitas.si (formitas.si)|212.44.99.132|:80... connected.
        HTTP request sent, awaiting response... 500 Internal Server Error
        2014-03-12 09:30:44 ERROR 500: Internal Server Error.
  • The title is misleading. The problem isn't Mechanize returning an error, it's merely the messenger reporting the problem. The server is returning 500 on a valid URL requested by Mechanize. – the Tin Man Mar 12 '14 at 16:36
  • I tried curl and it worked... so curl works while wget and Mechanize don't. – davidhq Mar 12 '14 at 20:12
  • I'm getting a 500 in chrome. I think your confusion is that it doesn't look like an error page when it loads. – pguardiario Mar 13 '14 at 01:20
  • Wow you're right... I checked now and Chrome is really getting 500 in headers... strange that it shows the page though... – davidhq Mar 13 '14 at 12:57

2 Answers


It's a problem on the server. That's easy to tell because it's a 500-series error.

Here's HTTP request diagnostics 101:

Consider what would be different between a browser and Mechanize that a server could sense. You've got the request URL itself, and the headers that are sent as part of the HTTP request.

The URL itself is easy to check visually, so it can be ruled out immediately once you've confirmed it's identical in both Mechanize and the browser.

That leaves the headers. Use a tool to check what headers your browser sends, then look to see what you're using with Mechanize. Make them match.

From experience, I suspect it's a case of the browser's signature (the `User-Agent` header) or the acceptable data-types (the `Accept` headers) differing between the browser and Mechanize, and that site not knowing how to handle one or the other.

the Tin Man
  • Yes, thank you... but it's interesting that curl works while wget and Mechanize don't... and also very interesting the above answer that makes Mechanize work by connecting straight to the IP address... What can we conclude from all that? I don't know yet. – davidhq Mar 12 '14 at 20:15
  • I conclude that they're sniffing for signatures and banning based on those. They might be allowing cURL because they know what it is. – the Tin Man Mar 13 '14 at 23:39
  • they actually return 500 to Chrome as well (see above) – davidhq Mar 14 '14 at 10:41

In the past I've run into an issue with Mechanize being unable to resolve the DNS itself.

Though I'm fairly certain Mechanize uses Resolv to get the underlying site, I too was unable to get agent.get('http://formitas.si') to work.

Instead, I used the Resolv library explicitly and requested the IP address rather than the host name.

require 'mechanize'
require 'resolv'

@agent = Mechanize.new
address = Resolv.getaddress('formitas.si')  # resolve the host to its IP
page = @agent.get("http://#{address}")

pp page

Which ended up giving me this:

#<Mechanize::Page
 {url #<URI::HTTP:0x007f7f93ec7c68 URL:http://212.44.99.132/>}
 {meta_refresh}
 {title nil}
 {iframes}
 {frames}
 {links #<Mechanize::Page::Link "" "http://www.parallels.com/plesk/">}
 {forms}>
Jeff LaJoie
  • but this approach is not useful when trying to fetch anything else than a root page... :( also if this works and normal approach returns 500 error, something seems to be broken in Mechanize :/ – davidhq Mar 12 '14 at 23:02
  • Whoa, you're trying to fetch other pages now?! That wasn't in the description! But all sarcasm aside I don't think I can be of much more help. Sites don't want you to scrape them more often than not, it's very likely not a mechanize issue but a preventative measure put in place by the site itself. Quick edit: You're probably best off using a full browser emulator like Watir/Selenium if you're having issues on the site with mechanize. They actually handle Javascript, so if the data is loaded that way, well then you can actually get it. – Jeff LaJoie Mar 13 '14 at 19:31
  • :) ok... but regarding this site... no, they're not trying to put measures against scraping in, they're just incompetent... they actually return 500 to Chrome as well (see above)... but Chrome still displays the site... it looks like that's how it works, didn't know that. – davidhq Mar 13 '14 at 23:13