
I'm using the open-uri module, which allows HTTPS redirects.

What I'm trying to do is open every page from a domain. I do this by first crawling it through anemone:

require 'json'
require 'anemone'
require './open_uri'

class Query
  def initialize
    fs = File.read("file.json")
    data = JSON.parse(fs)
    data["items"].each do |item|
      Anemone.crawl("http://" + item["displayLink"] + "/") do |anemone|
        anemone.on_every_page do |page|
          #p page.url
          begin
            OpenURI.open_uri(page.url) do |f|
              f.each_line do |line|
                p line
              end
            end
          rescue
            p "404"
            next
          end
        end
      end
      p "---------------------------------------------------------"
    end
  end
end

qs = Query.new

I'm trying to open each page and print every line to the console, but it looks as if all that gets printed is "404". Judging from my code, that would mean open_uri fails to open any of the links, even though they are valid as far as I'm aware.

What am I missing here?

Also

rescue Exception => e
  p e
end

prints out to the console the following:

#<OpenURI::HTTPError: 404 Not Found>
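Rescuing `OpenURI::HTTPError` specifically, instead of a bare `rescue`, makes it possible to inspect what actually failed. A minimal sketch (the error is constructed by hand with `StringIO` here just to show the shape; in real code open-uri raises it for you, and the `describe_http_error` helper name is mine):

```ruby
require 'open-uri'
require 'stringio'

# Hypothetical helper: turn an open-uri error into a readable line.
def describe_http_error(e)
  # e.message carries the status line, e.g. "404 Not Found";
  # e.io holds the body the server returned along with the error.
  "HTTP error: #{e.message}"
end

begin
  # In real code this would be OpenURI.open_uri(page.url); here the
  # error is raised by hand so the example runs without a network.
  raise OpenURI::HTTPError.new("404 Not Found", StringIO.new("<html>not found</html>"))
rescue OpenURI::HTTPError => e
  puts describe_http_error(e)   # prints "HTTP error: 404 Not Found"
end
```

Printing `page.url` alongside the error (as suggested in the comments below) then shows exactly which URL produced which status.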
UPDATE

As advised in the comments, I tried curling the links that get the 404 error, and the output in the console does not show a 404 page. I tried about 40 of the returned links and none of them returns a 404 when curled from the console. Any ideas?

Bula
  • Are you sure that it is a 404? You print '404' without checking the actual exception... – Uri Agassi Mar 26 '14 at 12:09
  • Good question. I forgot to mention this. Check updates. – Bula Mar 26 '14 at 12:12
  • You can also print failed URLs, and check for yourself (using `curl` or a web browser) – Uri Agassi Mar 26 '14 at 12:15
  • Hm. Interesting. I printed some of the URLs and they all end with ">", which makes them invalid. I then tried printing all of the URLs, whether or not they return 404, and found that every URL is spelled correctly but has a ">" after it. Is this a bug in anemone? – Bula Mar 26 '14 at 12:23
  • Also, after deleting the ">" and curling the link, it does return what I want – Bula Mar 26 '14 at 12:24
  • 1
    Then you can probably hack it by using `OpenURI.open_uri(page.url[0..-2])` ? – Trygve Flathen Mar 26 '14 at 12:56
  • Looks like this doesn't work. It still takes the url with the ">" – Bula Mar 26 '14 at 13:49
  • @UriAgassi I have tried to curl them and they return a perfectly normal html page. ( not 404 ) . Might it be a problem with the module? If so then what could I use to allow https redirects? – Bula Mar 26 '14 at 14:35
  • You are asking two questions - redirect urls do _not_ return 404, they return a 3XX status code. Anyway, I believe that `anemone` handles the redirects... As for the `404`: after you remove the ">", do you still get it in your code? If yes, does hard-coding the URL (`OpenURI.open_uri("http://one.of.those.urls")`) also return 404? – Uri Agassi Mar 26 '14 at 14:44
  • @UriAgassi Yes for some reason uri returns 404 for everything that I do. I have started using Typhoeus and it works like a charm – Bula Mar 28 '14 at 20:03
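One reason the `page.url[0..-2]` hack in the comments may have failed: Anemone's `page.url` is a `URI` object rather than a `String`, so converting it to a string first and then chomping the stray ">" should work. A hedged sketch (the `clean_url` helper name is mine):

```ruby
# Hypothetical helper: strip a single stray trailing ">" from a
# crawled URL. The argument may be a URI object (as with Anemone's
# page.url) or a String; to_s normalizes it either way.
def clean_url(url)
  url.to_s.chomp(">")
end

p clean_url("http://example.com/index.html>")  # => "http://example.com/index.html"
p clean_url("http://example.com/index.html")   # unchanged
```

In the crawl loop above this would be used as `OpenURI.open_uri(clean_url(page.url))`.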
