I'm using a patched open-uri module that allows HTTPS redirects.
What I'm trying to do is open every page on a domain. I do this by first crawling it with anemone:
    require 'anemone'
    require 'json'
    require './open_uri' # patched open-uri that follows HTTPS redirects

    class Query
      def initialize
        fs = File.read('file.json')
        data = JSON.parse(fs)
        data['items'].each do |item|
          Anemone.crawl('http://' + item['displayLink'] + '/') do |anemone|
            anemone.on_every_page do |page|
              # p page.url
              begin
                OpenURI.open_uri(page.url) do |f|
                  f.each_line do |line|
                    p line
                  end
                end
              rescue
                p '404'
                next
              end
            end
          end
          p '---------------------------------------------------------'
        end
      end
    end

    qs = Query.new
I'm trying to open each page and print every line to the console; however, all that gets printed is "404". Looking at my code, this means open_uri fails to open any of the links, even though they are valid as far as I can tell.
What am I missing here?
Also,

    rescue Exception => e
      p e
    end
prints the following to the console:
#<OpenURI::HTTPError: 404 Not Found>
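A blanket `p "404"` hides which link failed and what the server actually said. `OpenURI::HTTPError` carries the response in `e.io`, so the rescue can report the URL together with the status line. Below is a minimal, self-contained sketch of that pattern; the stdlib `TCPServer` merely stands in for a site that answers 404, so nothing here depends on your real links:

```ruby
require 'open-uri'
require 'socket'

# Local stand-in server (stdlib only) that always answers 404,
# so the rescue path can be exercised without a real site.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
Thread.new do
  client = server.accept
  # Drain the request headers, then answer 404 unconditionally.
  while (line = client.gets) && line != "\r\n"; end
  client.write "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
  client.close
end

# Hypothetical helper: fetch a URL, and on an HTTP error report
# which URL failed and the exact status the server returned.
def fetch(url)
  OpenURI.open_uri(url) { |f| f.read }
rescue OpenURI::HTTPError => e
  code, message = e.io.status # the status line, e.g. ["404", "Not Found"]
  "failed: #{url} (#{code} #{message})"
end

result = fetch("http://127.0.0.1:#{port}/missing")
puts result # e.g. failed: http://127.0.0.1:51234/missing (404 Not Found)
```

Logging `page.url` alongside the status this way would tell you whether every link fails or only some, which narrows the problem considerably.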
UPDATE:
As advised in the comments, I tried to curl the links that get the 404 error, and the output does not show a 404 page. I tried about 40 of the returned links, and none of them returns a 404 when fetched with curl. Any ideas?
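One common explanation for curl succeeding where open-uri gets a 404 is that the server discriminates on request headers, most often `User-Agent`: curl identifies itself as `curl/x.y`, while open-uri sends plain `Ruby`, and some sites answer 404 or 403 to unrecognized agents. This is only a guess about your particular sites, but it is cheap to test, because `OpenURI.open_uri` accepts header options. The sketch below fakes such a discriminating server locally with a stdlib `TCPServer` so the behaviour is visible without hitting a real domain:

```ruby
require 'open-uri'
require 'socket'

# Local stand-in server (stdlib only): answers 404 to the default Ruby
# User-Agent and 200 to a browser-like one, mimicking a site that
# blocks unknown crawlers.
server = TCPServer.new('127.0.0.1', 0)
port = server.addr[1]
Thread.new do
  2.times do
    client = server.accept
    ua = ''
    while (line = client.gets) && line != "\r\n"
      ua = line.split(': ', 2).last.strip if line.start_with?('User-Agent:')
    end
    if ua.include?('Mozilla')
      body = 'ok'
      client.write "HTTP/1.1 200 OK\r\nContent-Length: #{body.bytesize}\r\nConnection: close\r\n\r\n#{body}"
    else
      client.write "HTTP/1.1 404 Not Found\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
    end
    client.close
  end
end

url = "http://127.0.0.1:#{port}/"

# First request: open-uri's default User-Agent ("Ruby") gets rejected.
default_result = begin
  OpenURI.open_uri(url) { |f| f.read }
rescue OpenURI::HTTPError => e
  e.message
end

# Second request: a browser-like User-Agent (the exact string is just an
# example) gets through.
browser_result = OpenURI.open_uri(url,
  'User-Agent' => 'Mozilla/5.0 (compatible; MyCrawler/1.0)') { |f| f.read }

p default_result # => "404 Not Found"
p browser_result # => "ok"
```

If retrying one of your failing links with a browser-like `User-Agent` makes the 404 go away, the servers are filtering crawlers rather than the links being wrong. Comparing `page.url` against the URL you pass to curl is the other thing worth checking, since anemone may be crawling slightly different URLs than the ones in your JSON file.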