Thanks for your time. I'm somewhat new to OOP and Ruby, and after combining solutions from a few different Stack Overflow answers I've gotten myself turned around.

My goal is to write a script that parses a CSV of URLs using the Nokogiri library. After trying and failing to follow redirects with open-uri and the open-uri-redirections plugin, I settled on Net::HTTP, and that got me moving...until I ran into URLs that return a 302 redirect specifically.

Here's the method I'm using to fetch each URL:

require 'Nokogiri'
require 'Net/http'
require 'csv'

def fetch(uri_str, limit = 10)
  # You should choose better exception.
  raise ArgumentError, 'HTTP redirect too deep' if limit == 0

  url = URI.parse(uri_str)
  #puts "The value of uri_str is: #{ uri_str}"
  #puts "The value of URI.parse(uri_str) is #{ url }"
  req = Net::HTTP::Get.new(url.path, { 'User-Agent' => 'Mozilla/5.0 (etc...)' })
  # puts "THE URL IS #{url.scheme + ":" + url.host + url.path}" # just a reporter so I can see if it's mangled
  response = Net::HTTP.start(url.host, url.port, :use_ssl => url.scheme == 'https') { |http| http.request(req) }
  case response
  when Net::HTTPSuccess     then  response
  when Net::HTTPRedirection then fetch(response['location'], limit - 1)
  else
    #puts "Problem clause!"
    response.error!
  end
end

Further down in my script I take the URL CSV filename from ARGV, read it with CSV.read, encode each URL as a string, then use Nokogiri::HTML.parse to turn the response into something I can examine with XPath selectors and then write to an output CSV.

Works beautifully...so long as I encounter a 200 response, which unfortunately is not every website. When I run into a 302 I'm getting this:

C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1570:in `addr_port': undefined method `+' for nil:NilClass (NoMethodError)
        from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1503:in `begin_transport'
        from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1442:in `transport_request'
        from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:1416:in `request'
        from httpcsv.rb:14:in `block in fetch'
        from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:877:in `start'
        from C:/Ruby24-x64/lib/ruby/2.4.0/Net/http.rb:608:in `start'
        from httpcsv.rb:14:in `fetch'
        from httpcsv.rb:17:in `fetch'
        from httpcsv.rb:42:in `block in <main>'
        from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
        from C:/Ruby24-x64/lib/ruby/2.4.0/csv.rb:866:in `each'
        from httpcsv.rb:38:in `<main>'

I know I'm missing something right in front of me but I can't tell what I should puts to see if it is nil. Any help is appreciated, thanks in advance.
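
One candidate for what to `puts`, offered as a guess from the trace rather than a confirmed diagnosis: `addr_port` fails on `nil`, and the host passed to `Net::HTTP.start` comes from `URI.parse(uri_str)`, so the parsed parts of each string handed to `fetch` are worth printing. The sketch below uses a hypothetical helper, `uri_parts`, that is not part of the original script, and compares an absolute URL with a path-only string.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): collect the parts of a
# URI string that Net::HTTP.start and Net::HTTP::Get.new rely on.
def uri_parts(uri_str)
  url = URI.parse(uri_str)
  { scheme: url.scheme, host: url.host, port: url.port, path: url.path }
end

# An absolute URL carries a scheme, host, and port.
absolute = uri_parts('https://www.example.com/page-302-i-cant-follow')

# A path-only string parses without error, but has no scheme, host, or port.
relative = uri_parts('/page-302-i-cant-follow')

puts "absolute: #{absolute.inspect}"
puts "relative: #{relative.inspect}"
```

If the `host` entry prints as `nil` for the string that triggers the error, that would line up with the `nil:NilClass` message in `addr_port`.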

  • In failed cases, what is `uri_str` and what is `response` and `response['location']`? – Andrew Schwartz Mar 18 '19 at 13:28
  • Thanks @AndrewSchwartz. I printed those three values into HTTPRedirection clause and respectively they come out as follows: `uri_str: https://www.example.com/page-302-i-cant-follow response: # response[location]: /page-302-i-cant-follow` – user2308493 Mar 18 '19 at 14:20
  • With those values you're getting the `NoMethodError` above? I can't reproduce that (getting, not surprisingly, a 404 not found). If you look at the source code where that error is occurring (http.rb) you should see that it's trying to do `address() + ...`, which, with that error, makes me think `address` must be `nil`. That won't be the case with this URI. Try to narrow down your code to a few lines that define a URI and a `Net::HTTP` object and reliably reproduce the error you see. – Andrew Schwartz Mar 18 '19 at 15:52
  • @AndrewSchwartz at this point I can't reproduce it. I will point out that the example I used doesn't exist, so it wouldn't produce a 302 but a 404. I'm trying to keep the site I'm working on off the public thread so after some desperate searching I found a 302 link: https://www.servicioshosting.com/sitio/ are you able to reproduce with this one? – user2308493 Mar 18 '19 at 19:49
  • I cannot reproduce it with this URL. If I leave the `http://` off, I get a different error because `URI.parse` doesn't assume a default protocol, but not the same error you reported. I would just come back at it this way if you see the error again. – Andrew Schwartz Mar 18 '19 at 20:43
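
The `response['location']` value reported in the comments, `/page-302-i-cant-follow`, is a path with no scheme or host. Assuming that is the string the recursive `fetch` call receives, one way to handle it is to resolve the header against the URL that was just requested before following it. This is a minimal sketch, not a confirmed fix for the error above; `resolve_location` is a hypothetical helper name, while `URI.join` is the standard library call doing the work.

```ruby
require 'uri'

# Hypothetical helper: resolve a Location header value against the URL that
# produced the redirect. URI.join leaves absolute locations unchanged and
# fills in the scheme and host for relative ones.
def resolve_location(request_url, location)
  URI.join(request_url, location).to_s
end

# Relative Location, as reported in the comment thread.
relative_target = resolve_location(
  'https://www.example.com/page-302-i-cant-follow',
  '/page-302-i-cant-follow'
)

# Absolute Location pointing at a different (made-up) host.
absolute_target = resolve_location(
  'https://www.example.com/a',
  'https://other.example.org/b'
)

puts relative_target
puts absolute_target
```

In the `fetch` method from the question, the redirect branch could then call `fetch(resolve_location(uri_str, response['location']), limit - 1)` so that every recursive call receives a full URL.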
