
I need to scrape 10k URLs from this website, and some of them are out of service (I think... the response does not contain the JSON I'm looking for, so rest-client raises a 500 Internal Server Error in my program).

The error looks like this: `exception_with_response': 500 Internal Server Error (RestClient::InternalServerError)

To loop through the URLs, I'm using a range, `(1..30).each do |id|`, and concatenating the base URL with the current value of `id`:

response = RestClient.get(url + id.to_s)

The problem is that sometimes the URL does not exist and/or the page returns an error. How can I protect my code so that it just skips the problematic URLs and keeps scraping?

Here's my current code (I wrapped the body of the loop in a begin/rescue block, but I don't know how to write the rescue so it skips the bad URLs):

require 'nokogiri'
require 'csv'
require 'rest-client'
require 'json'

link = "https://webfec.org.br/Utils/GetCentrobyId?cod="
CSV.open('data2.csv', 'ab') do |csv|
    csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
    (1..30).each do |id|
        begin
            response = RestClient.get(link+id.to_s)
            json = JSON.parse(response)
            html = json["Data"]
            doc = Nokogiri::HTML.parse(html)

            name = doc.xpath("/html/body/table/tbody/tr[1]").text
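            # REMOVER is a list of patterns I use to clean up the text; it is defined elsewhere and not shown here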
            street = doc.xpath("/html/body/table/tbody/tr[2]").text.gsub(Regexp.union(REMOVER), " ")
            info = doc.xpath("/html/body/table/tbody/tr[3]").text.gsub(Regexp.union(REMOVER), " ")
            email = doc.xpath("/html/body/table/tbody/tr[4]").text.gsub(Regexp.union(REMOVER), " ")
            site = doc.xpath("/html/body/table/tbody/tr[5]").text.gsub(Regexp.union(REMOVER), " ")

            csv << [name, street, info, email, site]
        rescue

        end
    end
end

You can see I put everything inside the loop in a begin block, and there is a rescue block at the end, but I'm kind of lost on how to write it.

  • Maybe slow down a little, as 500 errors are usually a sign the server's not holding up very well to the volume of requests you're submitting. – tadman Feb 14 '20 at 18:57
  • I already did that. I put a sleep at the end of every iteration, but the problem is that some of the URLs aren't working. – Gregory N. M. Feb 14 '20 at 19:44
  • As it is, the code seems to work as intended. When the target returns `RestClient::InternalServerError`, the exceptions will be rescued and the loop will continue... so what problem are you having? – Toribio Feb 14 '20 at 20:56
  • Be very careful using `tbody` in an XPath or CSS selector. `tbody` is missing for a huge number of tables on the internet. Browsers fix-up code prior to displaying it and viewing the source of a page in a browser is then misleading. Always confirm that they're actually used on a particular page using `wget`, `curl` or an HTTP client. – the Tin Man Feb 14 '20 at 22:16
  • You should use a HEAD request to see if the page exists prior to trying to retrieve it. It's easier on the server, their network and yours (a sketch follows these comments). – the Tin Man Feb 14 '20 at 22:19
  • `.xpath("/html/body/table/tbody/tr[2]").text` is the source of confusion. See "[How to avoid joining all text from Nodes when scraping](https://stackoverflow.com/q/43594656/128421)" – the Tin Man Feb 14 '20 at 22:32
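
Following up on the HEAD-request suggestion above, here is a minimal sketch of what that check could look like. It assumes the endpoint responds to HEAD the same way it responds to GET; `RestClient.head` and `RestClient::ExceptionWithResponse` (the parent class of `InternalServerError` and `NotFound`) are both part of rest-client.

require 'rest-client'

link = "https://webfec.org.br/Utils/GetCentrobyId?cod="

(1..30).each do |id|
    url = link + id.to_s
    begin
        # HEAD fetches only the response headers, so it is a cheap way to
        # check whether a page exists before downloading the full body.
        RestClient.head(url)
    rescue RestClient::ExceptionWithResponse => e
        warn "Skipping #{url}: #{e.message}"
        next
    end

    response = RestClient.get(url)
    # ... parse the JSON/HTML as in the question ...
end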

1 Answer


You should just rescue the exception. For example:

(1..3).each { |i| RestClient.get('https://fooboton.free.beeceptor.com') rescue next }

The inline `rescue` modifier catches any `StandardError` (which covers rest-client's error classes), and `next` moves on to the next iteration.

So for your case do:

CSV.open('data2.csv', 'ab') do |csv|
    csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
    (1..30).each do |id|
        begin
            response = RestClient.get(link + id.to_s)
        rescue RestClient::InternalServerError
            next # skip this iteration of the loop
        end
        # ... rest of your code ...
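
Putting it together with the parsing from the question, a fuller sketch might look like the following. This is only one way to do it: it rescues `RestClient::ExceptionWithResponse` (the common parent of `InternalServerError`, `NotFound`, etc.) plus `JSON::ParserError` in case a response isn't JSON, uses an XPath without `tbody` as suggested in the comments, and leaves out the `REMOVER` clean-up from the question.

require 'nokogiri'
require 'csv'
require 'rest-client'
require 'json'

link = "https://webfec.org.br/Utils/GetCentrobyId?cod="

CSV.open('data2.csv', 'ab') do |csv|
    csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
    (1..30).each do |id|
        begin
            response = RestClient.get(link + id.to_s)
            json = JSON.parse(response.body)
        rescue RestClient::ExceptionWithResponse, JSON::ParserError => e
            warn "Skipping id #{id}: #{e.class}"
            next # move on to the next id
        end

        doc = Nokogiri::HTML.parse(json["Data"])

        # "//table//tr[n]" matches the nth row whether or not the markup
        # contains a <tbody> wrapper; take the text of the first five rows,
        # one value per CSV column.
        rows = (1..5).map { |n| doc.xpath("//table//tr[#{n}]").text.strip }
        csv << rows
    end
end
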
– lacostenycoder