I need to scrape 10k URLs from this website, and some of them are out of service (I think... the server returns an error page instead of the JSON I'm looking for, so rest-client raises a 500 Internal Server Error in my program).
This is the error I get:

```
`exception_with_response': 500 Internal Server Error (RestClient::InternalServerError)
```
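As far as I can tell this is rest-client's normal behavior: any non-2xx response makes `RestClient.get` raise instead of returning, and these errors inherit from `RestClient::ExceptionWithResponse`. A minimal reproduction of what I'm seeing (httpbin.org is just a stand-in endpoint that always answers 500):

```ruby
require 'rest-client'

begin
  # A non-2xx status makes RestClient.get raise instead of returning.
  RestClient.get('https://httpbin.org/status/500')
rescue RestClient::ExceptionWithResponse => e
  puts e.response.code  # => 500 (the failed response is still attached)
end
```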
To loop through the URLs I iterate over a range, `(1..30).each do |id|`, and concatenate the base URL with the current value of `id`:

`response = RestClient.get(url + id.to_s)`
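(Side note for anyone reproducing this: `id` is an Integer, and `String#+` raises a TypeError unless it gets a String, hence the `.to_s`.) A quick sketch of the URL building, using the real base URL from my script:

```ruby
base = "https://webfec.org.br/Utils/GetCentrobyId?cod="

(1..3).each do |id|
  url = base + id.to_s  # => ".../GetCentrobyId?cod=1", ".../cod=2", ...
  puts url
end
```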
The problem is that sometimes the URL I build does not exist and/or the page returns an error. How can I protect my code so that it just skips the problematic URL and keeps scraping?
Here's my current code (I wrapped the whole loop body in a begin/rescue block, but I don't know what to write in the rescue clause):
```ruby
require 'nokogiri'
require 'csv'
require 'rest-client'
require 'json'

# REMOVER is defined elsewhere in my script; it is a list of
# strings/patterns I strip out of the scraped text.
link = "https://webfec.org.br/Utils/GetCentrobyId?cod="

CSV.open('data2.csv', 'ab') do |csv|
  csv << ['Name', 'Street', 'Info', 'E-mail', 'Site']
  (1..30).each do |id|
    begin
      response = RestClient.get(link + id.to_s)
      json = JSON.parse(response)
      html = json["Data"]
      doc = Nokogiri::HTML.parse(html)
      name   = doc.xpath("/html/body/table/tbody/tr[1]").text
      street = doc.xpath("/html/body/table/tbody/tr[2]").text.gsub(Regexp.union(REMOVER), " ")
      info   = doc.xpath("/html/body/table/tbody/tr[3]").text.gsub(Regexp.union(REMOVER), " ")
      email  = doc.xpath("/html/body/table/tbody/tr[4]").text.gsub(Regexp.union(REMOVER), " ")
      site   = doc.xpath("/html/body/table/tbody/tr[5]").text.gsub(Regexp.union(REMOVER), " ")
      csv << [name, street, info, email, site]
    rescue
      # This is where I'm stuck: what goes here so the loop just
      # moves on to the next id?
    end
  end
end
```
You can see I put everything inside the loop in a `begin` block with a `rescue` clause at the end, but I'm lost on what the rescue should actually do.
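For reference, this is the shape I think I'm after (a sketch only; I'm assuming that rescuing `RestClient::ExceptionWithResponse` catches the 500s and that `next` is the right way to skip to the following id):

```ruby
(1..30).each do |id|
  begin
    response = RestClient.get(link + id.to_s)
    # ... parse the JSON and write to the CSV as above ...
  rescue RestClient::ExceptionWithResponse => e
    # Log the failing id and move straight on to the next one.
    warn "Skipping id #{id}: #{e.message}"
    next
  rescue JSON::ParserError => e
    # The request succeeded but the body wasn't the JSON I expected.
    warn "Skipping id #{id}: bad JSON (#{e.message})"
    next
  end
end
```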