I'm using the built-in CSV function in Ruby for my Rails app. I am calling a URL (via HTTParty) parsing it, and attempting to save the results to my database.
The problem is, I'm getting the error Unquoted fields do not allow \r or \n
that indicates, normally, there is a problem with the input data, but in examining the data, I can't find any issues.
Here's how I retrieve data:
response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)
(This data is publicly available at the url weather.com/ads.txt)
I then attempt to parse the data, with some regex applied to ignore everything after a #
, ignore blank lines, etc.
if response.code == 200 && !response.body.match(/<.*html>/)
active_policies = []
CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
begin
#print out the individual ads.txt records
puts ""
print row[0].downcase.strip + " " + row[1].strip + " " +
row[2].split("#").first.strip
active_policies.push(
publisher.policies.find_or_create_by(ad_partner: row[0].downcase.strip, external_seller_id: row[1].strip, seller_relationship: row[2].split("#").first.strip) do |policy|
policy.deactivated_at = nil
end
)
rescue => save
#Add error event to the new sync status model
puts "we are in the loop"
puts save.message, row.inspect, save.backtrace
next
end
end
#else
#puts "Too many policies. Skipping " + publisher.name
#end
#now we are going to run a check to see if we have any policies that are outdated, and if so, flag them as such.
deactivated_policies = publisher.policies.where.not(id: active_policies.map(&:id)).where(deactivated_at: nil)
deactivated_policies.update_all(deactivated_at: Time.now)
deactivated_policies.each do |deactivated_policy|
puts "Deactivating Policy for " + deactivated_policy.publisher.name
end
elsif response.code == 404
print
print response.code.to_s + " GET, " + response.body.size.to_s + " body, "
puts response.headers.size.to_s + " headers for " + publisher.name
elsif response.code == 302
print response.code.to_s + " GET, " + publisher.name
else
puts response.code.to_s + " GET ads txt not found on " + publisher.name
end
publisher.update(last_scan: Time.now)
rescue => ex
puts ex.message, ex.backtrace, "error pulling #{publisher.name} ..."
#publisher.update_columns(active: "false")
end
end`
A couple of ideas I've had / results of investigation:
I have tried going through this line by line, and I've identified that line 134 is what is breaking the scan. I did this by doing a manual check like so:
CSV.parse(response.body.lines[140..400].join("\n"), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/)
but this doesn't help me, because even if I identify line 134 as the offending line, I don't know how to detect or deal with it.I noticed that the source file (at weather.com/ads.txt) has unusual characters, but even forcing it to utf-8 via
response.body.force_encoding("UTF-8")
still throws the error.I tried adding
next
to the rescue block, so even if it finds an error, it will move on to the next row in the csv, but that doesn't happen - it just errors out and stops parsing - so I get the first 130~ entries, but not the remaining ones.Similar to page type, I'm not sure if the page type being HTML and not a text file is creating a problem here.
I would love to know how to detect and handle this error, so any thoughts here would be most welcome!
For reference, #PBS
is apparently the line 134 that is giving me trouble in the source file, but I don't know if I fully trust that this is the issue.
#canada
google.com, pub-0942427266003794, DIRECT, f08c47fec0942fa0
indexexchange.com, 184315, DIRECT
indexexchange.com, 184601, DIRECT
indexexchange.com, 182960, DIRECT
openx.com, 539462051, DIRECT, 6a698e2ec38604c6
#spain
#PBS
google.com, pub-8750086020675820, DIRECT, f08c47fec0942fa0
google.com, pub-1072712229542583, DIRECT, f08c47fec0942fa0
appnexus.com, 3872, DIRECT
rubiconproject.com, 9778, DIRECT, 0bfd66d529a55807
openx.com, 539967419, DIRECT, 6a698e2ec38604c6
openx.com, 539726051, DIRECT, 6a698e2ec38604c6
google.com, pub-7442858011436823, DIRECT, f08c47fec0942fa0