
I'm using the built-in CSV library in Ruby for my Rails app. I'm calling a URL (via HTTParty), parsing the response, and attempting to save the results to my database.

The problem is that I'm getting the error `Unquoted fields do not allow \r or \n`, which normally indicates a problem with the input data, but on examining the data I can't find any issues.

Here's how I retrieve data:

response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)

(This data is publicly available at the url weather.com/ads.txt)

I then attempt to parse the data, using skip_lines to ignore comment lines (starting with #), blank lines, and a few other patterns:

if response.code == 200 && !response.body.match(/<.*html>/)
  active_policies = []

  CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
    begin
      # print out the individual ads.txt records
      puts ""
      print row[0].downcase.strip + " " + row[1].strip + " " + row[2].split("#").first.strip
      active_policies.push(
        publisher.policies.find_or_create_by(ad_partner: row[0].downcase.strip, external_seller_id: row[1].strip, seller_relationship: row[2].split("#").first.strip) do |policy|
          policy.deactivated_at = nil
        end
      )
    rescue => save
      # Add error event to the new sync status model
      puts "we are in the loop"
      puts save.message, row.inspect, save.backtrace
      next
    end
  end
  # else
  #   puts "Too many policies.  Skipping " + publisher.name
  # end

  # Now check whether any policies are outdated and, if so, flag them as such.
  deactivated_policies = publisher.policies.where.not(id: active_policies.map(&:id)).where(deactivated_at: nil)
  deactivated_policies.update_all(deactivated_at: Time.now)
  deactivated_policies.each do |deactivated_policy|
    puts "Deactivating Policy for " + deactivated_policy.publisher.name
  end
elsif response.code == 404
  print response.code.to_s + " GET, " + response.body.size.to_s + " body, "
  puts response.headers.size.to_s + " headers for " + publisher.name
elsif response.code == 302
  print response.code.to_s + " GET, " + publisher.name
else
  puts response.code.to_s + " GET ads txt not found on " + publisher.name
end

publisher.update(last_scan: Time.now)
rescue => ex
  puts ex.message, ex.backtrace, "error pulling #{publisher.name} ..."
  # publisher.update_columns(active: "false")
end

A couple of ideas I've had / results of investigation:

  1. I have tried going through this line by line, and I've identified line 134 as what breaks the scan. I did this with a manual check like: `CSV.parse(response.body.lines[140..400].join("\n"), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/)`. But this doesn't help me, because even if I identify line 134 as the offending line, I don't know how to detect or deal with it programmatically.

    1. I noticed that the source file (at weather.com/ads.txt) has unusual characters, but even forcing it to UTF-8 via `response.body.force_encoding("UTF-8")` still throws the error.

    2. I tried adding `next` to the rescue block so that, even if it hits an error, it will move on to the next row in the CSV. But that doesn't happen: it just errors out and stops parsing, so I get the first ~130 entries but not the remaining ones.

    3. Relatedly, I'm not sure whether the page being served as HTML rather than a plain text file is creating a problem here.
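To rule that last point out, here's a minimal guard sketch I'm considering (the `text/plain` expectation and the helper name are my own assumptions, not existing code):

```ruby
# Hypothetical guard: only treat a response as parseable ads.txt when the
# status is 200, the Content-Type is text/plain, and the body does not look
# like an HTML page.
def parseable_ads_txt?(code, content_type, body)
  code == 200 &&
    content_type.to_s.split(";").first.to_s.strip.casecmp?("text/plain") &&
    !body.match?(/<\s*html/i)
end

puts parseable_ads_txt?(200, "text/plain; charset=utf-8", "google.com, 1, DIRECT\n")
puts parseable_ads_txt?(200, "text/html", "<html><body>Not Found</body></html>")
```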

I would love to know how to detect and handle this error, so any thoughts here would be most welcome!

For reference, the `#PBS` line below is apparently line 134, the one giving me trouble in the source file, but I don't fully trust that this is the issue:

#canada

google.com, pub-0942427266003794, DIRECT, f08c47fec0942fa0
indexexchange.com, 184315, DIRECT
indexexchange.com, 184601, DIRECT
indexexchange.com, 182960, DIRECT
openx.com, 539462051, DIRECT, 6a698e2ec38604c6

#spain

#PBS
google.com, pub-8750086020675820, DIRECT, f08c47fec0942fa0
google.com, pub-1072712229542583, DIRECT, f08c47fec0942fa0
appnexus.com, 3872, DIRECT
rubiconproject.com, 9778, DIRECT, 0bfd66d529a55807
openx.com, 539967419, DIRECT, 6a698e2ec38604c6
openx.com, 539726051, DIRECT, 6a698e2ec38604c6
google.com, pub-7442858011436823, DIRECT, f08c47fec0942fa0
sulleh

2 Answers


There are inconsistent line endings in that text, and the CSV parser is stumbling over them. A very quick fix is to remove all \r characters with:

response.body.gsub!("\r", '')

In case you are curious, one way to see the errant characters is the following code that writes the Ruby array notation of every character to a text file:

response = HTTParty.get("http://" + "weather.com/ads.txt", limit: 100, follow_redirects: true, timeout: 10)
characters = response.body.chars.inspect
output = File.open("outputfile.txt", "w")
output << characters
output.close

Open up outputfile.txt and search for \r characters. I find just a couple of them at line endings, while all other lines end with \n alone.
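To see the failure and the fix end to end, here is a self-contained demonstration using a synthetic sample (not the live file) that mixes "\n" and "\r\n" endings the same way:

```ruby
require "csv"

# Synthetic sample: most lines end with "\n", but one ends with "\r\n",
# mimicking the weather.com/ads.txt symptom described in the question.
body = "google.com, pub-1, DIRECT\n" \
       "indexexchange.com, 184315, DIRECT\r\n" \
       "openx.com, 539462051, DIRECT\n"

# Flag the physical lines that still carry a carriage return.
suspect = body.each_line.with_index(1).select { |line, _| line.include?("\r") }
suspect.each { |line, n| puts "line #{n}: #{line.inspect}" }

# The raw body trips the parser; stripping \r first fixes it.
begin
  CSV.parse(body)
rescue CSV::MalformedCSVError => e
  puts "raw body: #{e.message}"
end
rows = CSV.parse(body.gsub("\r", ""))
puts "cleaned body parsed #{rows.size} rows"
```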

John Skiles Skinner
  • the `response.body.gsub!("\r", '')` solved it! I'm curious, where is the location of outputfile.txt? apologies if that's a dumb question, but I have another issue that I suspect is the same problem - I'm getting the error `Illegal quoting in line 540.` for techradar.com/ads.txt, and I think there are hidden / errant characters present here too, just not line breaks – sulleh Aug 23 '18 at 15:14
  • The other answer, by @bkimble, uses `tr()` rather than `gsub()`, which is probably a more efficient choice for a single-character removal like this. The `outputfile.txt` would be created by that code in the Rails project root directory. The line number 540 that you are seeing may be a little misleading, because some of the lines are being skipped before CSV assigns line numbers. The illegal quoting is actually on line 580: `tremorhub.com, q017o-78mlk, RESELLER, 1a4e959a1b50034a # Premium video demand aka "Telaria" #` – John Skiles Skinner Aug 23 '18 at 15:59
  • ok I'll make the changes to use `tr()` - to avoid the "Telaria" issue, would I control for quotes via regex? – sulleh Aug 23 '18 at 20:15
  • 1
    You could, I guess. You could also use `tr()` or `gsub()` to remove all quotation mark charters. Though either of these could be messy because quotation marks are also a legitimate delimiter in .csv files, just not used in this one. I notice that the quotes that are giving you trouble are in the text after the `#` that you are removing. But, you're removing it after CSV is parsing the lines. Consider removing it beforehand. – John Skiles Skinner Aug 23 '18 at 20:25

It looks like what's happening is that the input file's lines are terminated with \n, except for lines 134 and 135, which are terminated with \r\n. CSV by default has its :row_sep set to :auto, which inspects the input to determine the most appropriate separator, and here it selected "\n". The extra carriage returns then make it think you have a multi-line field that is not enclosed in quotes.

You can fix this by pre-processing the text before CSV gets to it and removing the \r characters:

Change:

CSV.parse(response.body, skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|

to:

CSV.parse(response.body.tr("\r", ''), skip_blanks: true, skip_lines: /(^\s*#|^\s*$|^contact=|^CONTACT=|^subdomain=)/) do |row|
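Going a step further, the "Illegal quoting" issue raised in the comments on the other answer can be handled the same way: clean the text before CSV parses it. A hypothetical pre-cleaner sketch (the helper name and cleanup rules are illustrative, not part of the original code):

```ruby
require "csv"

# Hypothetical pre-cleaner: strip \r, drop inline "#" comments (which may
# contain quote characters, like the "Telaria" record), and discard lines
# that end up empty.
def clean_ads_txt(body)
  body.tr("\r", "")
      .each_line
      .map { |line| line.sub(/#.*/, "").strip }
      .reject(&:empty?)
      .join("\n")
end

raw = "tremorhub.com, q017o-78mlk, RESELLER, 1a4e959a1b50034a # aka \"Telaria\" #\r\n" \
      "#canada\n" \
      "openx.com, 539462051, DIRECT, 6a698e2ec38604c6\n"

rows = CSV.parse(clean_ads_txt(raw))
rows.each { |row| puts row.map(&:strip).join(" | ") }
```

With this in place, the comment-skipping part of the skip_lines regex becomes redundant, though the other patterns (contact=, subdomain=) are still needed.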
Billy Kimble