I have a CSV file in which I want to save all my hash values. I am using the Nokogiri SAX parser to parse an XML document and then save the result to the CSV file.

The sax parser:

require 'rubygems'
require 'nokogiri'
require 'csv'

class MyDocument < Nokogiri::XML::SAX::Document

  HEADERS = [ :titles, :identifier, :typeOfLevel, :typeOfResponsibleBody, 
              :type, :exact, :degree, :academic, :code, :text ]

  def initialize
    @infodata = {}
    @infodata[:titles] = []
  end

  def start_element(name, attrs)
    @attrs = attrs
    @content = ''
  end
  def end_element(name)
    if name == 'title'
      Hash[@attrs]["xml:lang"]
      @infodata[:titles] << @content
      @content = nil
    end
    if name == 'identifier'
       @infodata[:identifier] = @content
       @content = nil
    end
    if name == 'typeOfLevel'
       @infodata[:typeOfLevel] = @content
       @content = nil
    end
    if name == 'typeOfResponsibleBody'
       @infodata[:typeOfResponsibleBody] = @content
       @content = nil
    end
    if name == 'type'
       @infodata[:type] = @content
       @content = nil
    end
    if name == 'exact'     
       @infodata[:exact] = @content
       @content = nil
    end
    if name == 'degree'
       @infodata[:degree] = @content
       @content = nil
    end
    if name == 'academic'
       @infodata[:academic] = @content
       @content = nil
    end
    if name == 'code'
       Hash[@attrs]['source="vhs"']
       @infodata[:code] = @content 
       @content = nil
    end
    if name == 'ct:text'
       @infodata[:text] = @content # stored under :text so the key matches HEADERS
       @content = nil
    end
  end
  def characters(string)
    @content << string if @content
  end
  def cdata_block(string)
    characters(string)
  end
  def end_document
    File.open("infodata.csv", "ab") do |f|
      csv = CSV.generate_line(HEADERS.map {|h| @infodata[h] })
      csv << "\n"
      f.write(csv)
    end
  end
end

Creating a new object for every file that is stored in a folder (47,000 XML files):

parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
counter = 0

Dir.glob('/Users/macbookpro/Desktop/sax/info_xml/*.xml') do |item|
  parser.parse(File.open(item, 'rb'))
  counter += 1
  puts "Writing file nr: #{counter}"
end

The issue: I don't get a new line for every new set of values. Any ideas?

3 xml files for trying the code: https://gist.github.com/2378898 https://gist.github.com/2378901 https://gist.github.com/2378904

2 Answers

You need to open the file using "a" mode (opening a file with "w" clears any previous content).
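The difference between the two modes can be checked with a small sketch like the one below. It writes to a temporary directory; the file names `write.csv` and `append.csv` are made up for the illustration and are not part of the original script.

```ruby
require 'tmpdir'

after_write = nil
after_append = nil

Dir.mktmpdir do |dir|
  # Hypothetical file names, used only for this demonstration.
  write_path  = File.join(dir, "write.csv")
  append_path = File.join(dir, "append.csv")

  # "w" truncates the file on every open, so only the last row survives.
  3.times { |i| File.open(write_path, "w") { |f| f.write("row #{i}\n") } }

  # "a" keeps the existing content and adds each new row at the end.
  3.times { |i| File.open(append_path, "a") { |f| f.write("row #{i}\n") } }

  after_write  = File.read(write_path)
  after_append = File.read(append_path)
end

puts "w mode:"
puts after_write
puts "a mode:"
puts after_append
```

Opening once per parsed document, as `end_document` does, is therefore only safe in append mode.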

Appending an array to the csv object will automatically insert newlines. Hash#values returns an array of the values, but it would be safer to force the order. Flattening the array will potentially lead to misaligned columns (e.g. [[:title1, :title2], 'other-value'] will result in [:title1, :title2, 'other-value']). Try something like this:
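A minimal sketch of the alignment problem, using a shortened header list and made-up sample values. Joining the titles with `|` is only an assumption for the illustration; any representation that keeps them in a single cell would do.

```ruby
require 'csv'

# Shortened, hypothetical header list for the demonstration.
HEADERS = [:titles, :identifier, :type]

# Sample values; note that the hash was not filled in header order.
record = { :identifier => "42", :titles => ["t1", "t2"], :type => "course" }

# values + flatten: the column order depends on insertion order, and the
# two titles spread over two columns, shifting everything after them.
flattened = record.values.flatten

# Mapping over HEADERS forces the column order; joining keeps all titles
# in one cell so every row has exactly HEADERS.length columns.
aligned = HEADERS.map do |h|
  value = record[h]
  value.is_a?(Array) ? value.join("|") : value
end

puts CSV.generate_line(flattened)
puts CSV.generate_line(aligned)
```

With the flattened version, a document with three titles would produce a row one column wider than a document with two, so the columns no longer line up between rows.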

HEADERS = [:titles, :identifier, ...]

def end_document
  # Use only ONE of the two variants below, depending on the Ruby version.

  # with ruby 1.8.7
  File.open("infodata.csv", "ab") do |f|
    csv = CSV.generate_line(HEADERS.map { |h| @infodata[h] })
    csv << "\n"
    f.write(csv)
  end
  # with ruby 1.9.x
  CSV.open("infodata.csv", "ab") do |csv|
    csv << HEADERS.map { |h| @infodata[h] }
  end
end

The above change can be verified by executing the following:

require "csv"

class CsvAppender

  HEADERS = [ :titles, :identifier, :typeOfLevel, :typeOfResponsibleBody, :type,
              :exact, :degree, :academic, :code, :text ]

  def initialize
    @infodata = { :titles => ["t1", "t2"], :identifier => 0 }
  end

  def end_document
    @infodata[:identifier] += 1

    # with ruby 1.8.7
    File.open("infodata.csv", "ab") do |f|
      csv = CSV.generate_line(HEADERS.map { |h| @infodata[h] })
      csv << "\n"
      f.write(csv)
    end
    # with ruby 1.9.x
    #CSV.open("infodata.csv", "ab") do |csv|
    #  csv << HEADERS.map { |h| @infodata[h] }
    #end
  end

end

appender = CsvAppender.new

3.times do
  appender.end_document
end

File.read("infodata.csv").split("\n").each do |line|
  puts line
end

After running the above the infodata.csv file will contain:

"[""t1"", ""t2""]",1,,,,,,,,
"[""t1"", ""t2""]",2,,,,,,,,
"[""t1"", ""t2""]",3,,,,,,,,
cydparser
  • Hi mate, your code does the same thing as my code, and it doesn't create a new line for every new set of values –  Apr 12 '12 at 19:19
  • Which version of ruby are you using? Changing the file mode to "ab" works for me with both 1.9.2p290 and 1.9.3-p0. Does your code open infodata.csv in write mode at any other place? I will update the answer to include the code used to verify the fix. – cydparser Apr 13 '12 at 16:24
  • I use ruby v 1.8.7, I get an ArgumentError: 'mode' must be 'r', 'rb', 'w', or 'wb' –  Apr 13 '12 at 16:36
  • Ah, ruby 1.8.7 uses a different csv lib. I'll update the answer. – cydparser Apr 13 '12 at 17:30
  • I tried your code, but it writes out the same XML file over and over again. I have provided the whole code so you can have a look –  Apr 13 '12 at 18:02
  • I have also provided some XML examples. –  Apr 13 '12 at 18:22
  • Thanks paul, your code did what I asked the first time. The thing was that I needed to create a new parser object for every new XML file, and I was using the same parser object. Hope to see more of you on Stack Overflow! Cheers! –  Apr 16 '12 at 12:28
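The resolution reported in the last comment (a fresh object per file instead of one shared object) can be illustrated without Nokogiri. `TitleCollector` below is a hypothetical stand-in for the SAX handler: like `MyDocument`, it keeps its state in `@infodata` and only initializes it in `initialize`.

```ruby
# Hypothetical stand-in for the SAX handler; not part of Nokogiri.
class TitleCollector
  def initialize
    @infodata = { :titles => [] }
  end

  # Simulates parsing one file: collects its titles and returns
  # a copy of the row that end_document would write.
  def parse(titles)
    titles.each { |t| @infodata[:titles] << t }
    @infodata[:titles].dup
  end
end

# Two pretend files with their <title> contents.
files = [["a1", "a2"], ["b1"]]

# One shared handler: state from the first file leaks into the second row.
shared = TitleCollector.new
shared_rows = files.map { |titles| shared.parse(titles) }

# A fresh handler per file: each row only holds its own values.
fresh_rows = files.map { |titles| TitleCollector.new.parse(titles) }

p shared_rows
p fresh_rows
```

In the script from the question, this corresponds to moving `Nokogiri::XML::SAX::Parser.new(MyDocument.new)` inside the `Dir.glob` block, so that every file gets its own `MyDocument`.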

I guess you need an extra loop. Something similar to

CSV.open("infodata.csv", "wb") do |csv|    
  csv << @infodata.keys
  @infodata.each do |key, value|
    csv << value
  end
end
Toon