0

I am scraping a webpage at http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1

First I create my array of desired keywords (clues), I then perform an Xpath query feeding the results into CSV. Everything works smoothly but the spreadsheet needs better formatting so end users can copy and paste

Is there a way I can achieve my desired look using either CSV or Axslx

My code is below:

require 'rubygems'
require 'nokogiri'   
require 'open-uri'
require 'CSV'
require 'axlsx'

#Set encoding options to remove nasty Trademark symbols
  encoding_options = {
    :invalid           => :replace,  # Replace invalid byte sequences
    :undef             => :replace,  # Replace anything not defined in ASCII
    :replace           => '',        # Use a blank for those replacements
    :universal_newline => true       # Always break lines with \n
  }

doc = Nokogiri::HTML(open("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1"))
#For each break create a ;
doc.css('br').each{ |br| br.replace ';' }

clues = Array.new
clues << 'Operating system'
clues << 'Processors'
clues << 'Chipset'
clues << 'Memory type'
clues << 'Hard drive'
clues << 'Graphics'
clues << 'Ports'
clues << 'Webcam'
clues << 'Pointing device'
clues << 'Keyboard'
clues << 'Network interface'
clues << 'Chipset'
clues << 'Wireless'
clues << 'Power supply type'
clues << 'Energy efficiency'
clues << 'Weight'
clues << 'Minimum dimensions (W x D x H)'
clues << 'Warranty'
clues << 'Software included'
clues << 'Product color'

CSV.open("output.csv", "wb") do |csv|
  #1. Output the Clues header
  #2. Scrape the output/force encoding to remove special characters
    csv << clues
    csv << clues.map{|clue| doc.at("//td[text()='#{clue}']/following-sibling::td").text.strip.encode Encoding.find('ASCII'), encoding_options}
  #end loop
end

My code can add the entire array to one line but how do I say foreach item in the array add it to a newline? I tried \n but it didn't work.

The Output I get

The output I get

My Desired output

My desired output

Community
  • 1
  • 1
Ninja2k
  • 819
  • 3
  • 9
  • 34

3 Answers3

1

this is randym, the author of axlsx. I think you want to be doing this:

clues = Array.new
clues << 'Operating system'
clues << 'Processors'
clues << 'Chipset'
clues << 'Memory type'

Axlsx::Package.new do |p|
  p.workbook do |wb|
    wb.add_worksheet do |sheet|
      clues.each { |clue| sheet.add_row [clue] }
    end
  end
  p.serialize 'My_Spreadsheet.xlsx'
end

As to your second question:

selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue| 
         xpath = selector % clue
         [clue, doc.at(xpath).text.strip]
       end

Then use

data.each { |datum| sheet.add_row datum }

when you build the worksheet

require 'rubygems'
require 'nokogiri'   
require 'open-uri'
require 'axlsx'

doc = Nokogiri::HTML(open("http://h10010.www1.hp.com/wwpc/ie/en/ho/WF06b/321957-321957-3329742-89318-89318-5186820-5231694.html?dnr=1"))
#For each break create a ;
doc.css('br').each{ |br| br.replace ';' }

clues = Array.new
clues << 'Operating system'
clues << 'Processors'
clues << 'Chipset'
clues << 'Memory type'
clues << 'Hard drive'
clues << 'Graphics'
clues << 'Ports'
clues << 'Webcam'
clues << 'Pointing device'
clues << 'Keyboard'
clues << 'Network interface'
clues << 'Chipset'
clues << 'Wireless'
clues << 'Power supply type'
clues << 'Energy efficiency'
clues << 'Weight'
clues << 'Minimum dimensions (W x D x H)'
clues << 'Warranty'
clues << 'Software included'
clues << 'Product color'

selector = "//td[text()='%s']/following-sibling::td"
data = clues.map do |clue| 
         xpath = selector % clue
         [clue, doc.at(xpath).text.strip]
       end

Axlsx::Package.new do |p|
  p.workbook.add_worksheet do |sheet|
    data.each { |datum| sheet.add_row datum }
  end
  p.serialize 'output.xlsx'
end

Screen shots for your pleasure. enter image description here

randym
  • 2,430
  • 1
  • 19
  • 18
  • Hi randym thanks for the reply, I am not a programmer so bear with me. I have edited my question so maybe you can see what I need to do. – Ninja2k Jul 31 '12 at 07:41
  • Ninja2k - you *are* a programmer! you are solving your problems using code. Ive updated my answer with a full example from your source code that will generate the sheet you are looking for. – randym Jul 31 '12 at 08:08
  • Brilliant!! thank you for that it work like a charm, this was the first ruby program I ever made! I am a Network Engineer by trade but needed to get some specs for a bunch of computers without the carpal tunnel. – Ninja2k Jul 31 '12 at 08:21
0

Use each to iterate over the array:

clues.each do |clue|
  sheet.add_row [clue]
end
Sam Peacey
  • 5,964
  • 1
  • 25
  • 27
  • How can I do this with something more complicated such as an Xpath query. My full code is at https://gist.github.com/3205801 – Ninja2k Jul 30 '12 at 09:14
  • One question per ummm question... ;) Also you might want to clarify what you want - just looking at your code, I don't know what's not working. – Sam Peacey Jul 31 '12 at 01:09
  • sorry tried to make it simple in the question, but made it too simple ;( updated it to include pictures of what I need and a little more info – Ninja2k Jul 31 '12 at 07:43
  • It's cool, looks like the author of the gem is here to help now, so I'm sure you couldn't be in better hands. ;) – Sam Peacey Jul 31 '12 at 08:16
0

Try something like:

CSV.open("output.csv", "wb") do |csv|
  clues.each do |clue|
    value = doc.at("//td[text()='#{clue}']/following-sibling::td").text.strip.encode Encoding.find('ASCII'), encoding_options
    cvs << [clue, value]
  end
end
taro
  • 5,772
  • 2
  • 30
  • 34