
I am trying to scrape and make a CSV file from this HTML:

<ul class="object-props">
                <li class="object-props-item price">
                    <strong>CHF 14&#39;800.-</strong>
                </li>
                <li class="object-props-item milage">31&#39;000 km</li>
                <li class="object-props-item date">08.2012</li>
            </ul>

I want to extract the price and mileage using:

require 'rubygems'
require 'nokogiri'
require 'CSV'
require 'open-uri'

url= "/tto.htm"
data = Nokogiri::HTML(open(url))

CSV.open('csv.csv', 'wb') do |csv|
  csv << %w[ price mileage ]

  price=data.css('.price').text
  mileage=data.css('.mileage').text 

  csv << [price, mileage]
end

The result is not what I expected. Two columns are created, but how can I remove characters like CHF and km, and why is the mileage data missing from the output?

r tremeaud
  • I don't think it's the reason for your problem, but you are opening the file for writing in binary mode (`wb`). CSV is a text representation so I'm pretty sure you should be opening it in text mode (`w`). – Keith Bennett Apr 28 '16 at 10:25
  • When asking about a problem with your code, we need to see your attempt at solving the problem. In your code you don't show where you're trying to remove the CHF and KM data; Please add that. Without that it looks like you're asking us how to write your code, which isn't what SO is for. Also, your "result" shouldn't be a link to an image. Instead, present that information in the question itself. "[mcve]" describes what we need. To remove that information use `delete` or `sub` on the retrieved text or better, use `tr` to remove what you don't want or write a regex to extract only what you want. – the Tin Man Apr 28 '16 at 18:48
  • Also, your input HTML isn't valid. See my answer for an explanation why that's important. – the Tin Man Apr 28 '16 at 19:59

2 Answers


My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.

You could add split.first or split.last to get the number without the unit of measure, e.g.:

2.3.0 :007 > 'CHF  100'.split.last
 => "100"
2.3.0 :008 > '99 km'.split.first
 => "99"
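Applied to the strings from the question's HTML, the same idea might look like the sketch below. The literals are copied from the sample markup (with the entity already decoded to an apostrophe, as Nokogiri's text would return it); the variable names are only illustrative.

```ruby
# Text as it would come back from the sample markup.
price_text   = "CHF 14'800.-"
mileage_text = "31'000 km"

# split with no arguments breaks on runs of whitespace,
# so surrounding newlines or indentation do not matter.
price   = price_text.split.last     # drops the leading "CHF"
mileage = mileage_text.split.first  # drops the trailing "km"

puts price    # 14'800.-
puts mileage  # 31'000
```

This only strips the unit; the apostrophe and the trailing ".-" are still part of the string.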
Keith Bennett
  • OK, thank you for the tip, but when I apply it I guess I'm missing something, because `price.split.last` returns only the first value in my .csv. Anyway, thanks for the quick response; I did not expect such responsiveness. – r tremeaud Apr 28 '16 at 11:31
  • `price=data.css('.price').text` and `price=price.split.last` in order to apply your suggestion. – r tremeaud Apr 28 '16 at 11:45
  • Try this: `price=data.css('.price').text.split.last` – Keith Bennett Apr 28 '16 at 11:46
  • the result gives only my first value, but the format is correct, I think it is close ... – r tremeaud Apr 28 '16 at 11:50
  • 'price = data.css('.price').text.split.last' output is my .csv with just the first data ... – r tremeaud Apr 28 '16 at 12:17
  • What do you mean by 'first data'? – Keith Bennett Apr 28 '16 at 12:24
  • If you look at my question, you will see a png file showing my output. With your suggestion I get just the first value, but without the CHF, which is fine. However, I want to have the other data contained in the page as well (thank you). – r tremeaud Apr 28 '16 at 12:37
  • Sorry, that may be my fault. I think I changed the spelling from 'milage' to 'mileage' in my edit. You will probably either need to correct the spelling in your app, or revert the symbol to :milage (`mileage=data.css('.milage').text`) – Keith Bennett Apr 28 '16 at 12:48
  • I'm not sure I understand; what does it have to do with my problem? – r tremeaud Apr 28 '16 at 12:54
  • I assumed that because you asked about removing the 'KM' that you had gotten the data before, but if I'm wrong, then you're right, it's not related. However, your problem seems to be the absence of data, and that can be caused by looking for a nonexistent key (though I don't know how Nokogiri behaves in that case). I thought it might be a bad key name, i.e. the parameter passed to `data.css()`. – Keith Bennett Apr 28 '16 at 12:57
  • no I think this is a missing option in my request because in my first example I get the data, thank you for your quick help – r tremeaud Apr 28 '16 at 13:59
  • Please read http://meta.stackoverflow.com/q/297597/128421. Rather than ask the OP to accept your answer, which isn't a good idea until many hours, or a day, has passed, suggest they read the final paragraph in http://stackoverflow.com/help/someone-answers as a comment to the question itself. – the Tin Man Apr 28 '16 at 19:30
  • I am deleting the comment you referred to ("If you believe I answered your question, please accept my answer by clicking the check mark."), and will post another one later, something like: If you're new to StackOverflow, please read http://stackoverflow.com/help/someone-answers. Is that acceptable? – Keith Bennett Apr 28 '16 at 20:08

Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)   
li class="object-props-item price"
<strong>CHF 14&#39;900.-</strong>
<li class="object-props-item milage">61&#39;000 km</li>
EOT

str = doc.at('strong').text # => "CHF 14'900.-"

At this point str contains the text of the <strong> node.

A simple regex will extract the number, which is the most straightforward way to grab the data:

str[/[\d']+/] # => "14'900"

sub could be used to remove the 'CHF ' substring:

str.sub('CHF ', '') # => "14'900.-"

delete could be used to remove the characters C, H, F and space:

str.delete('CHF ') # => "14'900.-"

tr could be used to remove everything that is NOT 0..9, ', . or -:

str.tr("^0-9'.-", '') # => "14'900.-"

Modify one of the above if you don't want ', . or -.
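If the goal is a plain number for the CSV, one possible variation is to drop everything that is not a digit and convert the result. This is a sketch that assumes the amounts are whole numbers (a value with decimals would have its decimal point removed too); `to_number` is a hypothetical helper name, not part of Ruby or Nokogiri.

```ruby
# Hypothetical helper: keep only the digits of a string and convert to Integer.
# A leading ^ in the argument to String#delete negates the set,
# so every character that is NOT 0-9 gets removed.
def to_number(text)
  text.delete('^0-9').to_i
end

to_number("CHF 14'900.-") # => 14900
to_number("61'000 km")    # => 61000
```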

why are the data of the mileage not displaying

Because you have a mismatch between the CSS selector and the actual class parameter:

require 'nokogiri'

doc = Nokogiri::HTML('<li class="object-props-item milage">61&#39;000 km</li>')

doc.at('.mileage').text  # => 
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'

Instead it should be:

doc.css('.milage').text # => "61'000 km"

But that's not all that's wrong. There's a subtle problem waiting to bite you later.

css or search returns a NodeSet whereas at or at_css returns an Element:

doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element

Here's what happens when text is called on a NodeSet containing multiple matching nodes:

doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')

doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"

doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"

When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text of one node from that of another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array, use:

doc.search('p').map(&:text) # => ["foo", "bar"]

See "How to avoid joining all text from Nodes when scraping" also.

Finally, notice that your HTML sample isn't valid:

doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14&#39;900.-</strong>
<li class="object-props-item milage">61&#39;000 km</li>')
EOT

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>

Here's what happens:

doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14&#39;900.-</strong>
<li class="object-props-item milage">61&#39;000 km</li>')
EOT

doc.at('.price') # => nil

Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. As a result, no element with the price class exists in the parsed document, so your code will fail again.

Fixing the tag results in a correct response:

doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14&#39;900.-</strong>
</li>
<li class="object-props-item milage">61&#39;000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"

This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.

the Tin Man
  • I changed my input, but I can't put all the code here because it is too big. Thanks for your help. – r tremeaud Apr 29 '16 at 16:23
  • We don't want all your code. "[mcve]" says you should use the minimum code and input necessary to demonstrate the problem. The links at the bottom of "[ask]" explain the reasoning behind those requirements, which are basically to teach debugging prior to asking here. Often, after doing that, people find the solution to the problem without having to ask. – the Tin Man Apr 29 '16 at 18:28