Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and the space:
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
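If the goal is a number rather than a cleaned-up string, the regex approach extends naturally. This is only a sketch: to_number is a hypothetical helper name, and it assumes the apostrophe is always a thousands separator, as in the sample data.

```ruby
# Hypothetical helper, not part of Ruby or Nokogiri.
# Assumes ' is a thousands separator, as in "14'900".
def to_number(str)
  digits = str[/[\d']+/]            # first run of digits and apostrophes, or nil
  digits && digits.delete("'").to_i # strip the separators, then convert
end

price   = to_number("CHF 14'900.-")
mileage = to_number("61'000 km")
nothing = to_number("n/a")

puts price   # 14900
puts mileage # 61000
p nothing    # nil
```

Returning nil when no digits are present keeps a missing value distinguishable from a real 0.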
"why are the data of the mileage not displaying"
Because you have a mismatch between the CSS selector (.mileage) and the actual class attribute in the HTML (milage):
require 'nokogiri'
doc = Nokogiri::HTML(%q(<li class="object-props-item milage">61'000 km</li>))
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is called on a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text of one node from another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array, use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. The line is then treated as plain text, so no element carries the price class and your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.