Mechanize uses URI strings to point to what it's supposed to parse. Normally we'd use an "http" or "https" scheme to point to a web server, and that's where Mechanize's strengths are, but other schemes are available, including "file", which can be used to load a local file.
I have a little HTML file on my Desktop called "test.html":
<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>
Running this code:
require 'mechanize'
agent = Mechanize.new
page = agent.get('file:/Users/ttm/Desktop/test.html')
puts page.body
Outputs:
<!DOCTYPE html>
<html>
<head></head>
<body>
<p>
Hello World!
</p>
</body>
</html>
This tells me Mechanize loaded the file, parsed it, then accessed the body.
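To see how the "file" URI passed to agent.get above breaks down, here is a minimal sketch using Ruby's standard uri library. The file_uri helper is hypothetical (it is not part of Mechanize), and it assumes a Unix-style absolute path with no characters that would need percent-escaping:

```ruby
require 'uri'

# Hypothetical helper, not part of Mechanize: builds a "file" URI string
# from a local path. Assumes a Unix-style path with no characters
# (spaces, "#", "?") that would need percent-escaping.
def file_uri(path)
  "file://#{File.expand_path(path)}"
end

uri = URI.parse(file_uri('/Users/ttm/Desktop/test.html'))
uri.scheme # => "file"
uri.path   # => "/Users/ttm/Desktop/test.html"
```

The scheme is the only part that tells Mechanize to read from disk rather than from a web server; the rest is an ordinary path.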
However, unless you actually need to manipulate forms and/or navigate pages, Mechanize is probably NOT what you want to use. Nokogiri, which Mechanize uses internally, is a better choice for parsing, extracting data, or manipulating the markup, and it's agnostic as to what scheme was used or where the file is actually located:
require 'nokogiri'
doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
puts doc.to_html
That outputs the same markup after Nokogiri has parsed it.
Back to your question of how to find the node using only Nokogiri:
Changing test.html to:
<!DOCTYPE html>
<html>
<head></head>
<body>
<div class="product_name">Hello World!</div>
</body>
</html>
and running:
require 'nokogiri'
doc = Nokogiri::HTML(File.read('/Users/ttm/Desktop/test.html'))
doc.search('div.product_name').map(&:text)
# => ["Hello World!"]
shows that Nokogiri found the node and returned the text.
This code in your sample could be better:
text = node.text
puts "product name: " + text.to_s
node.text returns a string:
doc = Nokogiri::HTML('<p>hello world!</p>')
doc.at('p').text # => "hello world!"
doc.at('p').text.class # => String
So text.to_s is redundant. Simply use text.
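A small sketch of the difference, using a plain String as a stand-in for node.text. String interpolation is shown as one possible alternative to concatenation; it is a suggestion here, not something from your sample:

```ruby
# Stand-in for node.text, which already returns a String
text = 'Hello World!'

with_to_s    = 'product name: ' + text.to_s # to_s on a String changes nothing
without_to_s = 'product name: ' + text
interpolated = "product name: #{text}"      # interpolation converts to a string for you

puts interpolated
```

All three variables hold the same string, so the shorter forms lose nothing.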