Is there a better way to extract the visible text on a web page using Nokogiri? Currently I use the inner_text
method, but it returns a lot of JavaScript along with the visible text. I only want to capture the text that is actually shown on the screen.
For example, this is what I run in IRB with Ruby 1.9.2-p290:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX"))
words = doc.inner_text
words.scan(/\w+/)
If I search that list for the word "function", it appears 20 times. However, if I open http://www.bodybuilding.com/store/catalog/new-products.jsp?addFacet=REF_BRAND:BRAND_MET_RX in a browser, the word "function" does not appear anywhere in the visible text.
Can I make Nokogiri ignore JavaScript, or is there a better way of doing this?