I am trying to scrape a Google search page to learn scraping, using code like this:

doc = Nokogiri::HTML(URI.open("https://www.google.com/search?q=cardiovascular+diesese"))

I want to get the result statistics text that appears on every search results page:

[image: result-stat]

but I can't find that content in the parsed HTML. When I inspect the page in the browser, I can see it is in a <div id="result-stats">. I tried this to find it:

doc.at_css('[id="result-stats"]').text
the Tin Man
Rafayet Monon
  • What you are doing is most likely a violation of Google's TOS. The correct way to do this is by using the [custom search API](https://developers.google.com/custom-search/v1/overview) or one of their other APIs. Web scrapers are usually the wrong answer to any problem and are extremely brittle. – max Mar 20 '20 at 13:55
  • Thanks for your suggestion, but I'm not using it for any real-life project. It's just for educational purposes. – Rafayet Monon Mar 20 '20 at 14:17
  • Google uses a lot of DHTML. The TOS specifically rules out scraping, and they supply a very good API as an alternate means of gathering their data. Nokogiri doesn't handle DHTML, Ajax, JavaScript or JSON, so you have to rely on tools that understand JavaScript and DHTML; once the page has loaded and settled down, _then_ grab the HTML and pass it to Nokogiri. Inspecting the page in a browser is the last thing to trust, especially when trying to scrape a page with Nokogiri. Browsers cover up all sorts of issues that affect an XML/HTML parser. Use `curl`, `wget` or `nokogiri` at the command line instead. – the Tin Man Mar 20 '20 at 19:56
  • The node you're looking for is loaded using JavaScript. To check, turn off JavaScript in your browser and reload the page, or use `nokogiri` at the command line, then search for that node in the returned `@doc`. How to scrape JavaScript pages is an entirely different question. – the Tin Man Mar 21 '20 at 21:00
  • Use "[Google Custom Search](https://developers.google.com/custom-search/docs/tutorial/introduction)" instead. – the Tin Man Mar 21 '20 at 21:13

2 Answers

Your use of CSS is awkward. Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="result-stats">foo</div>
  </body>
</html>
EOT

doc.at_css('[id="result-stats"]').text # => "foo"
doc.at('#result-stats').text # => "foo"

CSS uses # for id, so '[id="result-stats"]' is unnecessarily verbose.

Nokogiri is smart enough to tell whether a selector is CSS or XPath by looking at it. In many years of using it I've only fooled it once, and was then forced to use the CSS- or XPath-specific versions (at_css, at_xpath, css, xpath) of the generic search and at methods. By using the generic methods you can switch the selector between CSS and XPath without changing the method being called. "Using 'at', 'search' and their siblings" talks about this.

In addition, just for fun, Nokogiri should have all the jQuery extensions to CSS as those were on the v2.0 roadmap for Nokogiri.

the Tin Man

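You need a tool that runs the page's JavaScript, such as Selenium WebDriver, to get dynamic content. Nokogiri only parses the HTML it is given and does not execute scripts, so it cannot see that node on its own.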

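require 'nokogiri'
require 'selenium-webdriver'

# A real browser runs the page's JavaScript before we read the HTML.
driver = Selenium::WebDriver.for :firefox
driver.get "https://www.google.com/search?q=cardiovascular+diesese"
doc = Nokogiri::HTML(driver.page_source)
driver.quit # close the browser once the rendered HTML has been captured
node = doc.at_css('[id="result-stats"]') # nil if the element is missing
puts node ? node.text : 'result-stats not found'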
the Tin Man
Rafayet Monon