How do I specify XPATH or CSS in Nokogiri to scrape a page's table data?

Question

I'm trying to scrape a page with financial data using Nokogiri and Ruby 1.9.3.

I'm having trouble getting the right XPath or CSS filter to get the table that holds the data, then iterate through the data and assemble it so the output can be put into a CSV file like this:

Date,Company,Symbol,ReportedEPS,Consensus EPS
20130828,CDN WESTERN BANK,CWB.TO,0.60,0.59

I used Firebug to get the XPath and CSS data. What is the correct format for XPath or CSS to extract the table then iterate through the lines to assemble them for output to a file?

require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'uri'

@agent = Mechanize.new do |a|
  a.user_agent_alias = "Windows IE 6"
end

url = "http://biz.yahoo.com/z/20130828.html"
page = @agent.get(url)
doc = Nokogiri::HTML(page.body)
puts doc.inspect 

#~ from firebug
#~ xpath        /html/body/p[3]/table/tbody
#~ css      html body p table tbody

score 2 · Answer 1 · answered Nov 26 '13 at 04:06

Some browsers will add a <tbody> to a <table> while they're parsing/validating/fixing the incoming HTML. Firefox is one of those browsers. The XPath and CSS expressions that you're getting out of Firefox are for the HTML as Firefox sees it and that's not necessarily the HTML as Nokogiri will see it.

Drop the <tbody> and try this XPath:

/html/body/p[3]/table

to locate the table. You can also look at the raw HTML and see if there is an id attribute or class attribute on the table that you can use with CSS id (#the-id) or class (.the-class) selectors instead of a large path of elements.

score 1 · Accepted Answer · edited May 23 '17 at 12:02

I generally use CSS over XPath, for readability. This is something like what I'd use:

require 'open-uri'
require 'nokogiri'

URL = "http://biz.yahoo.com/z/20130828.html"
# URI.open needs Ruby 2.5+; on older Rubies such as 1.9.3, use open(URL) instead.
doc = Nokogiri::HTML(URI.open(URL))
table = doc.css('table')[4]

data = table.search('tr')[2..-1].map { |row|
  row.search('td').map(&:text)
}

data
# => [["CDN WESTERN BANK",
#      "CWB.TO",
#      "1.69",
#      "0.60",
#      "0.59",
#      "N/A",
#      "Quote, Chart, News, ProfileReports, Research"],
#     ["Casella Waste Systems, Inc.",
#      "CWST",
#      "71.43",
#      "-0.02",
#      "-0.07",
#      "N/A",
#      "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],
#     ["Culp, Inc. Common Stock",
#      "CFI",
#      "5.56",
#      "0.38",
#      "0.36",
#      "Listen",
#      "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"],

There's a lot more data returned, but that's sufficient to show what the code is grabbing.
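To get from rows like these to the CSV layout asked for in the question, something along these lines might work. It is only a sketch: the two sample rows are copied from the output above, the column positions (company, symbol, reported EPS at index 3, consensus EPS at index 4) are inferred from the sample line in the question, and the `date` value is assumed to come from the URL's file name.

```ruby
require 'csv'

# Sample rows copied from the scraped output; a real run would use `data`
# from the scraping code instead.
data = [
  ["CDN WESTERN BANK", "CWB.TO", "1.69", "0.60", "0.59", "N/A",
   "Quote, Chart, News, ProfileReports, Research"],
  ["Casella Waste Systems, Inc.", "CWST", "71.43", "-0.02", "-0.07", "N/A",
   "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"]
]

date = "20130828" # assumption: taken from the URL's file name

csv_text = CSV.generate do |csv|
  csv << ["Date", "Company", "Symbol", "ReportedEPS", "Consensus EPS"]
  data.each do |row|
    # Keep columns 0, 1, 3 and 4; the third column and the trailing ones are skipped.
    company, symbol, _skipped, reported, consensus = row
    csv << [date, company, symbol, reported, consensus]
  end
end

puts csv_text
# Date,Company,Symbol,ReportedEPS,Consensus EPS
# 20130828,CDN WESTERN BANK,CWB.TO,0.60,0.59
# 20130828,"Casella Waste Systems, Inc.",CWST,-0.02,-0.07
```

The CSV library quotes any field that contains a comma, which is why the second company name comes out in double quotes. To write to a file instead of a string, `CSV.open('out.csv', 'w')` takes the same kind of block.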

It's not at all necessary to use Mechanize for this task. Unless you need to navigate through a site, Mechanize isn't helping you very much, so I'd go with OpenURI.

See also "How to avoid joining all text from Nodes when scraping".

Exactly what I want. Thank you. – user2720047 Nov 27 '13 at 03:35
