How do I scrape data through Mechanize and Nokogiri?

Question

I am working on an application which gets the HTML from http://www.screener.in/.

I want to enter a company name such as "Atul Auto Ltd", submit the form, and scrape the following details from the resulting page: "CMP/BV" and "CMP".

I am using this code:

require 'mechanize'
require 'rubygems'
require 'nokogiri'

Company_name='Atul Auto Ltd.'
agent = Mechanize.new
page = agent.get('http://www.screener.in/')
form = agent.page.forms[0]
print agent.page.forms[0].fields
agent.page.forms[0]["q"]=Company_name
button = agent.page.forms[0].button_with(:value => "Search Company")
pages=agent.submit(form, button)
puts pages.at('.//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]')
# not getting any output.

The code takes me to the right page, but I don't know how to query it to get the required data.

I tried different things but was unsuccessful.

If possible, can someone point me towards a good tutorial that explains how to scrape a particular class from an HTML page? The XPath of the first "CMP/BV" is:

//*[@id="top"]/div[3]/div/table/tbody/tr/td[11]

but it is not giving any output.

Arup Rakshit · Accepted Answer · 2013-07-21T19:01:51.217

3

Using Nokogiri, I would do it as follows:

Using CSS Selectors

require 'nokogiri'
require 'open-uri'

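# URI.open (provided by open-uri) replaces the bare `open`, which no longer accepts URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))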

doc.class
# => Nokogiri::HTML::Document
doc.css('.table.draggable.table-striped.table-hover tr.strong td').class
# => Nokogiri::XML::NodeSet

row_data = doc.css('.table.draggable.table-striped.table-hover tr.strong td').map do |tdata|
  tdata.text
end

# The values below come from the web page's table
# "Peer Comparison: Top 7 companies in the same business".

row_data
# => ["6.",
#     "Atul Auto Ltd.",
#     "193.45",
#     "8.36",
#     "216.66",
#     "3.04",
#     "7.56",
#     "81.73",
#     "96.91",
#     "17.24",
#     "2.92"]

Looking at the table on the web page, CMP and CMP/BV are the third and eleventh columns respectively. In the array row_data, that puts CMP at index 2 and CMP/BV in the last position.

row_data[2] # => "193.45" #CMP
row_data.last # => "2.92" #CMP/BV

Using XPATH

require 'nokogiri'
require 'open-uri'

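# URI.open (provided by open-uri) replaces the bare `open`, which no longer accepts URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.screener.in/company/?q=Atul+Auto+Ltd.'))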

# td[3] is CMP; td[11] is the last column, CMP/BV (the same value as row_data.last)
p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[3]").text
p doc.at_xpath("//*[@id='peers']/table/tbody/tr[6]/td[11]").text
# >> "193.45" #CMP
# >> "2.92"   #CMP/BV

edited Jul 21 '13 at 19:01

answered Jul 20 '13 at 15:24

Arup Rakshit


It works. Can you tell me how you found the CSS selector? I traced it using Chrome Developer Tools. Is there any tool for this, and how can I do it using XPath? – Deepender Singla Jul 20 '13 at 15:36
I use the Firefox add-on *Firebug*. – Arup Rakshit Jul 20 '13 at 15:37
Any idea how to do it with XPath? – Deepender Singla Jul 20 '13 at 15:39
@DeependerSingla Use CSS rules; they are always easy and good to read. :) See my update. – Arup Rakshit Jul 20 '13 at 15:41
@okey still there is no harm in trying. – Deepender Singla Jul 20 '13 at 16:17
CSS isn't always the easiest form of a selector but they are usually easier to read. There are times you have to use XPath because it's more full-featured. – the Tin Man Jul 20 '13 at 19:34
@theTinMan Yes, you are right. But whenever there is a chance to use *CSS Selectors*, I use them, as they are more readable than `xpath`. – Arup Rakshit Jul 20 '13 at 19:35
@Priti Please don't offer the `xpath_for` output as suggested XPath. It tends to produce unreadable expressions because they are a literal translation of the CSS, whereas idiomatic XPath will be shorter and cleaner. – Mark Thomas Jul 21 '13 at 15:44
@priti Can you tell me how you got this XPath? Using Google Developer Tools I get the XPath //*[@id="top"]/div[3]/div/table/tbody/tr/td[10], which gives no result, while your XPath for the same is //*[@id='peers']/table/tbody/tr[6]/td[10]. I am a beginner with XPath and CSS; could you explain how to work out the XPath and CSS path? – Deepender Singla Jul 22 '13 at 07:21
@DeependerSingla I created the `xpath` expression by viewing the source of the web page. Still, I would recommend using CSS Selectors, as you have just started with Nokogiri. Once you are familiar with CSS selectors, that will gradually help you build `xpath` expressions too. I would also suggest looking into the `Firebug` add-on for Firefox for this. – Arup Rakshit Jul 22 '13 at 13:16
Mechanize includes Nokogiri, so you can load the website with a Mechanize agent and then call Nokogiri::HTML(website.body) to parse it. Why use Mechanize? Because you can save sessions, avoid robots.txt, and get some other nice things. – José Castro Dec 17 '13 at 20:43
