Ruby Mechanize Grab HTML Behind Links in Array

Question

I'm using Mechanize to grab a bunch of pages behind links. Page A lists 10 companies, and each one has a link called "[complete profile]" that leads to the full HTML I want to grab. I can't seem to traverse the links, save them into an array, and use them later, so I might as well iterate through each link and grab the URL and the company HTML at the same time. I was planning on storing the links and coming back to them, but the hrefs are not full URLs and I don't know how to turn them into full ones.

Anyway, this is what I currently have:

companyobjects = agent.page.links_with(:text => '[complete profile]')
companylinks = []

 companyobjects.each do |i|
   companylinks.push(i)
   # -> Shove each company's html into the db
   page = agent.i.href.click
   puts
   puts page
 end

The page = agent.i.href.click is where things go wrong. 'i' should be an individual company, so asking for its internal link and clicking on it should get the page, but it's not getting past "method 'i'" for some reason.

Anybody know how to grab found links and grab the html behind them? I'm lost. Any input appreciated.

Cheers

ihaztehcodez · Answer 1 · 2015-01-20T23:33:54.823

If you want to iterate over each link and visit the page it links to, this should work:

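agent.page.links_with(text: '[complete profile]').each do |link|
  # link.click follows the link and returns the Mechanize::Page behind it
  page = link.click
  html = page.body # full HTML of the linked page, as a String
  # do something with html

  # OR ignore the return value: after link.click, agent.page is the
  # page that was just loaded, so this prints the same HTML.
  # puts agent.page.body
end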

If you want to collect the full URL for each link, this should work:

links = []
agent.page.links_with(text: '[complete profile]').each do |link|
  links << URI.join(agent.page.uri, link.href).to_s
end
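
To see what the `URI.join` call above does with hrefs that are not full links, here is a minimal standalone sketch using only Ruby's standard library. The base URL and the hrefs are made-up values for illustration; they are not taken from the page in the question.

```ruby
require 'uri'

# Hypothetical values: the URI of the page the links were found on, and
# hrefs as they might appear in its HTML (relative, root-relative, absolute).
base  = URI.parse('http://example.com/companies/index.html')
hrefs = ['profile?id=1', '/profiles/2', 'http://other.example.org/3']

# URI.join resolves each href against the base, the way a browser would.
full_urls = hrefs.map { |href| URI.join(base, href).to_s }

puts full_urls
# http://example.com/companies/profile?id=1
# http://example.com/profiles/2
# http://other.example.org/3
```

An href that is already absolute passes through unchanged, so the same call should be safe for a mix of relative and absolute links.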

edited Jan 20 '15 at 23:33

answered Jan 20 '15 at 11:20

ihaztehcodez


That first iteration, how can I grab the html behind each link though? That link.click comes back to me with a Mechanize object with links and forms, etc., in that page. I'm looking to grab the full html. I tried get.link.click but that threw an error. Cheers – Rich_F Jan 20 '15 at 21:51
Think of it like you are driving a browser. After you call `link.click`, `agent.page` is now the page that corresponds to that link. I will update my answer to demonstrate. – ihaztehcodez Jan 20 '15 at 23:25
Mechanize is a complicated beast, but it can be tamed. Ruby's introspection and Mechanize's documentation are your friends here. For instance, `link.click` returned an object, but you weren't sure what to do with it. So use Ruby's introspection to your advantage: `mystery_object = link.click; puts mystery_object.class.to_s`. Now you know `mystery_object` is an instance of `Mechanize::Page`, and you can check out the [documentation](http://www.rubydoc.info/gems/mechanize/Mechanize/Page) to see what you can do with it. This approach helped me learn my way around Mechanize. – ihaztehcodez Jan 21 '15 at 00:12
Yeah, that's what I kept doing: checking the class. It kept saying Mechanize, and I wanted to grab parts of the resulting links and forms, but also jump into Nokogiri territory and grab the whole page (mo.body). The documentation is a bit fuzzy here. I'm still trying to get my head around resources like rubydoc. I'm one of those guys expecting proper documentation. Heh. Hey, thanks for the input. – Rich_F Jan 21 '15 at 00:16
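
The introspection approach from the comments can be tried without Mechanize at all. This is only a sketch: `FakePage` is a hypothetical stand-in for whatever object a call like `link.click` hands back, and it is not part of Mechanize.

```ruby
# FakePage is a made-up stand-in for an object returned by a library call.
FakePage = Struct.new(:body, :title)

mystery_object = FakePage.new('<html><body>Acme</body></html>', 'Acme')

# Which class is this? That tells you which documentation page to read.
puts mystery_object.class

# Can it give me the raw HTML?
puts mystery_object.respond_to?(:body)

# Methods defined on the class itself, leaving out everything inherited.
own_methods = mystery_object.class.instance_methods(false).sort
puts own_methods.inspect
```

The same three checks should work on a real `Mechanize::Page`, where the list of methods will be much longer.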
