I am trying to scrape a webpage with Mechanize, with the following structure:

<div id="searchResultsBox">
  <div class="listings-wrap">
    <div class="listings-header">
      <div class="listing-cat">Category</div>
      <div class="listing-name">Name</div>
    </div>
    <ul class="listings">
      <li class="listing">
        <a href="/ShowRatings.jsp?tid=1143052">
          <span class="listing-cat">
            <span class="icon"></span>
            TEXT
          </span>
          <span class="listing-name">
            <span class="main">TEXT</span>
            <span class="sub">TEXT</span>
          </span>
        </a>
      </li>
      ...

I want to navigate to the page behind the <a> HTML element. Right now, I have:

agent = Mechanize.new
page = agent.get("URL")
page = page.at('#searchResultsBox > div.listings-wrap > ul > li:nth-child(1) > a')

but it keeps returning nil (verified by puts page.class).

I also tried using sleep to give the page time to load before continuing.

Is there anything I am doing wrong? I thought using the CSS selector would do the trick.
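
For completeness, once the selector does return the <a> node, my plan is roughly the following (untested, since at currently returns nil):

node = page.at('#searchResultsBox ul.listings li.listing a')
# agent.get should resolve the relative href against the current page
listing_page = agent.get(node['href'])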

  • How did you get the HTML? If Mechanize can't find that tag, either the selector is wrong or the tag doesn't actually exist in Mechanize's world. Adding `sleep` won't help because Mechanize grabs the page and then waits for you to tell it what element to find; it doesn't recursively walk the page and retrieve everything like a browser would, which also means that if sections of the page are loaded dynamically, Mechanize will never see them. Use `nokogiri` at the command line to load the page, then use `@doc.at('#searchResultsBox > div.listings-wrap > ul > li:nth-child(1) > a')` and see if it works (see the sketch after these comments). – the Tin Man Sep 27 '16 at 23:52
  • Try the following: `page.at('div#searchResultsBox a')` – Santosh Sharma Sep 28 '16 at 10:34
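
A minimal sketch of that command-line Nokogiri check, with "URL" standing in for the real search page and the selector copied from the question:

require 'nokogiri'
require 'open-uri'

# Fetch the raw HTML exactly as the server returns it (no JavaScript executed)
# and test the selector outside of Mechanize.
doc = Nokogiri::HTML(URI.open('URL'))
puts doc.at('#searchResultsBox > div.listings-wrap > ul > li:nth-child(1) > a')

If this also prints nothing, the element is not in the served HTML and is being added later by JavaScript.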

1 Answer

Maybe the website content is loaded dynamically by JavaScript.

Inspect the content of your page variable and see whether the content there is complete.

If the content is incomplete, it means there must be some other request to the server that returns that data. You can find it by opening Chrome DevTools (or a similar tool): in the "Network" tab you will see all the requests the website makes. Look for the one containing the data you need and then fetch it with Mechanize, for example as in the sketch below.
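
A minimal sketch of that approach, assuming the listing data comes back from a separate JSON request (the endpoint URL and parameters here are hypothetical placeholders for whatever DevTools actually shows):

require 'mechanize'
require 'json'

agent = Mechanize.new

# Hypothetical request copied from the DevTools "Network" tab -- replace the
# URL and query parameters with the real ones you find there.
response = agent.get('https://example.com/search/listings.json?query=foo')

# If the endpoint returns JSON, parse the response body directly
# instead of using CSS selectors on rendered HTML.
data = JSON.parse(response.body)
puts data

If the extra request returns an HTML fragment instead of JSON, you can keep using at / search on the page object Mechanize gives you back.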

– maicher