I am using Watir WebDriver with headless on a Linux system running Firefox, and I am having speed issues extracting links from web pages. The problem seems to occur when a page uses multiple frames. For example, it can take 10 minutes to return all the links on www.cnet.com.

Why is it taking this long and is there anything I can do to speed it up?

For example, these are some typical timings I took (in seconds). It takes approximately 8 seconds to get all links from the "default frame", but then 20 seconds to get those from a frame:

No Frame: 8.304341236
Frame: 20.050233141
Frame: 20.070569295
....

In fact, in this case none of the frames actually contain any links. (See this issue I raised about skipping certain frames: "Watir-Webdriver Frame Attributes Not Congurent with Other Sources".)

The code to extract the links from the page is as follows:

b.links.each do |uri|
  # Check the HREF doesn't meet any of the following conditions. We don't want these so we ignore them.
  if uri.href != nil and uri.href != "" and uri.href[0,7].downcase != "mailto:" and uri.href[0,11].downcase != "javascript:"
    if debug
      puts " [x] [" + Process.pid.to_s + "] Discovered (noframe) URL: " + uri.href
    end
    # Add the discovered HREF to the array
    href.push(uri.href)
  end
end

The code used to extract links from the frames is as follows:

b.frames.each do |frame|
  frame.links.each do |uri|
    if uri.href != nil and uri.href != "" and uri.href[0,7].downcase != "mailto:" and uri.href[0,11].downcase != "javascript:"
      if debug
        puts " [x] [" + Process.pid.to_s + "] Discovered Frame URL: " + uri.href
      end
      # Add the discovered HREF to the array
      href.push(uri.href)
    end
  end
end
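Both loops above apply the same filter, and each `uri.href` call is likely a separate round trip to the browser. As a minimal sketch, the filter could be pulled into one place so that `href` is read once per link. The helper names `wanted_href?` and `collect_hrefs` are hypothetical and not part of Watir; the sketch only assumes that each link object responds to `href`.

```ruby
# Hypothetical helper, not part of Watir: true when an href is worth keeping.
# Mirrors the original conditions (not nil, not empty, not mailto:/javascript:).
def wanted_href?(href)
  return false if href.nil? || href.empty?

  lowered = href.downcase
  !lowered.start_with?("mailto:", "javascript:")
end

# Hypothetical helper: collect the wanted hrefs from any enumerable of objects
# that respond to #href (for example b.links or frame.links).
# Each link's href is read exactly once.
def collect_hrefs(links)
  links.map(&:href).select { |href| wanted_href?(href) }
end
```

With this in place, the main-page loop would reduce to something like `href.concat(collect_hrefs(b.links))`, and the frame loop to the same call on `frame.links`.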

Any help would be appreciated.

Matt S

1 Answer

I think I have found where the delay comes from, but not its actual root cause.

Earlier on in my code I am setting the following value for the timeout:

b.driver.manage.timeouts.implicit_wait = 20

If I set this to, say, 3 seconds, then my code runs significantly quicker.

That said, why is it waiting for the full timeout value?

Test results from another site:

Timeout = 3
No Frame: 8.492559438
Frame: 3.037607356
Frame: 0.21291884
Frame: 0.187332136
Total: 27.3930574

Timeout = 20
No Frame: 8.698615854
Frame: 20.039797232
Frame: 0.202382168
Frame: 0.192850861
Total: 44.221886117

I am wondering if there is a bug. If it cannot find the element you are looking for, it seems to take the entire timeout value to return.
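The behaviour described here is consistent with how WebDriver's implicit wait works: a lookup that matches nothing keeps retrying until the timeout expires, so a frame with no links can cost the full wait. One possible workaround, sketched below, is to lower the implicit wait only while scanning frames and then restore it. The helper `with_implicit_wait` is hypothetical; it assumes only that the object passed in responds to `implicit_wait=`, as `b.driver.manage.timeouts` does in the line shown earlier. The value to restore is passed explicitly rather than read back from the driver.

```ruby
# Hypothetical helper: temporarily set the implicit wait, run the block,
# then restore the previous value even if the block raises.
# `timeouts` must respond to #implicit_wait= (assumption: this is the object
# returned by b.driver.manage.timeouts).
def with_implicit_wait(timeouts, seconds, restore_to)
  timeouts.implicit_wait = seconds
  yield
ensure
  timeouts.implicit_wait = restore_to
end

# Possible usage with the browser object `b` from the question:
#
#   with_implicit_wait(b.driver.manage.timeouts, 0, 20) do
#     b.frames.each { |frame| frame.links.each { |uri| puts uri.href } }
#   end
```

Whether a wait of 0 is safe depends on the pages being crawled; frames whose content loads late may need a small non-zero value.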

Please note, I know the total does not add up, because I am just measuring the time between certain lines of code. Total is how long the run takes from start to finish, whereas the other times are measured around the individual loops.
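For anyone wanting to reproduce per-section timings like these, a minimal sketch follows. It is an assumption about one way to measure them, not the code used for the numbers above. The helper name `timed` is hypothetical, and it uses Ruby's monotonic clock so that system clock adjustments do not distort the result.

```ruby
# Hypothetical helper: run the block and return [result, elapsed_seconds].
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Possible usage with the browser object `b` from the question:
#
#   _, seconds = timed { b.links.each { |uri| uri.href } }
#   puts "No Frame: #{seconds}"
```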

Matt S