Scrape Web Pages in Batch Mode with Groovy Geb Library

Question

I'd like to scrape some Web pages given their URLs. The Geb library is claimed to be capable of screen scraping.

So far I have used the Browser.drive method with a single page URL specified inside it, which lets me scrape data from that one page. To process another page, I have to change the URL by hand and run the script again, at which point a new browser opens, and that takes quite a while. I don't need a browser window to open; I only need the data from the page. I believe there must be some mechanism for scraping all my pages in batch mode. I have read through The Book of Geb several times, but still can't find any discussion of how to do it.

What do you mean by 'each time I have to change the url manually'? Why won't 'go' work for pointing to a different page? Maybe something like what is described here? http://desmontandojava.blogspot.nl/2012/06/scraping-with-groovy-ii-geb.html - Erik Pragt, Mar 26 '13 at 08:52
Do you want to scrape the *screen* or the *data*? Do you want to navigate explicitly or recursively scrape pages by following their links? - Peter Niederwieser, Mar 26 '13 at 10:21
@ErikPragt I have a lot of pages (each with its own independent URL) to process in order to get visual information about some elements in them. I can do this very simply in Java. For example, I can define a method that takes a page URL and outputs the information, and then use a for loop to iterate through the list of URLs to be processed. How can I accomplish this using Geb? - Terry Li, Mar 26 '13 at 15:00
@PeterNiederwieser I was just using Java as an example: I'd like a method that processes one page and a for loop that goes over all the pages. From the examples I've seen in the book, I can't work out how to do that. Geb has the functionality I need, but with "go" I'm only able to process one page at a time, and I have to change the URL manually in the program to do another page. - Terry Li, Mar 26 '13 at 15:16
Well, why not make a loop, and for each item in the loop, call the 'to' method? Something like: pages.each { page -> to page }? It's mostly just Groovy code. - Erik Pragt, Mar 26 '13 at 15:25
@ErikPragt Can I just call Browser.drive once and write the for loop inside it? It seems that every time I run Browser.drive it opens a browser window, and every time I run "go" it opens a new tab. I don't want either of these to happen. All I want is to process all the pages and get information from them. Is that possible? - Terry Li, Mar 26 '13 at 16:00
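
The loop suggested in the comments might look roughly like the sketch below: one Browser instance, one Browser.drive call, and an ordinary Groovy each loop that calls go for every URL. This is only an illustration under some assumptions. The URLs and the "h1" selector are hypothetical placeholders, and it assumes that Geb and Selenium's HtmlUnitDriver are already on the classpath. HtmlUnitDriver is used here because it does not open a browser window, but it has limited JavaScript and rendering support, so a real browser driver may still be needed if the "visual information" depends on how the page is actually rendered.

```groovy
import geb.Browser
import org.openqa.selenium.htmlunit.HtmlUnitDriver

// Hypothetical list of pages to process; replace with the real URLs.
def urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

// Assumption: HtmlUnitDriver is available. It runs without opening a window.
def browser = new Browser(driver: new HtmlUnitDriver())

// Collected data, keyed by URL.
def results = [:]

try {
    // A single drive call; the same browser is reused for every URL.
    Browser.drive(browser) {
        urls.each { url ->
            go url
            // "h1" is a placeholder selector; pick whatever elements are needed.
            results[url] = [title: title, heading: $("h1").text()]
        }
    }
} finally {
    // Release the driver once all pages have been processed.
    browser.quit()
}

results.each { url, data -> println "$url -> $data" }
```

Because the driver is created once and reused, the startup cost is paid a single time rather than once per page, and go navigates the existing session instead of starting a new one.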
