I am trying to extract data for a class project from a webpage (a page that shows search results). Specifically, it's this page:

http://www.target.com/c/xbox-one-games-video/-/N-55krw#navigation=true&category=55krw&searchTerm=&view_type=medium&sort_by=bestselling&faceted_value=&offset=60&pageCount=60&response_group=Items&isLeaf=true&parent_category_id=55kug&custom_price=false&min_price=from&max_price=to

I just want to extract the titles of the products.

I'm using the following code:

final WebClient webClient = new WebClient(BrowserVersion.CHROME);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setJavaScriptEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
final HtmlPage page = webClient.getPage(itemPageURL);
int tries = 20;  // Amount of tries to avoid infinite loop
while (tries > 0) {
    tries--;
    synchronized(page) {
       page.wait(2000);  // How often to check
    }
}
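// Blocks for up to the given timeout (ms); returns the number of background JavaScript jobs still running
int numThreads = webClient.waitForBackgroundJavaScript(1000000L);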

PrintWriter pw = new PrintWriter("test-target-search.txt");
pw.println(page.asXml());
pw.close();

The resulting page does not contain the product information that the web browser shows. I suspect the AJAX calls haven't completed, but I'm not sure.

Any help would be greatly appreciated. Thanks!

1 Answer

You can fetch the data directly with a GET request. Control the paging with the "pageCount" and "offset" arguments in the URL. After retrieving a page (the example below does this for one page), extract the titles with a regex or with a parser for whatever format the content is in (probably JSON).

// HtmlUnit 2.x package names
import java.net.URL;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.Page;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;

public static void main(String[] args)
{
    try
    {
        WebClient webClient = new WebClient();

        // The AJAX endpoint that the search results page calls to load its items
        URL url = new URL(
                "http://tws.target.com/searchservice/item/search_results/v1/by_keyword?callback=getPlpResponse&navigation=true&category=55krw&searchTerm=&view_type=medium&sort_by=bestselling&faceted_value=&offset=60&pageCount=60&response_group=Items&isLeaf=true&parent_category_id=55kug&custom_price=false&min_price=from&max_price=to");
        WebRequest requestSettings = new WebRequest(url, HttpMethod.GET);

        // Send headers similar to the ones the browser sends with this request
        requestSettings.setAdditionalHeader("Accept", "*/*");
        requestSettings.setAdditionalHeader("Content-Type", "application/x-www-form-urlencoded; charset=UTF-8");
        requestSettings.setAdditionalHeader("Referer", "http://www.target.com/c/xbox-one-games-video/-/N-55krw");
        requestSettings.setAdditionalHeader("Accept-Language", "en-US,en;q=0.8");
        requestSettings.setAdditionalHeader("Accept-Encoding", "gzip,deflate,sdch");
        requestSettings.setAdditionalHeader("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.3");

        // Page (not HtmlPage) because the response is raw data, not an HTML document
        Page page = webClient.getPage(requestSettings);

        // Print the raw response body
        System.out.println(page.getWebResponse().getContentAsString());
    }
    catch (Exception e)
    {
        e.printStackTrace();
    }
}
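
As a rough sketch of the paging and extraction steps, the class below builds a URL for a given page and pulls titles out of a response body with a regex. Several things here are assumptions and not confirmed by the service: the class name TitleSketch and its method names are made up for illustration, the response is assumed to be JSON wrapped in the callback named in the URL (getPlpResponse(...)), each item is assumed to carry a "title" field, and "offset" is assumed to be an item offset equal to page number times page size. Check the real response and adjust the field name before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleSketch {

    // Matches a JSONP wrapper such as getPlpResponse({...}) or getPlpResponse({...});
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    // Matches "title": "..." pairs; the field name "title" is an assumption.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the text inside a JSONP callback wrapper, or the body unchanged if there is none. */
    public static String stripCallback(String body) {
        Matcher m = JSONP.matcher(body);
        return m.matches() ? m.group(1) : body;
    }

    /** Collects every "title" value found in the body; JSON escapes are left as-is. */
    public static List<String> extractTitles(String body) {
        List<String> titles = new ArrayList<>();
        Matcher m = TITLE.matcher(stripCallback(body));
        while (m.find()) {
            titles.add(m.group(1));
        }
        return titles;
    }

    /**
     * Appends paging arguments to a base URL that already has a query string.
     * Assumes "offset" counts items, so page 0 starts at 0, page 1 at pageSize, and so on.
     */
    public static String pageUrl(String base, int page, int pageSize) {
        return base + "&offset=" + (page * pageSize) + "&pageCount=" + pageSize;
    }

    public static void main(String[] args) {
        // Made-up sample body, not real data from the service
        String sample = "getPlpResponse({\"items\":[{\"title\":\"Game A\"},{\"title\":\"Game B\"}]})";
        for (String title : extractTitles(sample)) {
            System.out.println(title);
        }
    }
}
```

To use it with the request above, pass page.getWebResponse().getContentAsString() to extractTitles. A regex is enough for a quick class project; a proper JSON parser would be more robust against nested or escaped content.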
Arya
  • Also, the URL you are calling is different from the one I was trying to get data from. How and why did you choose this URL? – Aswath Manoharan Jun 11 '15 at 22:45
  • This calls the AJAX endpoint directly. Your page fetches this URL via AJAX, so I skipped the page itself. It does work, right? – Arya Jun 12 '15 at 01:21
  • Yes, I get JSON and I am trying to parse it. One question I have: how did you figure out the URL for the AJAX endpoint? – Aswath Manoharan Jun 13 '15 at 06:50
  • I am asking only because I want to understand the solution, before I accept it! – Aswath Manoharan Jun 13 '15 at 14:55
  • @AswathManoharan You have to capture the traffic while the page loads in the browser. One way is to use Fiddler (http://www.telerik.com/fiddler), which shows all the requests the browser makes. – Arya Jun 14 '15 at 14:55