
Before asking this question I tried several different approaches and searched Google and StackOverflow for direction, but I can't find a solution.

Basically, I want to create a tool that returns data based on a URL and an XPath expression, for example:

URL:        http://www.google.co.uk/search?q=wicked+games
XPath:      id('rso')/li/div/h3/a

which should return these results:

http://puu.sh/3V4JG.jpg

I can parse XML fine from other URLs, for example an actual XML file such as http://renualsoft.com/jordon/person.xml, but I'm unsure how to do the same for Google.

I tried this:

    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    DocumentBuilder builder;
    Document doc = null;
    XPathExpression expr = null;
    builder = factory.newDocumentBuilder();
    doc = builder.parse("http://www.google.co.uk/search?q=wicked+games");
    XPathFactory xFactory = XPathFactory.newInstance();
    XPath xpath = xFactory.newXPath();

    expr = xpath.compile("id('rso')/li/div/h3/a/@href");
    Object result = expr.evaluate(doc, XPathConstants.NODESET);
    NodeList nodes = (NodeList) result;
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getNodeValue());
    }

However, I get this exception:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: http://www.google.co.uk/search?q=wicked+games
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1625)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:633)
    at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:189)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:799)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:237)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:300)
    at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
    at NewEmptyJUnitTest.query(NewEmptyJUnitTest.java:35)
    at NewEmptyJUnitTest.main(NewEmptyJUnitTest.java:77)
Java Result: 1

Any help or guidance would be great, thanks. I have looked elsewhere for an answer but, as I said, I couldn't find anything useful.

TehBawz
  • I just noticed a fun tag description. See the google tag. – keyser Aug 06 '13 at 11:15
  • This happens because the user agent isn't set. Also, Google doesn't want you to fetch their search results that way; it's against their TOS. Use the Google search API for a nicer, cleaner way to search. – SoWhat Aug 06 '13 at 11:17
  • @keyser yup. good find ;) – SoWhat Aug 06 '13 at 11:19
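
To illustrate the user-agent point from the comment above, here is a minimal sketch of attaching a `User-Agent` header to a connection before handing its stream to a parser. The class name, the `prepare` helper, and the user-agent string are made up for illustration. Note that a search results page is HTML rather than well-formed XML, so `DocumentBuilder` may well reject the response even once the 403 is gone, and the TOS concern still applies.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

public class UserAgentSketch {

    // Hypothetical helper: creates a connection object and sets the
    // User-Agent header. No network request is made at this point.
    static URLConnection prepare(String url, String userAgent) throws IOException {
        URLConnection connection = new URL(url).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        // The user-agent value is an illustrative placeholder.
        URLConnection connection = prepare(
                "http://www.google.co.uk/search?q=wicked+games",
                "Mozilla/5.0 (illustrative)");

        // Shows the header that will be sent once the request is made.
        System.out.println(connection.getRequestProperty("User-Agent"));

        // Calling connection.getInputStream() would perform the request;
        // that stream could then be passed to DocumentBuilder.parse(InputStream)
        // instead of passing the URL string directly.
    }
}
```

The difference from the question's code is that `builder.parse(String)` opens the URL itself with Java's default user agent, whereas here the caller controls the request headers.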

1 Answer


Would HtmlUnit work for you?

import java.io.IOException;
import java.net.MalformedURLException;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

class Example
{
    public static void main(final String args[]) throws FailingHttpStatusCodeException, MalformedURLException, IOException
    {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        webClient.getOptions().setCssEnabled(false);

        final HtmlPage page = webClient.getPage("http://www.google.co.uk/search?q=wicked+games");

        // Select the result links inside the list whose id is "rso".
        final List<?> byXPath = page.getByXPath("//ol[@id='rso']//h3/a");

        for (final Object object : byXPath)
        {
            System.out.println(((HtmlAnchor) object).getTextContent());
        }
    }
}

This will print:

Chris Isaak - Wicked Game - YouTube
The Weeknd - Wicked Games (Explicit) - YouTube
Emika - Wicked Game - YouTube
Wicked Game - Wikipedia, the free encyclopedia
THE WEEKND - WICKED GAMES LYRICS
THE WEEKND LYRICS - Wicked Games - A-Z Lyrics
The Weeknd – Wicked Games Lyrics | Rap Genius
Chris Isaak - Wicked Game - Video Dailymotion
Wicked Game | Chris Isaak | Music Video | MTV
Wicked Games
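
A note on selecting by id in XPath: a bare string predicate such as `ol['rso']` is always true in XPath 1.0, so it matches every `ol`, whereas `ol[@id='rso']` restricts the match to the element with that id. The sketch below shows the difference using the JDK's `javax.xml.xpath` on a small hand-written snippet. The markup, class name, and `count` helper are invented for illustration and are not Google's actual page structure.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPredicateSketch {

    // Made-up, well-formed markup with two lists; only one has id "rso".
    static final String MARKUP =
            "<body>"
            + "<ol id='nav'><li><h3><a href='/x'>Navigation</a></h3></li></ol>"
            + "<ol id='rso'>"
            + "<li><h3><a href='/a'>First</a></h3></li>"
            + "<li><h3><a href='/b'>Second</a></h3></li>"
            + "</ol>"
            + "</body>";

    // Hypothetical helper: returns how many nodes the expression selects.
    static int count(String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(MARKUP)),
                XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // The string literal predicate is always true: links from both lists match.
        System.out.println(count("//ol['rso']//h3/a"));      // prints 3
        // The attribute test matches only links under the "rso" list.
        System.out.println(count("//ol[@id='rso']//h3/a"));  // prints 2
    }
}
```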

Maven Dependency:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.12</version>
</dependency>
d0x
  • Hey, this throws an exception: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/NoHttpResponseException – TehBawz Aug 06 '13 at 12:03
  • @JordonBarber you added the maven dependency? – d0x Aug 06 '13 at 15:44
  • This class comes from Apache HttpComponents (httpcore), which HtmlUnit depends on. It needs to be on your classpath. – d0x Aug 07 '13 at 06:39
  • If you set things up by hand, there are some dependencies you have to add. You can see them here: http://htmlunit.sourceforge.net/dependencies.html With Maven it is much easier. – d0x Aug 07 '13 at 10:16