2

I want to clean an irregular web content - (may be html, pdf image etc) mostly html. I am using tika parser for that. But I dont know how to apply xpath as I use in html cleaner.

The code I use is,

BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
URL u = new URL("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-    drop-moment-in-drag-and-drop");
new HtmlParser().parse(u.openStream(),handler, metadata, context);
System.out.println(handler.toString());

But in this case I am getting no output. But for the url- google.com I am getting output.

In either case I don't know how to apply the xpath.

Any ideas please...

Tried by making my custom xpath as how body content handler uses,

HttpClient client = new HttpClient();
        GetMethod method = new GetMethod("http://stackoverflow.com/questions/9128696/is-there-any-way-to-reach-drop-moment-in-drag-and-drop");
        int status = client.executeMethod(method);
        HtmlParser parse = new HtmlParser();
        XPathParser parser = new XPathParser("xhtml", "http://www.w3.org/1999/xhtml");          
        //Matcher matcher = parser.parse("/xhtml:html/xhtml:body/descendant:node()");
       Matcher matcher = parser.parse("/html/body//h1");        
ContentHandler textHandler = new MatchingContentHandler(new WriteOutContentHandler(), matcher);
        Metadata metadata = new Metadata(); 
        ParseContext context = new ParseContext();
        parse.parse(method.getResponseBodyAsStream(), textHandler,metadata ,context);   
        System.out.println("content: " + textHandler.toString()); 

But not getting the content in the given xpath..

sriram
  • 712
  • 8
  • 26
  • 1
    Tried using XHTMLContentHandler and ToHTMLContentHandler as well. All I can get is the entire html content sans tags. Still can't find a method where I can give an input Xpath and which would return the contents present in the xpath. This is really eating lot of my time. Any help plz... – sriram Feb 06 '12 at 11:51

1 Answers1

2

I'd suggest you take a look at the source code for BodyContentHandler, which comes with Tika. BodyContentHandler only returns the xml within the body tag, based on an xpath

In general though, you should use a MatchingContentHandler to wrap your chosen ContentHandler with an XPath, which is what BodyContentHandler does internally.

Gagravarr
  • 47,320
  • 10
  • 111
  • 156