3

I'm working on a project that involves manipulating HTML documents. I want to take the body content of an existing HTML document and turn it into a new HTML document. I'm using JDOM, and I call getChild("body") to get the body element, but it returns null even though my HTML document has a body element. Could anybody help me understand this problem? I'm a student.

I would appreciate any pointers.

Code:

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;

public static void getBody() throws Exception {
    SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", true);
    Document jdomDocument = builder.build("http://www......com");
    Element root = jdomDocument.getRootElement();
    // This prints null
    System.out.println(root.getChild("body"));
}

Please see this as well. Here are my HTML document's root and its children, as printed to the console:

root.getName():html

SIZE:2

[Element: <head [Namespace: http://www.w3.org/1999/xhtml]/>]

[Element: <body [Namespace: http://www.w3.org/1999/xhtml]/>]
javanna
Arun
  • What's the value of root? It may be that the line before Element root = jdomDocument.getRootElement() isn't giving you what you expect, and the element that you're getting back doesn't have body as a child element (ie. if getRootElement() is giving you the body already). – Anthony Grist Mar 10 '11 at 12:08
  • @Anthony Grist Thank you for your attention. I do get the root element, and I can pass it to another function as well, but I can't get the body element from root. It does work when I parse the HTML file as a string instead of from a URL. – Arun Mar 10 '11 at 12:18
  • @Arun Try replacing System.out.println(root.getChild("body")); with System.out.println(root); or System.out.println(root.getName()); - that will show you which element you're getting from the getRootElement() call. It's either not the element you're expecting that contains body as a direct child, or it is the element you're expecting and something else is causing getChild("body") not to find the body element when you call it. – Anthony Grist Mar 10 '11 at 12:37
  • Looking at the JDOM API for the Element class, there are two getChild methods; one that accepts just a String as an argument and the other that accepts a String and a Namespace object as arguments. If your body element has a Namespace attached to it, then getChild("body") will return null, even if the body element is a direct child of root. It may be worth trying root.getChild("body", root.getNamespace()) rather than root.getChild("body") as well. – Anthony Grist Mar 10 '11 at 12:41
  • @Anthony Grist A little clarification: this works only if the element has the same namespace uri as the root element. – javanna Mar 10 '11 at 12:54
  • @Anthony Grist Your assumption was correct. That was the problem in my code: I forgot to pass the namespace. Thank you for pointing me in the right direction. – Arun Mar 10 '11 at 12:59

3 Answers

9

I've found some problems in your code: 1) if you want to build a document from a remote resource over the network, consider using the build method that takes a URL as input. build(String) interprets its argument as a system ID (a URI string), so passing a java.net.URL makes it explicit that you are fetching a remote page.

Document jdomDocument = builder.build( new URL("http://www........com"));

2) If you want to parse an HTML page as XML with a standard XML parser, you have to check that it is a well-formed XHTML document; otherwise it can't be parsed as XML.

3) As I already told you in another answer, root.getChild("body") returns the child of root whose name is "body" and which has no namespace. You should check the namespace of the element you're looking for; if it is in a namespace, you have to pass it like this:

root.getChild("body", Namespace.getNamespace("your_namespace_uri"));

An easy way to find out which namespace your element has is to print all of the root's children using the getChildren method:

for (Object element : doc.getRootElement().getChildren()) {
    System.out.println(element.toString());
}

If you're parsing XHTML, the namespace URI is probably http://www.w3.org/1999/xhtml, so you should do this:

root.getChild("body", Namespace.getNamespace("http://www.w3.org/1999/xhtml"));
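
To see why the lookup depends on the namespace, here is a minimal sketch that reproduces the behaviour with the JDK's built-in DOM parser instead of JDOM and TagSoup (a swap made only so the example runs without extra libraries). The class name NamespaceLookupSketch, the helper findChild, and the SAMPLE document are made up for illustration; findChild only mimics the idea behind JDOM's getChild(name, namespace) and is not JDOM's implementation.

```java
import java.io.StringReader;
import java.util.Objects;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NamespaceLookupSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample standing in for the XHTML tree the parser produces.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body><p>Hello</p></body></html>";

    // Parses the given XML with namespace awareness turned on and returns the root element.
    static Element parseRoot(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
    }

    // Returns the first direct child element matching BOTH the local name and the
    // namespace URI (null means "no namespace"), or null if there is no such child.
    static Element findChild(Element parent, String namespaceUri, String localName) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            boolean sameNamespace = Objects.equals(namespaceUri, node.getNamespaceURI());
            if (sameNamespace && localName.equals(node.getLocalName())) {
                return (Element) node;
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot(SAMPLE);
        // Lookup without a namespace: body is in the XHTML namespace, so nothing matches.
        System.out.println(findChild(root, null, "body") != null);
        // Lookup with the XHTML namespace: body is found.
        System.out.println(findChild(root, XHTML_NS, "body") != null);
    }
}
```

Run as is, this should print false for the lookup without a namespace and true for the lookup with the XHTML namespace, which mirrors the difference between the one-argument and two-argument getChild calls in JDOM.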
javanna
  • @javanna Thank you very much, you helped me again. As you said, that was the reason behind it. Thanks once again for letting me know what the problem was. – Arun Mar 10 '11 at 12:54
2

What makes you feel like you require org.ccil.cowan.tagsoup.Parser? What does it provide you that the parser built into the JDK does not?

I'd try it using another constructor for SAXBuilder. Use the parser built into the JDK and see if that helps.

Start by printing out the entire tree using XMLOutputter.

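import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;

public static void getBody() throws Exception  // do something w/ exception in real code
{
    SAXBuilder builder = new SAXBuilder(true);  // default parser, validation on
    Document document = builder.build("http://www......com");
    XMLOutputter outputter = new XMLOutputter();
    outputter.output(document, System.out);  // print the entire tree
}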
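
Another way to inspect what the parser built is to walk the tree and print each element together with its namespace URI. The sketch below uses the JDK's DOM parser rather than JDOM's XMLOutputter, purely so that it runs without extra libraries; TreeDumpSketch, dump, and the inline sample document are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TreeDumpSketch {
    // Parses the XML with namespace awareness and returns one indented line per element,
    // in the form "localName {namespaceURI}".
    static String dump(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Node root = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        StringBuilder out = new StringBuilder();
        append(root, 0, out);
        return out.toString();
    }

    // Appends the element and, recursively, its child elements; text nodes are skipped.
    private static void append(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getLocalName())
           .append(" {").append(node.getNamespaceURI()).append("}\n");
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            append(children.item(i), depth + 1, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample document; replace it with the markup you want to inspect.
        System.out.print(dump(
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head/><body><p>Hello</p></body></html>"));
    }
}
```

Seeing a namespace URI next to body in such a dump is a hint that a lookup by name alone will not match that element.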
duffymo
  • I suspect the HTML code the OP is trying to parse is not valid XHTML, and that a more forgiving parser needs to be used, hence his usage of TagSoup which handles wild invalid HTML. – JB Nizet Mar 10 '11 at 12:16
  • If that's the case, why is validation turned on? And why JDOM, since it'll choke on all invalid XML? And why would an exception not be thrown? Too many questions, not enough info. – duffymo Mar 10 '11 at 12:21
  • Thank you for the suggestion. I used TagSoup for the reason @JB Nizet gave. I tried your suggestion, but it is not working: it gets stuck while parsing. I want to deal with HTML that is not well-formed; that's why I turned on validation. I'm also still a student in development. Could you help me resolve this problem? – Arun Mar 10 '11 at 12:38
  • Thank you for your suggestion. I found the answer: the problem was that I forgot to pass the namespace as a parameter. – Arun Mar 10 '11 at 12:57
1
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Namespace;
import org.jdom.input.SAXBuilder;

public static void getBody() throws Exception {
    SAXBuilder builder = new SAXBuilder("org.ccil.cowan.tagsoup.Parser", true);
    Document jdomDocument = builder.build("http://www......com");
    Element root = jdomDocument.getRootElement();
    // Passing the element's namespace lets getChild find body;
    // "my_name_space" stands for the actual namespace URI
    System.out.println(root.getChild("body", Namespace.getNamespace("my_name_space")));
}
Arun