Parsing HTML with a SAX parser

Question

I am trying to parse an ordinary HTML file using a SAX parser.

SAXBuilder builder2 = new SAXBuilder();
try {
    Document sdoc = (Document) builder2.build(readFile);
    NodeList nl = sdoc.getElementsByTagName("body");
    System.out.println("nodelist>>>>>>>>>>>" + nl.getLength());
} catch (JDOMException e1) {
    e1.printStackTrace();
}

But I am getting this exception:

Open quote is expected for attribute "{1}" associated with an  element type  "class".

Can anyone please tell me why I am getting this exception? The HTML document is well formed, and all of its opening and closing tags match.

Thanks in advance.

Is there a specific reason why you want to do this with SAX? – flash, Oct 19 '11 at 07:02
No, I just want to fetch the body content from the HTML file, so I used it. Is there any other solution? – user972590, Oct 19 '11 at 07:24
With SAX you could parse XHTML, but I'm not sure whether it can also parse HTML (most XML parsers can't). HTML doesn't have to be well-formed XML. – Mister Smith, Oct 19 '11 at 07:39

score 6 · Answer 1 · edited Oct 25 '18 at 00:38

As flash says, you need an HTML parser, not an XML parser. HTML is not XML.

Good parsers I've used are Neko and TagSoup. Neko is a good all-round parser; TagSoup specifically aims to be able to parse anything, no matter how ill-formed.

edited Oct 25 '18 at 00:38 by Marcel

answered Oct 19 '11 at 07:58 by Tom Anderson

The thing about TagSoup is that, being based on SAX, it's lightning fast and it solves all the stuff basic SAX chokes on including < and >. It's as easy to set up for as SAX; the handler methods are just the same ones--no learning curve beyond the SAX you already know. – Russ Bateman Dec 17 '15 at 18:08
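
To illustrate the point about handlers, here is a minimal sketch of a SAX handler that collects the text inside the body element. The class and method names (BodyTextExtractor, BodyHandler, extractBodyText) are made up for this example. It uses the JDK's built-in XML parser so that it runs without extra dependencies, which means the input must be well-formed XHTML. For real-world HTML, the assumption is that the same handler could be passed to TagSoup's XMLReader implementation instead; that swap is not shown or tested here.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextExtractor {

    /** SAX handler that collects character data found inside <body>. */
    static class BodyHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // > 0 while the parser is inside <body>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Start counting at <body>, then track every nested element.
            if (bodyDepth > 0 || "body".equalsIgnoreCase(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString().trim();
        }
    }

    /** Parses well-formed XHTML with the JDK parser and returns the body text. */
    static String extractBodyText(String xhtml) throws Exception {
        // Assumption: input is well-formed XML. For tag soup, an HTML-aware
        // XMLReader would have to be substituted here.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        BodyHandler handler = new BodyHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xhtml)));
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>t</title></head>"
                     + "<body><p class=\"intro\">Hello, <b>world</b></p></body></html>";
        System.out.println(extractBodyText(xhtml)); // prints "Hello, world"
    }
}
```

The handler only reacts to SAX events, so it does not care which XMLReader produced them; that is the design choice that makes the parser replaceable.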

score 4 · Answer 2 · answered Oct 19 '11 at 07:53

Generally speaking, you cannot parse HTML with an XML parser:

HTML's element tags are not required to match in all cases. (For example a <p> tag does not require a matching </p> tag.) This will cause terminal indigestion for an XML parser.
Real-world HTML is notorious for not conforming to the HTML spec, let alone to an XML-compatible subset of HTML.

However, if your input document is XHTML, you should in theory be able to use an XML parser, for example through the SAX API. You should even be able to validate the document against the XHTML schema.
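
The difference can be checked with a small sketch like the following. It feeds three snippets to the JDK's SAX parser and reports whether each one parses. The class name WellFormednessCheck and the sample markup are invented for illustration. An unquoted attribute value such as class=main is legal HTML but not well-formed XML, which is the likely kind of input behind the "Open quote is expected" error in the question (an assumption, since the actual file is not shown).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    static boolean parsesAsXml(String markup) throws Exception {
        try {
            // DefaultHandler ignores content events and rethrows fatal errors.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Legal HTML, but the attribute value is not quoted.
        String unquoted = "<html><body><div class=main>text</div></body></html>";
        // Legal HTML, but the <p> elements are never closed.
        String unclosed = "<html><body><p>one<p>two</body></html>";
        // XHTML style: quoted attributes, every element closed.
        String xhtml = "<html><body><div class=\"main\"><p>one</p></div></body></html>";

        System.out.println(parsesAsXml(unquoted)); // false
        System.out.println(parsesAsXml(unclosed)); // false
        System.out.println(parsesAsXml(xhtml));    // true
    }
}
```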

score 2 · Answer 3 · answered Oct 19 '11 at 07:45

Please have a look at HtmlParser. Normally SAX is not a good fit for parsing HTML.

answered Oct 19 '11 at 07:45 by flash

SAX is always a good option for parsing huge amounts of data - such as HTML. Try looking at TagSoup which is quite awesome for doing just that! – slott Jul 11 '15 at 11:38

score -1 · Answer 4 · answered Jan 02 '23 at 18:03

Another HTML parser for Java is jsoup: https://jsoup.org/

answered Jan 02 '23 at 18:03 by S. Doe
