Writing an HTML Parser

Question

I am planning to write a program, as simple as possible, that parses an HTML document into a tree.

After googling I have found many answers saying "don't do it, it's been done" (or words to that effect), references to examples of HTML parsers, and a rather emphatic article on why one shouldn't use regular expressions. However, I haven't found any guides on the "right" way to write a parser. (This, by the way, is something I'm attempting more as a learning exercise than anything, so I'd quite like to do it myself rather than use a premade one.)

I believe I could make a working XML parser just by reading the document and adding the tags/text etc. to the tree, stepping up a level whenever I hit a close tag (again, simple, no fancy threading or efficiency required at this stage.). However, for HTML not all tags are closed.

So my question is this: what would you recommend as a way of dealing with this? The only idea I've had is to treat it in a similar way as the XML but have a list of tags that aren't necessarily closed each with conditions for closure (e.g. ends on or next tag).

Has anyone any other (hopefully better) suggestions? Is there a better way of doing this altogether?

Don't do it! But you will anyway. So have a look at http://jsoup.org — Lukas Eder, Aug 25 '11 at 14:28
Just to keep in mind, you have to worry about more than just tags which don't close themselves, you also have implicit opening tags ( is optional), plus the whole mess of badly-formed HTML code out there which HTML parsers manage to cope with. The HTML5 spec contains quite a specific parsing algorithm. — Matthew Wilson, Aug 25 '11 at 14:36
A very noble exercise :) My suggestion would be to look at the source code of an existing parser in your favourite language. — Richard H, Aug 25 '11 at 14:41
@Ghommey: writing your own code can provide an excellent learning experience, whereas using existing code does not. Furthermore, you can modify it more easily, get exactly the functionality you need, and know exactly who's responsible if something goes wrong. ;) — Andreas Grapentin, Aug 25 '11 at 14:44
Do you want to parse completely valid HTML or real HTML? If you're going for a real HTML parser, you have to be careful of things like assuming all tags are closed. — Michael Mior, Aug 25 '11 at 14:45

score 13 · Answer 1 · answered Aug 25 '11 at 16:23

The looseness of HTML can be accommodated by figuring out the missing open and close tags as needed. This is essentially what a validator like tidy does.

You'll keep a stack (perhaps implicitly with a tree) of the current context. For example, {<html>, <body>} means you're currently in the body of the html document. When you encounter a new node, you compare the requirements for that node to what's currently on the stack.

Suppose your stack is currently just {<html>}. You encounter a <p> tag. You look up <p> in a table that tells you a paragraph must be inside the <body>. Since you're not in the body, you implicitly push <body> onto your stack (or add a body node to your tree). Then you can put the <p> into the tree.

Now suppose you see another <p>. Your rules tell you that you cannot nest a paragraph within a paragraph, so you know you have to pop the current <p> off the stack (as though you had seen a close tag) before pushing the new paragraph onto the stack.

At the end of your document, you pop each remaining element off your stack, as though you had seen a close tag for each one.

The trick is to find a good way to represent the context requirements for each element.
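One possible representation is a pair of lookup tables. The following is a minimal sketch under stated assumptions, not a complete parser: the names REQUIRED_ANCESTOR, CLOSED_BY, VOID, Node, build_tree and outline are made up for illustration, the tables hold only a handful of entries, and the input is assumed to be already split into ("open", name), ("close", name) and ("text", data) tokens.

```python
from dataclasses import dataclass, field

# Hypothetical rule tables; a real parser would need many more entries.
REQUIRED_ANCESTOR = {"head": "html", "body": "html", "p": "body", "ul": "body"}
CLOSED_BY = {"p": {"p"}, "li": {"li"}}  # opening the key closes an open element in the set
VOID = {"br", "img"}                    # elements that never get children


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def build_tree(tokens):
    root = Node("#document")
    stack = [root]  # current context, innermost element last

    def open_element(name):
        # Push a missing ancestor implicitly (e.g. <body> before <p>).
        required = REQUIRED_ANCESTOR.get(name)
        if required and all(n.name != required for n in stack):
            open_element(required)
        # Pop elements that may not contain the new one (e.g. <p> inside <p>).
        while stack[-1].name in CLOSED_BY.get(name, ()):
            stack.pop()
        node = Node(name)
        stack[-1].children.append(node)
        if name not in VOID:
            stack.append(node)

    for kind, value in tokens:
        if kind == "open":
            open_element(value)
        elif kind == "close":
            # Ignore a close tag for something that was never opened.
            if any(n.name == value for n in stack[1:]):
                while stack.pop().name != value:
                    pass
        else:
            stack[-1].children.append(value)  # text is stored as a plain string
    # Anything still on the stack is treated as closed at end of input.
    return root


def outline(node):
    """Render the tree on one line for quick inspection."""
    if isinstance(node, str):
        return repr(node)
    return node.name + "(" + ", ".join(outline(c) for c in node.children) + ")"


tokens = [("open", "html"), ("open", "p"), ("text", "one"),
          ("open", "p"), ("text", "two")]
print(outline(build_tree(tokens)))
```

Keeping the rules in data rather than in code means that supporting another element is mostly a matter of adding table entries, although the real rules have many cases that two flat tables cannot express.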

This is closer. It means HTML parsing involves a lot more than XML parsing plus fixing a few sloppy tags. — duckduckgo, Feb 25 '14 at 04:12

score 9 · Accepted Answer · answered Aug 25 '11 at 14:42

So, I'll try for an answer here.

Basically, what makes "plain" HTML parsing (not talking about valid XHTML here) different from XML parsing is a load of rules, such as <img> tags that are never closed, or, strictly speaking, the fact that even the sloppiest HTML markup will still render somehow in a browser. You will need a validator along with the parser to build your tree. But you'll have to decide which HTML standard you want to support, so that when you come across a weakness in the markup, you'll know it's an error and not just sloppy HTML.

Know all the rules, build a validator, and then you'll be able to build a parser. That's Plan A.

Plan B would be to build some error tolerance into your parser, which makes the validation step unnecessary. For example, parse all the tags and put them in a list, omitting any attributes, so that you can easily operate on the list and determine whether a tag was left open or was never opened at all. From that you eventually get a "good" layout tree, which is an approximation for sloppy markup and exact for correct markup.
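The two passes of Plan B could look roughly like this. It is a naive sketch, not a robust implementation: the function names list_tags and check_balance and the VOID_TAGS set are assumptions made for this example, and the scan will be confused by a ">" inside an attribute value or a comment.

```python
# Elements that never take a close tag (a partial, illustrative list).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


def list_tags(html):
    """First pass: collect tag names in document order, dropping attributes."""
    tags = []
    pos = 0
    while True:
        start = html.find("<", pos)
        if start == -1:
            break
        end = html.find(">", start)
        if end == -1:
            break
        inner = html[start + 1:end].strip()
        pos = end + 1
        if not inner or inner[0] in "!?":
            continue  # skip comments, doctype and processing instructions
        parts = inner.strip("/").split()
        if not parts:
            continue
        prefix = "/" if inner.startswith("/") else ""
        tags.append(prefix + parts[0].lower())
    return tags


def check_balance(tags):
    """Second pass: report tags left open and close tags never opened."""
    open_stack, left_open, never_opened = [], [], []
    for tag in tags:
        if not tag.startswith("/"):
            if tag not in VOID_TAGS:
                open_stack.append(tag)
            continue
        name = tag[1:]
        if name not in open_stack:
            never_opened.append(name)
            continue
        # Everything opened after the matching tag was left open.
        while open_stack[-1] != name:
            left_open.append(open_stack.pop())
        open_stack.pop()
    left_open.extend(reversed(open_stack))  # still open at end of document
    return left_open, never_opened


tags = list_tags("<html><body><p>one<p class='x'>two<br></div></body>")
print(tags)
print(check_balance(tags))
```

Once you know which tags are unclosed or unopened, the tree-building pass can insert or ignore them as it goes.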

Hope that helped!

That's really useful actually. I hadn't thought of putting them in a list like that and using that as a basis for the tree (I was thinking it had to be done in one go, it didn't occur to me to make multiple passes over the document). Thanks ^^ — James, Aug 25 '11 at 15:09

score 8 · Answer 3 · answered Oct 30 '13 at 21:55

Now that the HTML5 standard exists, writing an HTML parser no longer requires trial and error or arcane knowledge.

Instead, you just have to implement the standardized parsing algorithm.

BeniBela

DwB · Answer 4 · 2011-08-25T15:27:11.117

Harsh. Go

HTML is not XML. XHTML is XML. Most websites are HTML; some are XHTML. In XHTML all tags must be closed (or have no body, which is still closed).

If you want to write an HTML parser as a learning experiment, then go for it. If you want to write the next "Greaterest HTML parserer" then give it up. Apache (or somebody else) wins; the important information is: you don't know more than the large groups that specialize in parsing HTML.

To answer the question "How do I deal with this?": read the W3C spec on HTML. It answers your question. If your response is "but I don't want to", then you are actually saying "I'm a lazy goofrocket who wants to pretend to learn". If that is the case, I suggest you delete the post and move on; the Microsoft IE team probably has some documents that will interest you.

Less harsh answer

HTML is not easy to parse. At its loosest, you don't need head or body elements, and a lot of tags do not need to be closed. A basic rule when parsing HTML is: if you encounter a new block element, automatically close the previous block element. You cannot use a standard XML parser for this because HTML is not XML.

Similar to XML, you will need to split your document into elements, including free text elements.

XHTML is much easier because it must be well formed XML. You can use an XML parser for this.

I am aware this will not be the greatest parser ever; this is why I specified it as a learning exercise. I was unaware, however, that the spec had advice on the implementation of a parser, which I will now go and look at. ;) — James, Aug 25 '11 at 15:07

score 4 · Answer 5 · answered Oct 20 '20 at 13:02

Nearly a decade late, but whatever. If not relevant to you, it is to future visitors.

Another option would be to implement the specs.

The WHATWG has a normative specification for HTML. All the quirks are covered there, so you can be sure you haven't forgotten some weird mechanic of HTML (there are a lot).

The specification also contains the section § 13.2 Parsing HTML documents, which outlines how a User Agent (your parser) should parse an HTML document into a DOM tree. All edge cases are already thought of. The most difficult part is choosing the right data structures and program flow in your language of choice to implement it.
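To give a feel for the program flow, the tokenization stage in the spec is written as a state machine. The sketch below borrows four of its state names (data, tag open, end tag open, tag name) but is heavily simplified and not conformant: SKIP_TO_TAG_END is a made-up state standing in for all of the attribute states, and comments, character references and error handling are left out entirely.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    END_TAG_OPEN = auto()
    TAG_NAME = auto()
    SKIP_TO_TAG_END = auto()  # hypothetical: replaces the spec's attribute states


def tokenize(html):
    tokens, text, name = [], [], []
    state, is_end = State.DATA, False

    def flush_text():
        if text:
            tokens.append(("text", "".join(text)))
            text.clear()

    def emit_tag():
        flush_text()
        tokens.append(("end" if is_end else "start", "".join(name)))

    i = 0
    while i < len(html):
        ch = html[i]
        i += 1
        if state is State.DATA:
            if ch == "<":
                state = State.TAG_OPEN
            else:
                text.append(ch)
        elif state is State.TAG_OPEN:
            if ch == "/":
                state = State.END_TAG_OPEN
            elif ch.isalpha():
                is_end, name, state = False, [ch.lower()], State.TAG_NAME
            else:
                text.append("<")  # not a tag after all
                i -= 1            # reconsume this character as data
                state = State.DATA
        elif state is State.END_TAG_OPEN:
            if ch.isalpha():
                is_end, name, state = True, [ch.lower()], State.TAG_NAME
            else:
                text.append("</")  # simplification: keep it as text
                i -= 1
                state = State.DATA
        elif state is State.TAG_NAME:
            if ch == ">":
                emit_tag()
                state = State.DATA
            elif ch.isspace() or ch == "/":
                state = State.SKIP_TO_TAG_END
            else:
                name.append(ch.lower())
        else:  # SKIP_TO_TAG_END: ignore attributes until the tag closes
            if ch == ">":
                emit_tag()
                state = State.DATA
    flush_text()
    return tokens


print(tokenize("<p class='a'>Hi <b>there</b><br/>1 < 2"))
```

The real algorithm has many more states and a separate tree construction stage driven by "insertion modes", but the overall shape (one state variable, one character at a time, emit tokens as you go) is the same.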

Good luck and keep your spirit, reader!

score -2 · Answer 6 · answered Oct 30 '13 at 21:47

Have you tried to use this library : http://simplehtmldom.sourceforge.net/ ?

F.

guilb
