
I'm writing my own implementation of an HTML parser in Java. I have finished the lexer and moved on to coding the parser. I'm building a DOM tree and I would like to determine whether my HTML is properly constructed.

E.g., I have an img tag, which is a void tag according to the W3C HTML syntax, so it does not need an end tag.

On the other hand, most tags, like body and head, must have an end tag.

My question is: what is the proper way to handle this?

I don't need a tool or an external site that does the checking for me; I'm asking how to determine it myself.

Mario Cervera
dswiecki
  • I might not be an HTML expert, but isn't there a limited number of **void** elements with reserved keywords? http://www.programmerinterview.com/index.php/html5/void-elements-html5/ – AndrewMcCoist Jan 11 '16 at 21:34
  • @AndrewMcCoist there is, so probably a List with exceptions will do. – pietv8x Jan 11 '16 at 21:36
  • You'd have to know which tags are `void`, so you need an explicit list. Also note that some tags (e.g. [`<p>`](https://www.w3.org/TR/html-markup/p.html#p)) have *optional* end tags. Fun, fun! – Andreas Jan 11 '16 at 21:36
  • `` and `` are more exceptions – pietv8x Jan 11 '16 at 21:37
  • nough said: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – shapiro yaacov Jan 11 '16 at 22:08

2 Answers


You're dealing with HTML, so the set of tags is limited and known in advance. You can keep a fixed list of the void tags and check each tag against it.
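As a minimal sketch of such a list, the lookup could be a constant set. The class name `VoidElements` and the method `isVoid` are made up for illustration, the element names are the void elements as I understand the current HTML standard to list them (please check them against the spec version you target), and `Set.of` assumes Java 9 or later.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of any library.
final class VoidElements {

    // Assumed list of void elements; verify against the spec you target.
    private static final Set<String> VOID = Set.of(
            "area", "base", "br", "col", "embed", "hr", "img",
            "input", "link", "meta", "source", "track", "wbr");

    private VoidElements() { }

    // HTML tag names are case-insensitive, so normalize before the lookup.
    static boolean isVoid(String tagName) {
        return VOID.contains(tagName.toLowerCase(Locale.ROOT));
    }
}
```

With this in place, the parser can skip the end-tag check whenever `VoidElements.isVoid(name)` returns true.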

For the rest of the tags, I suggest the following algorithm:

  1. Get the next tag. (a) If it is an opening tag, for example <body>, push it onto the stack. (b) If it is a closing tag, go to step 2. (c) If there are no more tags left to parse, your HTML is valid, provided the stack is empty; any tag still on the stack was never closed.

  2. Pop tags from the stack one by one. (a) If you encounter another opening tag before reaching the opening pair of your current tag, your HTML structure is corrupt. (b) If you empty the stack without finding a pair for your current closing tag, then again your HTML is corrupt. (c) If you reach the opening pair of your current tag without hitting case (a) or (b), go back to step 1.

This is rough pseudocode, but I hope you get the idea. I can write an implementation in Java/C# if necessary.
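For illustration, steps 1 and 2 could look roughly like this in Java. This is only a sketch under some assumptions: the lexer is assumed to hand over tag names as plain strings, with a leading `/` marking a closing tag (a made-up token format), void tags are assumed to be filtered out beforehand, and optional end tags are not handled. The names `TagBalanceChecker` and `isWellNested` are hypothetical, and `List.of` needs Java 9 or later.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical checker for the stack algorithm described above.
final class TagBalanceChecker {

    private TagBalanceChecker() { }

    // Assumed token format: "body" opens a tag, "/body" closes it.
    static boolean isWellNested(List<String> tags) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            if (!tag.startsWith("/")) {
                open.push(tag);              // step 1(a): opening tag
                continue;
            }
            String name = tag.substring(1);  // step 1(b): closing tag
            if (open.isEmpty()) {
                return false;                // step 2(b): nothing left to pair with
            }
            if (!open.pop().equals(name)) {
                return false;                // step 2(a): a different tag is still open
            }
        }
        return open.isEmpty();               // step 1(c): unclosed tags make it invalid
    }

    public static void main(String[] args) {
        System.out.println(isWellNested(List.of("html", "body", "/body", "/html"))); // true
        System.out.println(isWellNested(List.of("html", "body", "/html", "/body"))); // false
    }
}
```

Because case (a) fails as soon as the top of the stack is a different tag, a single pop per closing tag is enough here.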

george.zakaryan

This is a classic "job interview problem" at many software companies: checking whether a String (in this case, your HTML code) is balanced with respect to some symbols (in this case, HTML tags). The problem is solved with a Stack. As you process the String, you "push" each opening tag; for each closing tag, you "pop" and check that the popped tag matches it. Your HTML code is balanced if no error is found during the analysis and the Stack is empty at the end. The function below shows the same idea for a simpler case: it checks whether a String is balanced in terms of parentheses.

import java.util.ArrayDeque;
import java.util.Deque;

private boolean isBalanced(String s) {
    // A typed Deque replaces the raw, legacy Stack class.
    Deque<Character> symbolStack = new ArrayDeque<>();

    for (int i = 0; i < s.length(); i++) { // Processing the input string ...
        char c = s.charAt(i);

        if (c == '(') { // Opening parenthesis --> push
            symbolStack.push(c);
        } else if (c == ')') { // Closing parenthesis ...
            if (symbolStack.isEmpty()) { // Error: no opening parenthesis to match
                return false;
            }
            // Only '(' is ever pushed, so the popped symbol is always the match.
            symbolStack.pop();
        }
    }

    // No error and empty stack --> balanced string
    return symbolStack.isEmpty();
}
Mario Cervera