How can I determine if an HTML document is well formed or not in Java?

Question

Hey guys, I need to determine whether a given HTML document is well formed.
I just need a simple implementation using only Java core API classes, i.e. no third-party libraries such as JTidy. Thanks.

What I actually need is an algorithm that scans a list of tags. If it finds an open tag and the next tag isn't its corresponding close tag, then the next tag should be another open tag, which in turn should be followed either by its own close tag or by yet another open tag, and so on. The close tags of the earlier open tags should then follow in reverse order. I've already written methods to convert a tag to its close tag. If the list conforms to this order, the method should return true; otherwise it should return false.

Here is the skeleton code of what I've started working on. It's not too neat, but it should give you a basic idea of what I'm trying to do.

public boolean validateHtml(){

    ArrayList<String> tags = fetchTags();
    //fetchTags returns this [<html>, <head>, <title>, </title>, </head>, <body>, <h1>, </h1>, </body>, </html>]

    //I create another ArrayList to store tags that I haven't found its corresponding close tag yet
    ArrayList<String> unclosedTags = new ArrayList<String>();

    String temp;

    for (int i = 0; i < tags.size(); i++) {

        temp = tags.get(i);

        if(!tags.get(i+1).equals(TagOperations.convertToCloseTag(tags.get(i)))){
            unclosedTags.add(tags.get(i));
            if(){

            }

        }else{
            return true;//well formed html
        }
    }

    return true;
}
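The ordering rule described above is the usual stack check for balanced brackets: push on every open tag, pop on every close tag, and require each close tag to match the most recent open one. A minimal sketch, assuming the tags carry no attributes (as in the sample list) and that there are no void elements such as `<br>`. The names `TagBalanceChecker`, `isBalanced` and `toCloseTag` are hypothetical; `toCloseTag` only stands in for the `TagOperations.convertToCloseTag` method mentioned in the question.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class TagBalanceChecker {

    /** Returns true if every open tag is closed, in reverse order of opening. */
    static boolean isBalanced(List<String> tags) {
        // Stack of the close tags we still expect to see, innermost on top.
        Deque<String> expectedCloseTags = new ArrayDeque<String>();
        for (String tag : tags) {
            if (tag.startsWith("</")) {
                // A close tag must match the most recently opened tag.
                if (expectedCloseTags.isEmpty() || !expectedCloseTags.pop().equals(tag)) {
                    return false;
                }
            } else {
                expectedCloseTags.push(toCloseTag(tag));
            }
        }
        // Anything left on the stack was opened but never closed.
        return expectedCloseTags.isEmpty();
    }

    /** Hypothetical helper: "<title>" becomes "</title>". */
    private static String toCloseTag(String openTag) {
        return "</" + openTag.substring(1);
    }
}
```

Called with the list shown in the skeleton, `TagBalanceChecker.isBalanced(tags)` would return true; a list such as `[<b>, <i>, </b>, </i>]` would return false because `</b>` arrives while `</i>` is still expected.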

Or, in short, Java core API does not offer support for html parsing or validating. So we can't offer a simple implementation, if you reject "third party stuff" right from the start. — Andreas Dolk, Mar 01 '11 at 12:50
Oh come on guys, this will only take a few lines of code that fit nicely in an answer, won't it? — Daniel, Mar 01 '11 at 12:51
yeah, I've actually developed an algorithm, but my problem here is I'm not good with String manipulation. If only there was a way I could fetch out every html TAG from the HTML document, then I think I'd be able to sort it out. — sacretruth, Mar 01 '11 at 12:58
If I had a dollar for each algorithm that I "developed" on paper, but never implemented... — thkala, Mar 01 '11 at 13:08
@kooldave98 - well-formedness is more than just looking at tags. Keep in mind that a page can contain a `CDATA` section holding an example of non-well-formed HTML... the task is far from a trivial matter of regexp'ing the HTML tags. — Andreas Dolk, Mar 01 '11 at 13:08
You cannot get the tags without a full blown HTML parser - and there is nothing simple about an HTML parser. Even if it was (relatively) simple, why are you so intent on reinventing the wheel by rejecting third-party solutions out of hand? — thkala, Mar 01 '11 at 13:17

Eamonn McEvoy · Accepted Answer · 2011-03-01T13:54:51.717

1

Yeah, string manipulation can seem like a pickle sometimes. You need to do something like this.

First, copy the HTML into a char array:

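boolean tag = false;
StringBuilder str = new StringBuilder();
List<String> htmlTags = new ArrayList<String>(); // needs java.util.ArrayList and java.util.List

for (int i = 0; i < array.length; i++)
{
  //Check for the start of a tag
  if (array[i] == '<')
  {
    tag = true;
  }

  //If the current char is part of a tag, keep copying it
  if (tag)
  {
    str.append(array[i]);
  }

  //When a tag ends, add the tag to your tag list and reset the buffer
  if (tag && array[i] == '>')
  {
    htmlTags.add(str.toString());
    str.setLength(0);
    tag = false;
  }
}

Something like this should get you started: you should end up with a list of tags. It is only a rough sketch. It assumes `array` is a char[] holding the document, and it does not handle comments, CDATA sections or a '>' inside an attribute value.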

edited Mar 01 '11 at 13:54

answered Mar 01 '11 at 13:14

Eamonn McEvoy


Thanks a lot... It worked perfectly, and I am already working on the algorithm to determine the well-formedness of an HTML document :-) – sacretruth Mar 01 '11 at 16:08
Do you think you could help me out with an algorithm that scans a list of tags? If it finds an open tag and the next tag isn't its corresponding close tag, then the next tag should be another open tag, which in turn should be followed either by its own close tag or by yet another open tag, and so on, with the close tags of the earlier open tags following in reverse order. I've already written methods to convert a tag to its close tag. If the list conforms to this order, it should return true; otherwise false. – sacretruth Mar 01 '11 at 19:52

score 0 · Answer 2 · answered Mar 01 '11 at 12:50

I don't think you can do this without a huge amount of work. It would be much easier to use a third-party package.

Eamonn McEvoy


yeah, I've actually developed an algorithm, but my problem here is I'm not good with String manipulation. If only there was a way I could fetch out every html TAG from the HTML document, then I think I'd be able to sort it out. – sacretruth Mar 01 '11 at 12:59
check out my other post, it might get you started :) – Eamonn McEvoy Mar 01 '11 at 13:43

score 0 · Answer 3 · answered Mar 01 '11 at 13:47

Try validating against one of the HTML 4, HTML 4.01 or XHTML 1 DTDs:

"strict.dtd"
"loose.dtd"
"frameset.dtd"

That might help!
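For documents written as XHTML, the JDK's built-in XML parser can at least report well-formedness with core classes only. This is a minimal sketch, not full DTD validation: it assumes the input is XHTML (plain HTML 4.01 is SGML-based and will often fail to parse as XML), and it deliberately answers any DTD lookup with an empty document so that nothing is fetched over the network. One consequence is that named entities such as `&nbsp;` would be reported as errors. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlWellFormedCheck {

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Assumption: skip the real DTD by resolving every external entity to empty content.
            builder.setEntityResolver(new EntityResolver() {
                public InputSource resolveEntity(String publicId, String systemId) {
                    return new InputSource(new StringReader(""));
                }
            });
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here.
            return false;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Actual validation against one of the DTDs would additionally need `setValidating(true)` on the factory and a resolver that supplies the real DTD file, which is left out here.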

Ratna Dinakar

