
I need to parse multiple (roughly 1,600) HTML pages and pull out the contents of the following tag from each file.

    textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE

(This is actually meant to be an HTML textarea tag.) I had thought I could use a DOM parser, but the files contain too many errors. I then came across JTidy in another question here on Stack Overflow and tried to use that...

But JTidy doesn't seem to be able to convert the HTML from any of the pages into XHTML, which I would need before I could use a DOM parser.

I then thought I could use a regex, but I couldn't quite find the particular expression needed to pull out that text, and I also came across multiple questions and answers that said NOT to use regex to parse HTML...

So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?

John McDonnell

1 Answer


You should be able to parse your documents with JTidy directly, without having to convert them to XHTML first. I have done this on several occasions; granted, that was a while ago, but it worked fine for me, even with quite ugly HTML.

EDIT: Another option I looked at the last time I needed to parse HTML files was TagSoup. I couldn't use it in a commercial product because of its GPL licence, but if you just need this functionality for an internal tool, it might work for you.

Olaf
  • I'm going to accept this as the answer since I think that TagSoup is the way to go, although I wasn't able to get it working. I managed to solve my problem by going back to regex, and I found a pattern that works for me... – John McDonnell Aug 28 '11 at 08:45
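
The comment does not say which pattern ended up working, so the following is only an illustrative sketch of the regex route, not the commenter's actual solution. It uses nothing but `java.util.regex` and assumes that the target is the textarea with `id="line"`, that no attribute value inside the opening tag contains a `>` character, and that the closing tag may be missing in broken pages. The class name `TextareaExtractor` and the method `extract` are hypothetical names chosen for this example.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the body of <textarea id="line"> out of messy HTML.
class TextareaExtractor {

    // Matches an opening textarea tag whose attributes include id="line"
    // (single or double quotes, any letter case) and lazily captures everything
    // up to the closing tag, or up to the end of input if the tag is never closed.
    // Assumption: no attribute value in the opening tag contains '>'.
    private static final Pattern LINE_TEXTAREA = Pattern.compile(
        "<textarea\\b[^>]*\\bid\\s*=\\s*[\"']line[\"'][^>]*>(.*?)(?:</textarea\\s*>|\\z)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the trimmed textarea body, or an empty Optional if no match is found.
    static Optional<String> extract(String html) {
        Matcher m = LINE_TEXTAREA.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // Sample input modelled on the tag in the question, followed by broken markup.
        String sample = "<td><textarea name=\"line\" cols=\"66\" id=\"line\" "
            + "onclick=\"storeCaret(this);\">TEXT I WANT IS HERE</textarea><p>unclosed";
        System.out.println(extract(sample).orElse("(not found)"));
    }
}
```

A pattern this narrow only works because a single, well-known tag is being targeted; it does not decode HTML entities such as `&amp;` in the captured text, and it would need adjusting if the pages vary the `id` attribute. For the roughly 1,600 files, each file could be read into a string and passed to `extract` in a loop.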