I need to parse multiple (approximately 1600) HTML pages and pull out the contents of the following tag from each file:
<textarea name="line" cols="66" rows="5" class="textbox" id="line" style="font-size:12px;" onkeydown="textCounter()" onkeyup="textCounter(); storeCaret(this);" onselect="storeCaret(this);" onclick="storeCaret(this);">TEXT I WANT IS HERE
(This is an HTML textarea tag.) I had thought I could use a DOM parser, but the files contain too many errors. I then came across JTidy in another question here on Stack Overflow, and I have tried to use that.
But JTidy doesn't seem to be able to convert the HTML from any of the pages into XHTML, so I still can't use a DOM parser on the result.
I then thought I could use a regex, but I couldn't find the particular expression needed to pull out that text, and I also came across multiple questions and answers which said NOT to use regex to parse HTML.
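To illustrate the kind of regex attempt I mean, here is a minimal sketch in Java. The class name `TextareaExtractor` and the method `extract` are hypothetical names made up for this example. It assumes each file has at most one textarea with `id="line"`, that no attribute value before the end of the opening tag contains a `>` character, and it takes everything up to the next `</textarea` (or to the end of the file if the closing tag is missing). I am not confident it is robust against the errors in my files.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextareaExtractor {
    // Matches an opening <textarea ...> tag that carries id="line", then lazily
    // captures everything up to the next "</textarea" or the end of the input.
    // Assumption: no attribute value inside the opening tag contains '>'.
    private static final Pattern LINE_TEXTAREA = Pattern.compile(
            "<textarea\\b[^>]*\\bid\\s*=\\s*\"line\"[^>]*>(.*?)(?:</textarea|\\z)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the raw text inside the first matching textarea, if there is one.
    public static Optional<String> extract(String html) {
        Matcher m = LINE_TEXTAREA.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical sample input, shortened from the tag shown above.
        String sample = "<body><textarea name=\"line\" cols=\"66\" rows=\"5\" "
                + "class=\"textbox\" id=\"line\">TEXT I WANT IS HERE</textarea></body>";
        System.out.println(extract(sample).orElse("(no match)"));
    }
}
```

This only pattern-matches on the raw text and does not decode HTML entities or understand nesting, which is exactly why I am wary of relying on it.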
So essentially my question is: is there any other approach I can take to get the text I need from malformed HTML?