I need to detect sentence boundaries in HTML. There is plenty of sentence boundary detection software out there (I'm using java.text.BreakIterator), but all of it assumes plain text. HTML is richer than that and includes clues about where sentences break.
For example, <p>, <ul>/<li>, <td> and other tags mark sentence boundaries, or at least indicate that a sentence probably doesn't extend across them. <b>, <i>, <em>, <span>, <a> and a few other tags could appear inside a sentence.
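To make the idea concrete, here is a minimal sketch of one way it could work: treat block-level tags as hard boundaries, drop inline tags, and run java.text.BreakIterator on each block separately. The class name HtmlSentenceSplitter, the split method, and the list of block tags are all hypothetical and chosen only for illustration. It also assumes a regex is good enough to find tags, which will not hold for real-world HTML (comments, scripts, and entities are not handled), so a proper HTML parser would be needed in practice.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

public class HtmlSentenceSplitter {
    // Hypothetical, incomplete list of tags treated as hard sentence boundaries.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "(?i)</?(p|ul|ol|li|td|th|tr|table|div|br|h[1-6])\\b[^>]*>");
    // Every other tag (<b>, <i>, <em>, <span>, <a>, ...) is assumed to be inline.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    public static List<String> split(String html) {
        List<String> sentences = new ArrayList<>();
        // Collapse whitespace first so '\n' can safely serve as the block marker.
        String normalized = html.replaceAll("\\s+", " ");
        String marked = BLOCK_TAG.matcher(normalized).replaceAll("\n");
        for (String block : marked.split("\n")) {
            // Drop inline tags but keep their text inside the surrounding sentence.
            String text = ANY_TAG.matcher(block).replaceAll("")
                    .replaceAll("\\s+", " ").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Ordinary plain-text sentence detection, confined to a single block.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    sentences.add(sentence);
                }
            }
        }
        return sentences;
    }
}
```

With this approach, two adjacent <li> items without closing punctuation come out as separate sentences, while text wrapped in <b> stays part of the sentence around it. What it does not do is feed the markup to the detector as a feature; it only stops sentences from crossing block boundaries.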
Is anyone aware of any software that takes advantage of HTML markup, in addition to the normal NLP stuff, in determining sentence boundaries?