
I need to detect sentence boundaries in HTML. There is lots of sentence boundary detection software out there (java.text.BreakIterator is the one I'm using), but all of it assumes plain text. HTML is richer than that, and includes some clues as to where sentences break.

For example, <p>, <ul>/<li>, <td> and other block-level tags mark sentence boundaries, or at least indicate that a sentence probably doesn't extend across them. <b>, <i>, <em>, <span>, <a> and a few other inline tags can appear inside a sentence.

Is anyone aware of any software that takes advantage of HTML markup, in addition to the normal NLP stuff, in determining sentence boundaries?

hippietrail
ccleve
  • Is it an option to do some preprocessing? Like replace all container tags (...) with a . and strip out all other tags (... regex: <.+?>) to get _almost_ plain text.
    – Jason Dunkelberger Jul 25 '12 at 17:49
  • Yes, I can preprocess. The question is, how? Which tags mean what? Are there other syntactic considerations in HTML that I haven't thought of? I'm looking for a solution to the problem that someone else has already thought through. – ccleve Jul 25 '12 at 19:37
  • See my answer in http://stackoverflow.com/questions/11236328/determining-paragraphs-from-sentence-location-within-an-html-document/ then after you get the content text, you can proceed with using the usual sentence splitters and tokenizers. – Kenston Choi Jul 26 '12 at 12:10
  • Can you explain what you mean by sentence boundary? You could just make an array of such tags and find them using indexOf, or split the whole document by them. – mrd081 Jul 25 '12 at 17:23
  • Sentence boundary disambiguation: http://en.wikipedia.org/wiki/Sentence_boundary_disambiguation It's a well-known problem. – ccleve Jul 25 '12 at 19:33

1 Answer

The solution I implemented was:

1. Split the document into separate blocks on all HTML tags except the inline tags (<i>, <b>, <span>, etc.).
2. Strip the inline tags from each block.
3. Look for sentences within each block using traditional methods.
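The three steps above can be sketched roughly as follows. This is only an illustration, not the code I actually ran: the class name HtmlSentenceSplitter, the sentences method, and the particular list of inline tags are all assumptions made for the example, and the regex-based tag matching is a shortcut that ignores comments, <script>/<style> content, and HTML entities. A real HTML parser would be more robust for step 1.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper class; the name and the tag list are assumptions.
class HtmlSentenceSplitter {
    // Tags assumed to be allowed inside a sentence (illustrative, not exhaustive).
    private static final String INLINE = "i|b|em|strong|span|a|u|small|sub|sup|code";

    // Any tag that is NOT one of the inline tags is treated as a block boundary.
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "<(?!/?(?:" + INLINE + ")\\b)[^>]+>", Pattern.CASE_INSENSITIVE);

    // Opening or closing inline tags, with optional attributes.
    private static final Pattern INLINE_TAG = Pattern.compile(
            "</?(?:" + INLINE + ")\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static List<String> sentences(String html) {
        List<String> result = new ArrayList<>();
        // Step 1: split the document into blocks on every non-inline tag.
        for (String block : BLOCK_TAG.split(html)) {
            // Step 2: strip the inline tags from the block.
            String text = INLINE_TAG.matcher(block).replaceAll("").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Step 3: ordinary plain-text sentence detection within the block.
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(text);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String sentence = text.substring(start, end).trim();
                if (!sentence.isEmpty()) {
                    result.add(sentence);
                }
            }
        }
        return result;
    }
}
```

Calling HtmlSentenceSplitter.sentences on a fragment such as "<p>Hello <b>world</b>. How are you?</p><ul><li>First item</li><li>Second item</li></ul>" keeps the bold text inside its sentence, while each <li> becomes its own block even though it has no closing punctuation.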

ccleve