
I am trying to parse sentences out of a huge amount of text using Java. I started off with NLP tools like OpenNLP and Stanford's Parser.

But here is where I get stuck: though both of these parsers are pretty great, they fail when it comes to non-uniform text.

For example, in my text most sentences are delimited by a period, but in some cases, like bullet points, they aren't. Here both parsers fail miserably.

I even tried setting the option in the Stanford parser for multiple sentence terminators, but the output was not much better!

Any ideas?

Edit: To make it simpler, I am looking to parse text where the delimiter is either a new line ("\n") or a period (".") ...
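Since the two delimiters are known, a first pass can be as simple as a regular-expression split. A minimal Java sketch (the class name and sample text are mine, and note that this will still break on abbreviations like "Dr." and on decimal numbers):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SimpleSplitter {
    // Split on '.' or newline, trimming whitespace and dropping empty fragments.
    public static List<String> split(String text) {
        return Arrays.stream(text.split("[.\n]+"))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence\nA bullet point without a period\nLast one.";
        for (String s : split(text)) {
            System.out.println(s);
        }
    }
}
```

This drops the terminators themselves; if you need to keep them, use a lookbehind such as `(?<=\.)|\n` instead.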

hippietrail

5 Answers


First you have to clearly define the task. What, precisely, is your definition of 'a sentence?' Until you have such a definition, you will just wander in circles.

Second, cleaning dirty text is usually a rather different task from 'sentence splitting'. The various NLP sentence chunkers assume relatively clean input text. Getting from HTML, extracted PowerPoint, or other noise to text is another problem.

Third, Stanford and other large caliber devices are statistical. So, they are guaranteed to have a non-zero error rate. The less your data looks like what they were trained on, the higher the error rate.

bmargulies
  • Makes a lot of sense. Just made me realize that I have to clean my data and then feed it into the parsers. (Now to look for a library to help me with data cleaning) – Roopak Venkatakrishnan Dec 15 '11 at 11:29

Write a custom sentence splitter. You could use something like the Stanford splitter as a first pass and then write a rule-based post-processor to correct mistakes.

I did something like this for biomedical text I was parsing. I used the GENIA splitter and then fixed stuff after the fact.

EDIT: If you are taking HTML as input, you should preprocess it first, for example by handling bulleted lists and the like. Then apply your splitter.
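A sketch of such a rule-based post-processor (the abbreviation list and the short-fragment heuristic are illustrative assumptions, not a complete rule set): merge a fragment back into the previous sentence when the splitter broke right after a known abbreviation, or when it left a one- or two-word fragment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BoundaryFixer {
    // Abbreviations that should never end a sentence (extend for your domain).
    private static final Set<String> ABBREVIATIONS =
            new HashSet<>(Arrays.asList("Dr.", "Mr.", "Mrs.", "etc.", "e.g.", "i.e."));

    // Merge a fragment into the previous sentence when the previous one
    // ends in a known abbreviation or is suspiciously short.
    public static List<String> fix(List<String> sentences) {
        List<String> fixed = new ArrayList<>();
        for (String s : sentences) {
            boolean merge = false;
            if (!fixed.isEmpty()) {
                String prev = fixed.get(fixed.size() - 1);
                String lastToken = prev.substring(prev.lastIndexOf(' ') + 1);
                merge = ABBREVIATIONS.contains(lastToken)
                        || prev.split("\\s+").length <= 2;
            }
            if (merge) {
                int last = fixed.size() - 1;
                fixed.set(last, fixed.get(last) + " " + s);
            } else {
                fixed.add(s);
            }
        }
        return fixed;
    }
}
```

For biomedical text the abbreviation list will be much longer, and the length heuristic can misfire on genuinely short sentences, so tune both against your data.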

nflacco
  • This is what I thought to do, but I had problems since the Stanford parser removes all the \n chars in the sentences. Still trying to find some way to work without them. – Roopak Venkatakrishnan Dec 15 '11 at 11:25
  • @nflacco, this is exactly my same situation! I'm doing sentence splitting on the GENIA dataset using Stanford CoreNLP, but sometimes it fails at detecting sentence boundaries. I'm thinking of post-processing by testing the regexp `\.\s+[A-Z]`. Do you agree? – Alphaaa Jun 04 '13 at 16:32
  • Exactly. You just need to make a list of common abbreviations (Mr., Dr., etc.), and combined with the regex you should cover 99% of the broken sentence boundaries. You can also look at sentence length. The common case I saw was that Dr. or some medical abbreviation was treated as a sentence. Come on! Sentences don't have 1 or 2 words. A few simple rules fix this nicely. – nflacco Jun 04 '13 at 20:36

If you would like to stick with Stanford NLP or OpenNLP, then you'd better retrain the model. Almost all of the tools in these packages are machine-learning based. Only with customized training data can they give you an ideal model and performance.

Here is my suggestion: manually split the sentences based on your criteria. I guess a couple of thousand sentences is enough. Then call the API or the command line to retrain the sentence splitter. Then you're done!
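For OpenNLP, the trainer expects one sentence per line, with an empty line marking a document boundary. A minimal sketch for writing manually split sentences into that format (the class name and file name are mine):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class TrainingDataWriter {
    // Write one sentence per line, the format OpenNLP's
    // SentenceDetectorTrainer expects.
    public static void write(List<String> sentences, String path) throws IOException {
        Files.write(Paths.get(path),
                (String.join("\n", sentences) + "\n").getBytes(StandardCharsets.UTF_8));
    }
}
```

You can then retrain from the command line with something like `opennlp SentenceDetectorTrainer -lang en -data sentences.train -encoding UTF-8 -model custom-sent.bin` (check the exact flags against your OpenNLP version).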

But first of all, one thing you need to figure out is, as said in a previous answer: "First you have to clearly define the task. What, precisely, is your definition of 'a sentence?'"

I'm using Stanford NLP and OpenNLP in my project, Dishes Map, a delicious-dishes discovery engine based on NLP and machine learning. They're working very well!

WDong

There's one more excellent toolkit for natural language processing: GATE. It has a number of sentence splitters, including the standard ANNIE sentence splitter (which doesn't fit your needs completely) and a RegEx sentence splitter. Use the latter for any tricky splitting.

The exact pipeline for your purpose is:

  1. Document Reset PR.
  2. ANNIE English Tokenizer.
  3. ANNIE RegEx Sentence Splitter.

You can also use GATE's JAPE rules for even more flexible pattern matching. (See TAO for the full GATE documentation.)

ffriend

For a similar case, what I did was separate the text into different lines (split at new lines) based on where I wanted the text to break. In your case that means the bulleted text (or, if you are working with HTML, text ending with a line-break tag), which also handles the analogous HTML problem. After separating the text into lines, you can send each individual line to the sentence detector, which will be more accurate.
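The idea above can be sketched like this: split on line breaks first, then run a sentence detector on each line. Here `java.text.BreakIterator` stands in for OpenNLP or Stanford (the class name is mine):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LineAwareSplitter {
    // Split on line breaks first (bullets, headings), then run a sentence
    // detector on each line separately.
    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        for (String line : text.split("\\R+")) {
            line = line.trim();
            if (line.isEmpty()) continue;
            BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
            it.setText(line);
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                String s = line.substring(start, end).trim();
                if (!s.isEmpty()) sentences.add(s);
            }
        }
        return sentences;
    }
}
```

Because every line is processed in isolation, a bullet point without a period still comes out as its own sentence, and the detector never sees a boundary spanning two lines.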