2

I am new to NLP and am looking for a starting point: tutorials, documentation, or example code. I have been asked to research ways of processing natural-language text to extract structured data from it. For example, I want to extract (annotate) height and weight from statements such as "He is 6 feet tall and weighs 200 pounds" or "His height is 6 feet and weight is 200". I have looked into UIMA, but it seems to be a hand-built regex dictionary with no training capabilities. In a nutshell: what Java framework can I use to create an annotation engine that can also be trained? Any help or pointers would be much appreciated. Thanks.

Sap
  • btw. To learn about the state of the art in information extraction techniques, I would recommend reading a recent survey by Sunita Sarawagi: http://osm.cs.byu.edu/CS652s09/papers/Sarawagi.ieSurvey.pdf – Skarab Dec 01 '10 at 17:49

3 Answers

5

Since you asked for pointers: LingPipe (already mentioned above), OpenNLP, and Stanford NLP distributions.

Note: if Python is an option, you can use the Natural Language Toolkit.
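Before committing to one of these libraries, it can help to have a hand-written baseline to compare a trained annotator against. The following is a minimal sketch using only `java.util.regex` from the JDK; it does not use LingPipe, OpenNLP, or Stanford NLP, and it is not trainable. The class name `HeightWeightBaseline`, the method `extract`, and both patterns are hypothetical and cover little more than the two example sentences from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical baseline: rule-based extraction, no training involved.
public class HeightWeightBaseline {

    // A number followed by a length unit, e.g. "6 feet".
    private static final Pattern HEIGHT = Pattern.compile(
            "(\\d+(?:\\.\\d+)?)\\s*(?:feet|foot|ft)\\b",
            Pattern.CASE_INSENSITIVE);

    // Either a number followed by a weight unit ("200 pounds", group 1),
    // or "weighs 200" / "weight is 200" with no unit (group 2).
    private static final Pattern WEIGHT = Pattern.compile(
            "(\\d+(?:\\.\\d+)?)\\s*(?:pounds?|lbs?)\\b"
            + "|weigh(?:s|t)\\s+(?:is\\s+)?(\\d+(?:\\.\\d+)?)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first height and weight values found, as strings.
    public static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();

        Matcher h = HEIGHT.matcher(text);
        if (h.find()) {
            fields.put("height", h.group(1));
        }

        Matcher w = WEIGHT.matcher(text);
        if (w.find()) {
            // Exactly one of the two capture groups is set per match.
            String value = w.group(1) != null ? w.group(1) : w.group(2);
            fields.put("weight", value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Both sentences print {height=6, weight=200}
        System.out.println(extract("He is 6 feet tall and weighs 200 pounds"));
        System.out.println(extract("His height is 6 feet and weight is 200"));
    }
}
```

A baseline like this breaks as soon as the phrasing varies (for example "six foot two"), which is the gap a trained annotator is meant to close.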

Sujith Surendranathan
  • +1, the best start is to go for NLP programming frameworks, because, at this stage, a beginner does not need to waste time getting into the architectural details of solutions such as GATE or Apache UIMA. – Skarab Dec 01 '10 at 11:56
  • @Skarab I disagree. @NLP states he wants to create an annotation engine for fact extraction, and that's exactly what GATE and UIMA are designed for. The libraries mentioned above will do lexical and syntactic analysis, but there's still a lot of work to do after that. – Stompchicken Dec 01 '10 at 15:33
  • @StompChicken Recently I guided a student project and the participants decided to use Apache UIMA. It took them a lot of time before they learnt enough to build their first real extraction pipelines. Personally I use UIMA and can recommend it, but only after getting some first experience with LingPipe or the Natural Language Toolkit. – Skarab Dec 01 '10 at 17:42
  • @Skarab I can't argue with the fact that UIMA is very heavyweight and hard to get started with. I do think GATE is a lot easier in that regard, by the way. I just think it's necessary in order to build a system flexible enough to actually do something useful. – Stompchicken Dec 01 '10 at 20:31
3

If you really want to use machine learning to train your annotator, then GATE is probably your best bet. Take a look at the chapter on machine learning in their guide.

Stompchicken
0

I'd use NER (named entity recognition). Here is the output I see for your input text: [image: NER output]

You can try it here: http://deagol.cs.illinois.edu:8080

Daniel