New to NLP, Question about annotation

Question

I am new to NLP and am looking for a starting point: tutorials, documentation, or example code. I have been asked to research ways of processing natural-language text to extract structured data from it. For example, I want to extract (annotate) height and weight from statements such as "He is 6 feet tall and weighs 200 pounds" or "His height is 6 feet and weight is 200". I have looked into UIMA, but it seems to be a hand-built regex dictionary with no training capabilities. So, in a nutshell: what Java framework can I use to create an annotation engine that can also be trained? Any help or pointers would be greatly appreciated. Thanks.

By the way, to learn about the state of the art in information extraction techniques, I would recommend reading a recent survey by Sunita Sarawagi: http://osm.cs.byu.edu/CS652s09/papers/Sarawagi.ieSurvey.pdf – Skarab Dec 01 '10 at 17:49

score 5 · Answer 1 · answered Nov 30 '10 at 06:23


Since you asked for pointers: LingPipe (already mentioned above), OpenNLP, and Stanford NLP distributions.

Note: if Python is an option, you can use the Natural Language Toolkit.
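For the two example sentences in the question, a hand-written rule-based baseline can be a useful reference point before training anything. The following is only a minimal sketch using the standard `java.util.regex` package; it does not use the API of LingPipe, OpenNLP, or Stanford NLP, and the class name `MeasurementExtractor`, the method `extract`, and both patterns are hypothetical, chosen just to cover those two sentences.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule-based baseline; not part of any NLP toolkit.
public class MeasurementExtractor {

    // Height: a number followed by a length unit, e.g. "6 feet".
    private static final Pattern HEIGHT = Pattern.compile(
            "(\\d+(?:\\.\\d+)?)\\s*(feet|foot|ft)\\b",
            Pattern.CASE_INSENSITIVE);

    // Weight: "weighs 200 pounds" or "weight is 200"; the unit is optional.
    private static final Pattern WEIGHT = Pattern.compile(
            "\\bweigh(?:s|t\\s+is)\\s+(\\d+(?:\\.\\d+)?)(?:\\s*(pounds|lbs|lb))?",
            Pattern.CASE_INSENSITIVE);

    // Returns the first height and weight mentions found in the text, if any.
    public static Map<String, String> extract(String text) {
        Map<String, String> result = new LinkedHashMap<>();

        Matcher h = HEIGHT.matcher(text);
        if (h.find()) {
            result.put("height", h.group(1) + " " + h.group(2).toLowerCase());
        }

        Matcher w = WEIGHT.matcher(text);
        if (w.find()) {
            String unit = w.group(2); // null when no unit follows the number
            result.put("weight",
                    unit == null ? w.group(1) : w.group(1) + " " + unit.toLowerCase());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(extract("He is 6 feet tall and weighs 200 pounds"));
        System.out.println(extract("His height is 6 feet and weight is 200"));
    }
}
```

Rules like these break as soon as the phrasing varies (for example "six foot two" or "200 lb man"), which is the gap a trainable annotator is meant to close.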


Sujith Surendranathan


+1, the best start is to go for NLP programming frameworks, because at this stage a beginner does not need to spend time on the architectural details of solutions such as GATE or Apache UIMA. – Skarab Dec 01 '10 at 11:56
@Skarab I disagree, @NLP states he wants to create an annotation engine for fact extraction and that's exactly what GATE and UIMA are designed for. The libraries mentioned above will do lexical and syntactic analysis but there's still a lot of work to do after that. – Stompchicken Dec 01 '10 at 15:33
@StompChicken Recently I supervised a student project in which the participants decided to use Apache UIMA. It took them a lot of time before they had learnt enough to build their first real extraction pipelines. Personally, I use UIMA and can recommend it, but only after getting some first experience with LingPipe or the Natural Language Toolkit. – Skarab Dec 01 '10 at 17:42
@Skarab I can't argue with the fact that UIMA is very heavyweight and hard to get started with. I do think GATE is a lot easier in that regard, by the way. I just think it's necessary in order to build a system flexible enough to actually do something useful. – Stompchicken Dec 01 '10 at 20:31

score 3 · Accepted Answer · answered Nov 30 '10 at 10:22


If you really want to use machine learning to train your annotator, then GATE is probably your best bet. Take a look at the chapter on machine learning in their user guide.


Stompchicken


@NLP don't forget to upvote StompChicken's answer, if you find it to be helpful. – dmcer Nov 30 '10 at 18:49

score 0 · Answer 3 · answered Aug 30 '17 at 17:55


I'd use named-entity recognition (NER). Here is the output I see for your input text:

You can try it here: http://deagol.cs.illinois.edu:8080


Daniel

