Extracting information from unstructured text

Question

I have a collection of "articles", each 1 to 10 sentences long, written in a noisy, informal english (i.e. social media style). I need to extract some information from each article, where available, like date and time. I also need to understand what the article is talking about and who is the main "actor".

Example, given the sentence: "Everybody's presence is required tomorrow morning starting from 10.30 to discuss the company's financial forecast.", I need to extract:

the date/time => "10.30 tomorrow morning".
the topic => "company's financial forecast".
the actor => "Everybody".

As far as I know, the date and time could be extracted without using NLP techniques but I haven't found anything as good as Natty (http://natty.joestelmach.com/) in Python.

My understanding on how to proceed after reading some chapters of the NLTK book and watching some videos of the NLP courses on Coursera is the following:

Use part of the data to create an annotated corpus. I can't use off-the-shelf corpus because of the informal nature of the text (e.g. spelling errors, uninformative capitalization, word abbreviations, etc...).
Manually (sigh...) annotate each article with tags from the Penn TreeBank tagset. Is there any way to automate this step and just check/fix the results ?
Train a POS tagger on the annotated article. I've found the NLTK-trainer project that seems promising (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html).
Chunking/Chinking, which means I'll have to manually annotate the corpus again (...) using the IOB notation. Unfortunately according to this bug report n-gram chunkers are broken: https://github.com/nltk/nltk/issues/367. This seems like a major issue, and makes me wonder whether I should keep using NLTK given that it's more than a year old.
At this point, if I have done everything correctly, I assume I'll find actor, topic and datetime in the chunks. Correct ?

Could I (temporarily) skip 1,2 and 3 and produce a working, but possibly with a high error rate, implementation ? Which corpus should I use ?

I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of ?

THE question, in a nutshell, is: is my approach at solving this problem correct ? If not, what am I doing wrong ?

On the academic level, this is http://en.wikipedia.org/wiki/Semantic_role_labeling . — cyborg, Mar 26 '14 at 10:23

score 4 · Accepted Answer · answered Sep 05 '14 at 01:16

Could I (temporarily) skip 1,2 and 3 and produce a working, but possibly with a high error rate, implementation ? Which corpus should I use ?

I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of ?

I would suggest you first have a go at processing standard language text. The pre-processing you refer to is an NLP task in its own right, known as normalization. Here is a resource for Twitter normalization: http://www.ark.cs.cmu.edu/TweetNLP/ , additionally, you can use spell checking, sentence boundary detection, ...

THE question, in a nutshell, is: is my approach at solving this problem correct ? If not, what am I doing wrong ?

If you make abstraction of normalization, I think your approach is valid. With regard to automating the annotation process: you can bootstrap the process by using off-the-shelf components first, after which you correct, retrain, and so on, ... during different iterations. To get acceptable results, you will need to do your steps 2, 3, and 4 a couple of times.

If you are interested in understanding the problem and being able to optimize existing solutions, I would suggest you focus on tools that allow you to develop your own models. If you prioritize getting results over being able to develop your own models, I would recommend looking into existing open source text engineering frameworks such as Gate (https://gate.ac.uk/) UIMA (http://uima.apache.org/) and DKPro (which extends UIMA) (https://code.google.com/p/dkpro-core-asl/). All three frameworks wrap existing components, so you have a wide range of possible solutions.

score 0 · Answer 2 · answered Aug 30 '17 at 17:58

0

I'd suggesting giving a try to NER and Temporal Normalizer. Here is what I see for your example sentence:

You can try the demo here: http://deagol.cs.illinois.edu:8080/

answered Aug 30 '17 at 17:58

Daniel

5,839
9
46
85

Extracting information from unstructured text

2 Answers2