I have a collection of "articles", each 1 to 10 sentences long, written in a noisy, informal english (i.e. social media style). I need to extract some information from each article, where available, like date and time. I also need to understand what the article is talking about and who is the main "actor".
Example, given the sentence: "Everybody's presence is required tomorrow morning starting from 10.30 to discuss the company's financial forecast.", I need to extract:
- the date/time => "10.30 tomorrow morning".
- the topic => "company's financial forecast".
- the actor => "Everybody".
As far as I know, the date and time could be extracted without using NLP techniques but I haven't found anything as good as Natty (http://natty.joestelmach.com/) in Python.
My understanding on how to proceed after reading some chapters of the NLTK book and watching some videos of the NLP courses on Coursera is the following:
- Use part of the data to create an annotated corpus. I can't use off-the-shelf corpus because of the informal nature of the text (e.g. spelling errors, uninformative capitalization, word abbreviations, etc...).
- Manually (sigh...) annotate each article with tags from the Penn TreeBank tagset. Is there any way to automate this step and just check/fix the results ?
- Train a POS tagger on the annotated article. I've found the NLTK-trainer project that seems promising (http://nltk-trainer.readthedocs.org/en/latest/train_tagger.html).
- Chunking/Chinking, which means I'll have to manually annotate the corpus again (...) using the IOB notation. Unfortunately according to this bug report n-gram chunkers are broken: https://github.com/nltk/nltk/issues/367. This seems like a major issue, and makes me wonder whether I should keep using NLTK given that it's more than a year old.
- At this point, if I have done everything correctly, I assume I'll find actor, topic and datetime in the chunks. Correct ?
Could I (temporarily) skip 1,2 and 3 and produce a working, but possibly with a high error rate, implementation ? Which corpus should I use ?
I was also thinking of a pre-process step to correct common spelling mistakes or shortcuts like "yess", "c u" and other abominations. Anything already existing I can take advantage of ?
THE question, in a nutshell, is: is my approach at solving this problem correct ? If not, what am I doing wrong ?