I have a review data set of about 250,000 hotel reviews, and I'm planning to extract aspects from it using the CRFSharp DLL. However, the data is in plain paragraph form, and I need to convert it into the CRFSharp format so I can train and test a model to extract aspects. What would be the best way to do that? I was thinking of writing a small program for the format conversion. I was also wondering whether CRFSharp can do aspect extraction with the CRF models it provides. I'm using C#.
1 Answer
Which features and tags will you use in your task? Here is the simplest example. For the sentence "! Tokyo and New York are major financial centers.", if you want to extract location names and your only feature is the token string, you can generate a training corpus as follows:
! NOR
Tokyo LOCATION
and NOR
New LOCATION
York LOCATION
are NOR
major NOR
financial NOR
centers NOR
. NOR
The first column is the token from the sentence, and the second column is the corresponding tag. NOR means a normal term, and LOCATION means a location name. You can generate a training corpus in the above format and use CRFSharp to train a model.
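To illustrate the conversion from paragraph text to this two-column layout, here is a minimal sketch. It is written in Python for brevity, and the same steps port directly to a small C# program. Everything in it is an assumption for illustration: the `ASPECT` tag name, the tiny `ASPECT_TERMS` lexicon used to assign tags, the naive sentence splitter, and the tab separator with a blank line between sentences (the CRF++-style layout, which I believe CRFSharp also expects, but check its documentation). Real training data would normally be tagged by hand rather than by a word list.

```python
import re

# Hypothetical aspect lexicon, used only to produce example tags.
# Real training data would normally be annotated by hand.
ASPECT_TERMS = {"room", "staff", "breakfast", "location", "pool"}


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def to_corpus(paragraph, aspect_terms=ASPECT_TERMS):
    """Return token<TAB>tag lines, with one blank line after each sentence."""
    lines = []
    for sentence in split_sentences(paragraph):
        for token in tokenize(sentence):
            tag = "ASPECT" if token.lower() in aspect_terms else "NOR"
            lines.append(f"{token}\t{tag}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_corpus("The room was clean. Staff were rude!"), end="")
```

For 250,000 reviews you would call `to_corpus` once per review and append the result to a single training file, keeping a held-out portion in the same format for testing.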
For more complex examples, such as using more features, templates, or adding word position to tags, you can refer to another example on the CRFSharp home page (http://crfsharp.codeplex.com).
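As a rough illustration of what "adding word position to tags" can look like, the sketch below prefixes each entity tag with `B_` for the first token of an entity and `I_` for a continuation token. The prefix scheme and the function name `add_position_prefixes` are assumptions for illustration and may differ from the convention used in the CRFSharp example. It is in Python for brevity and ports directly to C#.

```python
def add_position_prefixes(tags, outside="NOR"):
    """Prefix entity tags with B_ (entity start) or I_ (entity continuation).

    Tokens tagged with `outside` are left unchanged. Consecutive tokens with
    the same entity tag are treated as one entity, so two adjacent entities
    of the same type cannot be told apart by this simple rule.
    """
    result = []
    previous = outside
    for tag in tags:
        if tag == outside:
            result.append(tag)
        elif tag == previous:
            result.append("I_" + tag)  # same entity continues
        else:
            result.append("B_" + tag)  # a new entity starts here
        previous = tag
    return result


if __name__ == "__main__":
    tokens = ["!", "Tokyo", "and", "New", "York", "are"]
    tags = ["NOR", "LOCATION", "NOR", "LOCATION", "LOCATION", "NOR"]
    for token, tag in zip(tokens, add_position_prefixes(tags)):
        print(f"{token}\t{tag}")
```

With position prefixes the model can learn that "New York" is a single two-token location rather than two separate ones.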

Yes, I did it long ago; I had to create a small C# application to make creating the training corpus fast. :) Thanks. – praxprog May 23 '14 at 06:55