
I want to recognize entities in some texts that I have. I found many algorithms (Naive Bayes, Hidden Markov Models, Conditional Random Fields, etc.), but it seems that almost all of them need a huge amount of training data to classify the entities.

I want to know whether there is an algorithm that can recognize entities without texts as training data, using perhaps only a list of words representing the data I want to recognize, some string patterns, or another approach.

The only thing I want to avoid is needing a huge amount of text as training data.

Michael J. Barber
Renato Dinhani

1 Answer


If you have a short list of the kinds of named entities you'd like to find (usually called a "gazetteer") and no desire to manually annotate training data, you should look into work on bootstrapping named entity recognition. You can use bootstrapping either to extend a gazetteer or to develop a named entity recognizer. Some example approaches I found in a quick search are the following papers:

There's also been a fair amount of research on active learning for named entity recognition, which can significantly reduce the amount of training data that needs to be annotated if you do decide to do some manual annotation.
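To make the bootstrapping idea concrete, here is a minimal toy sketch in Python. It is not taken from any particular paper; the function name `bootstrap_gazetteer`, its parameters, and the "capitalized token" heuristic are all assumptions made for illustration. It alternates between two steps: learn the word contexts in which known gazetteer entries appear, then treat other capitalized words in those same contexts as new entries.

```python
from collections import Counter


def bootstrap_gazetteer(sentences, seeds, rounds=2, min_pattern_count=2):
    """Grow a seed gazetteer by alternating pattern induction and extraction.

    Hypothetical helper for illustration only; real systems use richer
    patterns and score candidates before accepting them.
    """
    entities = set(seeds)
    for _ in range(rounds):
        # Step 1: count (previous word, next word) contexts around known entities.
        contexts = Counter()
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(1, len(tokens) - 1):
                if tokens[i] in entities:
                    contexts[(tokens[i - 1], tokens[i + 1])] += 1
        # Keep only contexts seen often enough to be trusted.
        patterns = {c for c, n in contexts.items() if n >= min_pattern_count}

        # Step 2: capitalized tokens in a trusted context become candidates.
        found = set()
        for sentence in sentences:
            tokens = sentence.split()
            for i in range(1, len(tokens) - 1):
                in_pattern = (tokens[i - 1], tokens[i + 1]) in patterns
                if in_pattern and tokens[i][0].isupper():
                    found.add(tokens[i])

        if found <= entities:  # nothing new was learned, so stop early
            break
        entities |= found
    return entities


sentences = [
    "I flew to Paris last spring",
    "We flew to Paris last year",
    "They moved to Berlin last month",
    "He went to London last week",
    "I talked to him last night",
]
print(sorted(bootstrap_gazetteer(sentences, {"Paris"})))
```

Starting from the single seed "Paris", the sketch learns the context "to ... last" and uses it to pick up "Berlin" and "London", while the lowercase "him" is rejected. The main risk with this kind of approach is semantic drift: one bad entry can generate bad patterns, so in practice you would want a stricter threshold or a manual check between rounds.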

aab