The problem I'm facing is this: I want to read a document, get its raw string, and classify the pieces of information in it. For example, I want to identify when a string is a "name", a "date", or some other useful piece of information.

Is it possible to use machine learning to do that? How may I approach the problem?

The hardest part here is that I'm not trying to classify the document itself, but the string information inside the document.

Eduardo Briguenti Vieira
  • Why not? Just consider a String as a short text itself. Check these posts: http://stats.stackexchange.com/questions/118513/algorithm-recommendation-for-string-classification, http://stats.stackexchange.com/questions/79765/improve-precision-in-text-classification. – Vadim Shkaberda Jun 02 '16 at 14:14
  • Thanks for the feedback Vadim. I'll take a look – Eduardo Briguenti Vieira Jun 02 '16 at 16:18

1 Answer

It's all about how you frame the problem. I think yours can be formulated as an entity extraction/recognition problem: you have a document and want to identify specific entities within the text (where an entity might be a person, a date, etc.). Take a look at Conditional Random Fields and their applications to named entity recognition (NER for short), as there are libraries and tools already implemented.

For example, check out Stanford NER.
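To make the sequence-labeling framing a bit more concrete, here is a minimal sketch in plain Python of the two pieces that usually surround a CRF: per-token feature dictionaries (the kind of input a CRF learns from) and the conversion of per-token BIO tags back into entities. Everything here is an assumption for illustration: the feature names, the `NAME`/`DATE` labels, the example sentence, and the helper names `token_features` and `spans_from_bio` are all hypothetical, and the tags are written by hand rather than predicted by a trained model.

```python
import re


def token_features(tokens, i):
    """Describe token i with simple surface features (hypothetical feature set)."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_digit": tok.isdigit(),
        # Crude date shape check, e.g. 02/06/2016 (an assumption, not a full date parser).
        "looks_like_date": bool(re.fullmatch(r"\d{1,2}/\d{1,2}/\d{2,4}", tok)),
        # Neighbouring tokens give the model some context.
        "prev_lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def spans_from_bio(tokens, tags):
    """Collapse per-token BIO tags into (entity_type, text) pairs."""
    spans, current_type, current_tokens = [], None, []
    for tok, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [tok]
        elif tag.startswith("I-"):
            current_tokens.append(tok)
        else:  # "O": token is outside any entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-labelled example; a trained CRF would predict `tags` from `features`.
tokens = "Invoice issued to Maria Silva on 02/06/2016".split()
tags = ["O", "O", "O", "B-NAME", "I-NAME", "O", "B-DATE"]

features = [token_features(tokens, i) for i in range(len(tokens))]
entities = spans_from_bio(tokens, tags)
print(entities)
```

In a real setup you would label a set of documents this way, train a CRF (or use a pretrained tagger such as the one mentioned above) on the feature sequences, and then run `spans_from_bio` on the predicted tags to pull out the names and dates.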

rabbit