Classifying words inside a document

Question

The problem I'm facing is this: I want to read a document, get its raw string, and classify the information in it. For example, I want to identify when a string is a "name", a "date", or some other useful piece of information.

Is it possible to use machine learning to do that? How may I approach the problem?

The hardest part is that I'm not trying to classify the document itself, but the strings inside the document.

Why not? Just consider a String as a short text itself. Check these posts: http://stats.stackexchange.com/questions/118513/algorithm-recommendation-for-string-classification, http://stats.stackexchange.com/questions/79765/improve-precision-in-text-classification. — Vadim Shkaberda, Jun 02 '16 at 14:14

score 2 · Accepted Answer · answered Jun 02 '16 at 15:13

It comes down to how you frame the problem. I think it can be formulated as an entity extraction (recognition) problem: you have a document and want to identify specific entities within the text, where an entity might be a person, a date, and so on. Take a look at Conditional Random Fields (CRFs) and their application to named entity recognition (NER for short), as some libraries and tools are already implemented.

For example, check out StanfordNER.
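
To make the framing concrete: in NER each token in the text gets its own label, and a CRF predicts those labels from per-token features that include some context. The sketch below only builds such features with the standard library; it does not train a model. The function names, the feature set, and the date pattern are illustrative assumptions, not part of StanfordNER or any other tool. Feature dictionaries of this shape are the kind of input CRF toolkits typically expect, paired with one hand-assigned label per token (for example PERSON, DATE, or O for "other").

```python
import re

# Illustrative tokenizer (an assumption): keeps slash- or dash-separated
# digit groups such as 02/06/2016 together as one token.
TOKEN_RE = re.compile(r"\d+(?:[/-]\d+)+|\w+|[^\w\s]")
# Rough date shape, for illustration only; real documents need more patterns.
DATE_RE = re.compile(r"\d{1,4}[/-]\d{1,2}[/-]\d{1,4}")


def tokenize(text):
    """Split raw text into word, number/date, and punctuation tokens."""
    return TOKEN_RE.findall(text)


def token_features(tokens, i):
    """Describe token i by its own shape and by its immediate neighbours."""
    word = tokens[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),          # capitalised words hint at names
        "is_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "looks_like_date": bool(DATE_RE.fullmatch(word)),
        # Context features: the labels depend on surrounding words too.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def sentence_features(text):
    """Return the tokens of `text` and one feature dict per token."""
    tokens = tokenize(text)
    return tokens, [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    tokens, feats = sentence_features("Eduardo signed the contract on 02/06/2016.")
    for tok, f in zip(tokens, feats):
        print(tok, f)
```

To go from features to predictions you still need labeled examples and a sequence model; a pretrained tool such as the one mentioned above avoids that step for common entity types.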

rabbit

Thanks for this tip. I think this is just what I'm looking for. My problem seems to be a common one, but I didn't know the acronym NER. Thanks. – Eduardo Briguenti Vieira Jun 02 '16 at 16:20
