0

I want to write a program that can identify whether a word or phrase represents a thing/product. For example: 1) "A glove comprising at least an index finger receptacle, a middle finger receptacle.." <- identify "glove" as a thing/product. 2) "In a window regulator, especially for automobiles, in which the window is connected to a drive..." <- identify "regulator" as a thing. Identifying such a word tells me that the text is about a thing/product. By contrast, the following text is about a process rather than a thing/product: "An extrusion coating process for the production of flexible packaging films of nylon coated substrates consisting of the steps of..."

I have millions of such texts, so doing this manually is not feasible. So far, using NLTK and Python, I have been able to identify some specific cases that share very similar keywords, but I have not been able to handle cases like the examples above. Any help will be appreciated!

singhalc
  • What have you already tried, and how do you expect the SO community to be able to help you? – user1998698 Feb 18 '15 at 01:12
  • @user1998698 : What I tried: if the text is worded like "In an **apparatus** for..." and contains generic keywords such as apparatus/device, I do a simple keyword search and comparison to classify the text as talking about a 'thing/product'. But if the text contains the name of an actual product, like glove or engine, I don't know how to identify that word as a thing/product. The SO community can help me by suggesting a way to implement this. Can this be done, and if so, how? A code example would be ideal, but a pointer to a useful function or concept in NLP, NLTK or beyond will work too! – singhalc Feb 18 '15 at 02:12
  • Have you tried the Stanford NLP utilities? nlp.stanford.edu/research.shtml – fps Feb 18 '15 at 02:55
  • @Magnamag : No, not yet. Could you please point me to something more specific? There are many Stanford NLP research streams. – singhalc Feb 18 '15 at 03:40

2 Answers

0

What you want to do is actually pretty difficult. It is a sort of (very specific) semantic labelling task. The possible solutions are:

  • create your own labelling algorithm: build training data, test, evaluate and finally tag your data
  • use an existing knowledge base (lexicon) to extract semantic labels for each target word

The first option is a complex research project in itself. Do it if you have the time and resources.

The second option will only give you the labels that are available in the knowledge base, and these might not match what you need. I would give it a try with Python, NLTK and WordNet (NLTK already provides an interface to it); you might be able to use synset hypernyms for your problem.
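A minimal sketch of the hypernym idea, assuming NLTK with the WordNet corpus installed. The function names and the choice of `artifact.n.01` (WordNet's synset for man-made objects) as the cut-off are illustrative assumptions, not part of the answer above. The lookup is passed in as a parameter so the decision logic can be tried without WordNet.

```python
from typing import Callable, List

# WordNet synset for man-made objects; using it as the cut-off is an
# assumption of this sketch.
ARTIFACT = "artifact.n.01"


def wordnet_paths(word: str) -> List[List[str]]:
    """Return synset names along every hypernym path of every noun sense."""
    # Imported lazily so the logic below can be tried without NLTK.
    # Requires nltk and a one-time nltk.download("wordnet").
    from nltk.corpus import wordnet as wn

    return [
        [synset.name() for synset in path]
        for sense in wn.synsets(word, pos=wn.NOUN)
        for path in sense.hypernym_paths()
    ]


def looks_like_product(
    word: str,
    paths_for: Callable[[str], List[List[str]]] = wordnet_paths,
) -> bool:
    """True if at least one noun sense of `word` descends from ARTIFACT."""
    return any(ARTIFACT in path for path in paths_for(word))
```

With NLTK available, `looks_like_product("glove")` performs the single lookup per target word. Accepting any sense is deliberately permissive: a word such as "regulator" also has non-product senses, so a stricter rule (for example, only the first sense) may suit some data better.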

emiguevara
  • Thank you for your reply. The second option talks about using a database of words, in lay English, but wouldn't that mean this database would contain basically every word that can represent a thing on this planet? To elaborate, my dataset can talk about anything that can be manufactured. This means that I would have to compare my dataset to a database with a humongous list of possible product words? – singhalc Feb 19 '15 at 19:08
  • Hi. You are more or less on the right track. A lexical database cannot possibly contain every word, as this would be infeasible, but only a fairly large number of them (Wordnet 3.0 has about 160000 different strings, 120000 nouns). Your application does not need to compare each word to every entry in the database, that would be silly. You can design it in many ways, but I suppose that for each target word, a single lookup should give you that word's synset hypernyms and, with that information, you should be able to make a decision. – emiguevara Feb 20 '15 at 13:04
  • Could you please elaborate with an example? So say, if I were to consider the first example in my question where the target is 'glove'. – singhalc Feb 21 '15 at 03:21
  • No, I can't. Read the documentation for Wordnet and see if it can help you. – emiguevara Feb 23 '15 at 10:12
-1

This task is a named entity recognition (NER) problem.

EDIT: There is no clean definition of NER in the NLP community, so one can say this is not an NER task but an instance of the more general sequence labeling problem. Either way, there is still no tool that can do this out of the box.

Out of the box, Stanford NLP can only recognize the following types:

Recognizes named (PERSON, LOCATION, ORGANIZATION, MISC), numerical (MONEY, NUMBER, ORDINAL, PERCENT), and temporal (DATE, TIME, DURATION, SET) entities

so it is not suitable for solving this task. There are some commercial solutions that can possibly do the job; they can be readily found by googling "product name named entity recognition", and some of them offer free trial plans. I don't know of any free, ready-to-deploy solution.

Of course, you can create your own model by hand-annotating about 1000 or so sentences that contain product names and training a classifier such as a Conditional Random Field (CRF) with some basic features (here is the documentation page that explains how to do that with Stanford NLP). This solution should work reasonably well, though it won't be perfect (no system will be perfect, but some solutions are better than others).

EDIT: This is a complex task per se, but not that complex unless you want state-of-the-art results. You can create a reasonably good model in just 2-3 days. Here is an (example) step-by-step instruction for doing this with an open source tool:

  1. Download CRF++ and look at the provided examples; they are in a simple text format
  2. Annotate your data in a similar manner
    a OTHER 
    glove PRODUCT 
    comprising OTHER
    ... 

and so on.
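Hand-annotating from scratch is slow, so one option is to pre-fill the labels from a small seed list and then correct them by hand. The sketch below writes one `token LABEL` row per token, as in the example above; the helper name `to_crf_rows` and the seed-list idea are assumptions added here for illustration. CRF++ expects an empty line between sentences, which the caller has to add when writing several sentences to one file.

```python
from typing import Iterable, List, Set


def to_crf_rows(tokens: Iterable[str], product_tokens: Set[str]) -> List[str]:
    """Build one 'token LABEL' row per token using a seed list of products.

    The labels are only a first guess and are meant to be corrected by hand.
    """
    rows = []
    for token in tokens:
        label = "PRODUCT" if token.lower() in product_tokens else "OTHER"
        rows.append(f"{token} {label}")
    return rows


sentence = "A glove comprising at least an index finger receptacle".split()
rows = to_crf_rows(sentence, {"glove"})
print("\n".join(rows))
```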

Split your annotated data into two files: train (80%) and dev (20%).
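A rough sketch of the 80/20 split, assuming the annotated file separates sentences with blank lines. Splitting whole sentences rather than individual rows keeps each sequence intact. The function name, the fixed seed and the rounding rule are choices made for this sketch only.

```python
import random
import re
from typing import Tuple


def split_sentences(
    annotated: str, dev_fraction: float = 0.2, seed: int = 0
) -> Tuple[str, str]:
    """Split blank-line-separated sentence blocks into (train, dev) text."""
    # Each block is one annotated sentence: consecutive 'token LABEL' rows.
    blocks = [b.strip() for b in re.split(r"\n\s*\n", annotated) if b.strip()]
    # Shuffle reproducibly so repeated runs give the same split.
    random.Random(seed).shuffle(blocks)
    n_dev = max(1, round(len(blocks) * dev_fraction))
    dev, train = blocks[:n_dev], blocks[n_dev:]
    return "\n\n".join(train) + "\n", "\n\n".join(dev) + "\n"
```

The two returned strings can be written to `train.txt` and `dev.txt` respectively.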

  3. Use the following baseline feature templates (paste them into the template file):

    U00:%x[-2,0]
    U01:%x[-1,0]
    U02:%x[0,0]
    U03:%x[1,0]
    U04:%x[2,0]
    U05:%x[-1,0]/%x[0,0]
    U06:%x[0,0]/%x[1,0]
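To see what such a template line produces, here is a small illustration of how a `%x[row,col]` macro is filled in: `row` is an offset relative to the current token and `col` is the column index in the data file. This is a toy re-implementation for explanation only, not CRF++ code; in particular the `<PAD>` placeholder for positions outside the sentence is an assumption of the sketch, since CRF++ uses its own boundary markers.

```python
import re
from typing import List

# Matches macros such as %x[-1,0] and captures the row offset and column.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template_line: str, rows: List[List[str]], pos: int) -> str:
    """Fill every %x[row,col] macro relative to the token at index `pos`."""

    def fill(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return "<PAD>"  # placeholder used by this sketch only

    return MACRO.sub(fill, template_line)


# One inner list per token; column 0 holds the word itself.
rows = [["a"], ["glove"], ["comprising"]]
print(expand("U05:%x[-1,0]/%x[0,0]", rows, 1))  # prints U05:a/glove
```

So `U05` is a bigram feature over the previous and current word, and `U06` is the same for the current and next word.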

  4. Run:

    crf_learn template train.txt model
    crf_test -m model dev.txt > result.txt
  5. Look at result.txt. One column will contain your hand-labeled data and the other the machine-predicted labels. You can then compare these, compute accuracy, etc. After that you can feed new unlabeled data into crf_test and get your labels.
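The comparison can be sketched as follows, assuming each non-blank line of result.txt ends with the hand-assigned label followed by the predicted label. Since most tokens are OTHER, plain accuracy can look good even when few products are found, so the sketch also reports precision and recall for the PRODUCT label. The function name and the returned dictionary are illustrative.

```python
from typing import Dict


def evaluate(result_text: str, target: str = "PRODUCT") -> Dict[str, float]:
    """Token accuracy plus precision and recall for the `target` label.

    Assumes each non-blank line ends with: gold_label predicted_label
    """
    correct = total = tp = fp = fn = 0
    for line in result_text.splitlines():
        cols = line.split()
        if len(cols) < 3:
            continue  # skip blank sentence separators and malformed rows
        gold, pred = cols[-2], cols[-1]
        total += 1
        correct += gold == pred
        tp += gold == target and pred == target
        fp += gold != target and pred == target
        fn += gold == target and pred != target
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

Typical use would be `evaluate(open("result.txt").read())`.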

As I said, this won't be perfect, but I will be very surprised if it isn't reasonably good (I actually solved a very similar task not long ago), and it will certainly be better than using just a few keywords/templates.

ENDNOTE: this ignores many things, including some best practices for solving such tasks. It won't be good enough for academic research and is not 100% guaranteed to work, but it is still useful as a relatively quick solution for this and many similar problems.

Denis Tarasov
  • You have to check what you say... "glove", "regulator", "process" are not named entities. – emiguevara Feb 18 '15 at 11:43
  • @emiguevara "[Named entity recognition](http://en.wikipedia.org/wiki/Named-entity_recognition) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories". Categories can be anything. In this case category is "thing/product". Method that I suggest will work for this task. – Denis Tarasov Feb 18 '15 at 12:26
  • I stand by my previous comment, no need to dust off Wikipedia. "Glove", "regulator", "process" are not named entities in the sense that is common in the NER literature. "Obama", "the president of the US", "Boston", "IBM" are usual examples of what you get from NER. – emiguevara Feb 18 '15 at 12:42
  • @emiguevara this is a dispute over terminology. OK, let's say this is an instance of the sequence labeling problem. This won't change much – Denis Tarasov Feb 18 '15 at 12:45