
I need to run an experiment and I am new to NLP. I have read books that explain the theoretical issues, but when it comes to practice I have found it hard to find a guide. If you know anything about NLP, especially the practical side, please point me in the right direction (useful books, tools and websites), because I feel lost.

What I am trying to do is take a text and find specific words in it, for example animals such as dogs, cats, etc. Then I need to extract each such word together with the 2 words on each side. For example:

I was watching TV with my lovely cat last night.

the extracted text will be

(my lovely cat last night)

This will be one training example for the machine learning tool.

Q1: There will be around 100 training examples similar to the one above. I used a tokenizer to extract words, but how can I extract specific words (in this example, all types of animals) together with the 2 words on each side? Do I need to use tags, for example, or what would you suggest?

Q2: Once I have these training examples, how can I prepare an appropriate dataset to give to the machine learning tool for training? What should I write in this dataset to specify the animal, do I need to provide other features, and how should I arrange them in the dataset?

Any advice might help me a lot, so please do not hesitate to share what you know.

up-up

2 Answers


What you are attempting to do is sometimes known as "Ontology Acquisition" or "Automated Ontology", and it is a pretty difficult problem. Most approaches come down to "words that are similar will tend to be used in similar contexts." The problem is that, while there are algorithms that successfully extract semantically meaningful relationships from data such as yours, going from "here are a bunch of terms that statistically share a common distribution with your seed terms" to "your seed terms are animal names, here are some other animal names" is challenging. For example, training on cat, dog, snake and bird might tell you that "mammal", "dachshund", "creature" and "biped" are used in similar contexts, which, depending on your requirements, may not be exactly what you need.
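To make the "similar words appear in similar contexts" idea concrete, here is a minimal sketch in plain Python. It is only an illustration under simple assumptions, not the method of any particular paper: the toy corpus, the window of 2 words, and the helper names `context_vectors` and `cosine` are all made up for this example.

```python
import math
import re
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For every word, count the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus, invented for illustration only.
toy_corpus = [
    "my lovely cat sleeps on the sofa",
    "my lovely dog sleeps on the floor",
    "the old car stands in the garage",
]

vectors = context_vectors(toy_corpus)
cat_dog = cosine(vectors["cat"], vectors["dog"])
cat_car = cosine(vectors["cat"], vectors["car"])
print("cat vs dog:", cat_dog)
print("cat vs car:", cat_car)
```

In this toy data "cat" and "dog" share all of their neighbouring words, so they score higher than "cat" and "car". On real text the same scores will also pull in words like "creature" or "mammal", which is exactly the difficulty described above.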

Below is a link to a research paper that implemented exactly what you are trying to do. The authors describe their approach to data representation and the algorithms they used, and they report at least some level of success on the animal name problem. In addition, tracking down their references may be a fruitful exercise.

http://www.cl.cam.ac.uk/~ah433/cluk.pdf

bdk

Let me begin by saying that, having been a self-taught engineer when I started working in NLP several years ago, I completely understand your frustration. I would suggest that you read the NLTK book, which is a wonderful introduction to applied NLP. In particular, read Chapters 3-7, which deal with processing raw text data to extract information and use it for tagging. The book is available online.

With regards to your specific question:

I think that it might be much easier to create a small list of animals and then extract the sentences from a corpus that contain these animal names. Wikipedia is one obvious source of such sentences. You can build your corpus using this method because you already know the names of the animals in each sentence.

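// PSEUDO CODE (Java-like; getWikipediaSentences, Tokenizer and addSentenceToCorpus are placeholders)
Set<String> animals = Set.of("dog", "dogs", "cat", "cats", "pig", "pigs", "cow", "cows", "lion", "lions", "lioness", "lionesses");
String[] sentences = getWikipediaSentences();
for (String sent : sentences) {
  for (String token : Tokenizer.getTokens(sent)) {
    if (animals.contains(token)) {
      addSentenceToCorpus(sent);
      break; // add each sentence once, even if it mentions several animals
    }
  }
  // sentences without an animal name are ignored
}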

You can then train your algorithm on these sentences and use the trained model to extract new animal names. There are caveats with this approach, since your "training data" is artificially collected, but it will be a good first experience nonetheless.
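For the first part of the question (the animal word plus 2 words on each side), a minimal sketch in plain Python could look like the following. The `ANIMALS` set, the regex tokenizer and the `extract_windows` helper are assumptions made for this illustration; a real experiment would probably use a proper tokenizer such as the ones covered in the NLTK book.

```python
import re

# Small seed list, assumed for illustration; extend it as needed.
ANIMALS = {"dog", "dogs", "cat", "cats", "pig", "pigs", "cow", "cows"}


def extract_windows(sentence, targets=ANIMALS, size=2):
    """Return one example per target word: the word plus up to `size`
    tokens on each side (fewer near the start or end of the sentence)."""
    tokens = re.findall(r"[A-Za-z']+", sentence)  # crude tokenizer
    examples = []
    for i, token in enumerate(tokens):
        if token.lower() in targets:
            examples.append({
                "left": tokens[max(0, i - size):i],
                "target": token,
                "right": tokens[i + 1:i + 1 + size],
            })
    return examples


examples = extract_windows("I was watching TV with my lovely cat last night.")
print(examples)
# [{'left': ['my', 'lovely'], 'target': 'cat', 'right': ['last', 'night']}]
```

Keeping the left context, the target and the right context as separate fields is one possible way to lay out a dataset row, since each position can then be turned into its own feature (word, part-of-speech tag, and so on) for whatever learning tool you choose.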

hashable