
I'm a beginner in the field of artificial intelligence. I can use GATE or any other natural language processing tool, but I don't have an answer to this:

Do you know how to evaluate how close two sentences are, even with a large data set?

Do you have any recommendations? I could use the number of permutations, the length, the number of tokens, Metaphone encodings, etc., but I don't know which measure I should use.

My goal is:

- "Hello Jarvis"
- "Hello Romain, how are you"

- "Hello arvis"
- "Hello Romain, how are you"

- "Hello mister Swift"
- I don't know what you are expecting. Is this like "Hello Jarvis"?
- Yes
- Ok, Hello Romain, How are you?

- "Hello mister swift, how are you?"
- I don't know what you are expecting.

Example

The values 1, 2, 3 or n are just an example of a similarity scale.

Basic

- "Hello IA" is close to
   - "Hello IA" by 0
   - "Hello AI" by 1

- "Hello Jarvis" is close to
   - "Hello AI" by 2
   - "Hello IA" by 2

- "Hello! mister Swift" is close to
   - "Hello AI" by 3
   - "Hello IA" by 3
   - "Hello Jarvis" by 2

Less Basic

- "Hello IA" is (token length, token word, grammatically, syntactically) close to
   - "Hello IA" by (0,0,0,0)
   - "Hello AI" by (0,1,0,0)

- "Hello Jarvis" is close to
   - "Hello AI" by (0,2,1,1)
   - "Hello IA" by (0,2,1,1)

- "Hello! mister Swift" is close to
   - "Hello AI" by (1,2,2,2)
   - "Hello IA" by (1,2,2,2)
   - "Hello Jarvis" by (1,2,2,2)
merlin

2 Answers


If you are ready to learn hard-core NLP, you may use a classifier for this task. Have a look, for instance, at Stanford NLP (Java) or NLTK (Python).
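Before training a real classifier, a baseline may help you see where simple matching stops being enough. The sketch below is not Stanford NLP or NLTK: it swaps in `difflib.SequenceMatcher` from the Python standard library and picks the closest known phrase. The names `KNOWN_PHRASES`, `normalize`, `classify` and both thresholds are assumptions made up for illustration, and the thresholds would need tuning on your own data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical table of phrases the assistant already understands.
KNOWN_PHRASES = {"hello jarvis": "Hello Romain, how are you?"}

ACCEPT = 0.85   # assumed threshold: close enough to answer directly
CONFIRM = 0.45  # assumed threshold: close enough to ask for confirmation


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def classify(sentence):
    """Return (decision, closest known phrase, similarity score)."""
    query = normalize(sentence)
    best, score = max(
        ((known, SequenceMatcher(None, query, known).ratio())
         for known in KNOWN_PHRASES),
        key=lambda pair: pair[1],
    )
    if score >= ACCEPT:
        return "accept", best, score
    if score >= CONFIRM:
        return "confirm", best, score
    return "reject", best, score


if __name__ == "__main__":
    for s in ["Hello Jarvis", "Hello arvis", "Hello mister Swift",
              "Hello mister swift, how are you?"]:
        print(s, "->", classify(s))
```

With these assumed thresholds, the four inputs from the question should come out as accept, accept, confirm and reject, in that order. This compares every input against every known phrase, so it will not scale to a large data set; that is where learned features and a proper classifier come in.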

If you want to keep things simple and use an out-of-the-box solution, have a look at the Wit.ai API: it does exactly what you need, and more.

Blacksad

hard-core <3 I checked Wit and it's really good, but my project needs to work without internet. Anyway, it's just a small part (human input) of a personal project, so I can learn. **My problem is really "what to use?" and "how to combine?"** I found [link](http://sourceforge.net/projects/semantics/), but it only handles semantics (classifier4j) and I don't think it's really appropriate for a large data set. PS: I had a look at Stanford NLP; I think I will include it, great lib. – merlin Sep 08 '14 at 06:22

One way to determine string similarity is to use string kernels. There's a good paper by Lodhi et al. explaining how this works:

http://machinelearning.wustl.edu/mlpapers/paper_files/LodhiSSCW02.pdf

In order to create a classifier using CoreNLP you would have to create features for the string, such as n-grams, lemmas or similar.
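To make the n-gram idea concrete, here is a minimal sketch in plain Python (not CoreNLP). Note that it uses the simpler character n-gram ("spectrum") kernel, which only counts shared contiguous substrings, rather than the gap-weighted subsequence kernel described by Lodhi et al. The function names `char_ngrams`, `ngram_kernel` and `similarity`, and the choice of `n=3`, are assumptions for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_kernel(a, b, n=3):
    """Dot product of the two n-gram count vectors."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return sum(count * grams_b[gram] for gram, count in grams_a.items())


def similarity(a, b, n=3):
    """Kernel normalized to [0, 1]; this is the cosine of the count vectors."""
    norm = math.sqrt(ngram_kernel(a, a, n) * ngram_kernel(b, b, n))
    if norm == 0:
        return 0.0  # at least one string is shorter than n characters
    return ngram_kernel(a, b, n) / norm


if __name__ == "__main__":
    for other in ["Hello Jarvis", "Hello arvis", "Hello mister Swift"]:
        print(other, similarity("Hello Jarvis", other))
```

The same n-gram counts could also be fed to a classifier as features instead of being compared directly, which is likely the better route once you have many example sentences per intent.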

langkilde