-1

I need to measure the similarity between two profiles, each of which is described in free text. Using that profile data, how can I compute the similarity between them? Can you suggest an approach?

Srini

2 Answers

0

You can do a literature review, decompose the problem into sub-problems, or apply existing solutions, depending on how you frame the problem. For example, if you treat it as an application of text clustering, you can apply existing sentence-similarity measures.

Keyword matching seems to be the simplest solution. This baseline only requires you to identify the named entities and count the matches. You can also apply term weighting in the process.
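To make the baseline concrete, here is a minimal sketch of weighted keyword matching. It assumes plain lowercase tokens stand in for named entities (no entity recognizer is used), and the class name `KeywordOverlap`, the `weights` map, and the weight values are all hypothetical choices for illustration. The score is the total weight of the shared keywords divided by the total weight of all keywords, i.e. a weighted Jaccard overlap.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: weighted keyword overlap between two profile texts. */
class KeywordOverlap {

  /** Lowercases the text and splits it on runs of non-alphanumeric characters. */
  static Set<String> keywords(String text) {
    Set<String> result = new HashSet<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty()) {
        result.add(token);
      }
    }
    return result;
  }

  /**
   * Weighted Jaccard score: weight of the shared keywords divided by the
   * weight of all keywords. Terms missing from the weights map count as 1.0.
   */
  static double score(String profileA, String profileB, Map<String, Double> weights) {
    Set<String> a = keywords(profileA);
    Set<String> b = keywords(profileB);
    Set<String> union = new HashSet<>(a);
    union.addAll(b);
    double shared = 0;
    double total = 0;
    for (String term : union) {
      double w = weights.getOrDefault(term, 1.0);
      total += w;
      if (a.contains(term) && b.contains(term)) {
        shared += w;
      }
    }
    // two empty profiles share nothing, so report 0 rather than dividing by 0
    return total == 0 ? 0 : shared / total;
  }

  public static void main(String[] args) {
    // assumed weights: down-weight function words so they barely affect the score
    Map<String, Double> weights = Map.of("and", 0.1, "to", 0.1);
    System.out.println(score("bob likes to ride bikes and hiking",
        "jim likes bikes and also enjoys hiking", weights));
  }
}
```

In practice the weights would more likely come from corpus statistics such as inverse document frequency than from a hand-written map.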

The complexity of the solution depends on the structure of the texts (are the profiles more like LinkedIn profiles or like résumés?) and on the likelihood of false positives (are names and birth dates always present, and are they sufficient to establish similarity?). You did not provide an example for us to see.

Kenston Choi
0

OpenNLP does not really have a utility for this. I suggest you start with a simple approach and work from there: vectorize each profile description, then compare the vectors with a standard similarity measure.

The next problem you will likely face is comparing every profile against every other one; at that point you have entered the realm of clustering. You should also think about noise removal, stopword removal, and possibly stemming to produce better tokens.

Here is an example using cosine similarity. It is only an illustration; the most important decision you will make is what to put in your vector.

import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

/**
 * Crudely compares two profile strings using cosine similarity
 * over binary bag-of-words vectors.
 */
public class SimpleProfileComparer {

  public static void main(String[] args) {
    String[] profileA = "bob likes to ride bikes and hiking".split(" ");
    String[] profileB = "jim likes bikes and also enjoys hiking".split(" ");
    SortedMap<String, Double> a = new TreeMap<>();
    for (String string : profileA) {
      a.put(string, 1d);
    }
    SortedMap<String, Double> b = new TreeMap<>();
    for (String string : profileB) {
      b.put(string, 1d);
    }
    // pad both vectors with zeros so they share the same set of keys
    Set<String> keys = new HashSet<>();
    keys.addAll(a.keySet());
    keys.addAll(b.keySet());
    for (String key : keys) {
      a.putIfAbsent(key, 0d);
      b.putIfAbsent(key, 0d);
    }
    double similarity = compare(a, b);
    System.out.println(similarity);
  }

  public static Double compare(SortedMap<String, Double> a, SortedMap<String, Double> b) {
    // both vectors must share the same keys (pad missing terms with 0 before this call)
    if (!a.keySet().equals(b.keySet())) {
      throw new IllegalArgumentException("vectors must have the same keys");
    }
    double magA = 0;
    double magB = 0;
    double dotProd = 0;
    for (String key : a.keySet()) {
      double valA = a.get(key);
      double valB = b.get(key);
      // accumulate the sums of squares for the two magnitudes
      magA += valA * valA;
      magB += valB * valB;
      // accumulate the dot product
      dotProd += valA * valB;
    }
    magA = Math.sqrt(magA);
    magB = Math.sqrt(magB);
    // an all-zero vector has no direction, so report 0 instead of NaN
    if (magA == 0 || magB == 0) {
      return 0d;
    }
    return dotProd / (magA * magB);
  }

}
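
The point about stopwords and about choosing what goes into the vector could be sketched roughly as follows. This is one possible variation, not a definitive implementation: it assumes a tiny hand-picked stopword list, uses raw term counts instead of 0/1 flags, and skips stemming entirely. The class name `ProfileVectorizer` and its methods are hypothetical. Because the vectors are sparse maps in which a missing key counts as 0, no padding step is needed before the comparison.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: term-count vectors with stopwords removed. */
class ProfileVectorizer {

  // Tiny illustrative stopword list (an assumption); a real list would be much longer.
  static final Set<String> STOPWORDS = Set.of("a", "also", "and", "the", "to");

  /** Maps each non-stopword token to the number of times it occurs in the text. */
  static Map<String, Double> vectorize(String text) {
    Map<String, Double> counts = new HashMap<>();
    for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!token.isEmpty() && !STOPWORDS.contains(token)) {
        counts.merge(token, 1d, Double::sum);
      }
    }
    return counts;
  }

  /** Cosine similarity of two sparse vectors; a missing key counts as 0. */
  static double cosine(Map<String, Double> a, Map<String, Double> b) {
    double dot = 0;
    for (Map.Entry<String, Double> e : a.entrySet()) {
      dot += e.getValue() * b.getOrDefault(e.getKey(), 0d);
    }
    double norm = Math.sqrt(sumOfSquares(a)) * Math.sqrt(sumOfSquares(b));
    // an empty vector has no direction, so report 0 instead of NaN
    return norm == 0 ? 0 : dot / norm;
  }

  private static double sumOfSquares(Map<String, Double> v) {
    double sum = 0;
    for (double x : v.values()) {
      sum += x * x;
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String, Double> a = vectorize("bob likes to ride bikes and hiking");
    Map<String, Double> b = vectorize("jim likes bikes and also enjoys hiking");
    System.out.println(cosine(a, b));
  }
}
```

Dropping the stopwords means that shared function words such as "and" no longer inflate the score, so the result reflects only the content words the two profiles have in common.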
Mark Giaconia