1

I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.

However, all I can figure out how to do is count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is much like just counting n-grams (and actually works less well than the n-grams).

from collections import Counter
import nltk

postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():
    # normalise POS frequency by token count; float() avoids integer division on Python 2
    self.pos_freq_dist_relative[pos] = freq / float(self.token_count)

Lots of people (Pearson, ETS Research, IBM, academics, etc.) use part-of-speech tags for deeper measures, but no one says how they have done it. How can part-of-speech tags be used for a 'deeper' measure of semantic text similarity?

Zach
  • They can't, not on their own, anyway. Part-of-speech tags generally tell you something about syntax, not about semantics, so they're not going to be useful in comparing meaning. Think about what information you take away from "cat" vs. NOUN. Does knowing that two texts contain verbs tell you anything about whether they are semantically similar? – aab Jul 12 '12 at 19:50
  • I agree with @aab as well. Perhaps POS tags can be used as a heuristic for determining lack of similarity (false entailment) rather than similarity. But the recall of such an approach may have to be very low so as not to drag precision down. – Kenston Choi Jul 13 '12 at 14:10

2 Answers

1

A more sophisticated tagger is required, such as the one described at http://phpir.com/part-of-speech-tagging/. You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is an artificial intelligence problem.

Nouns and capitalized (proper) nouns will usually be the subjects of the content. Adjectives will give some hint as to its polarity. You can also look at vagueness, clarity, power, weakness, and the types of words used. The possibilities are endless.
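As a rough illustration of the noun/adjective idea, here is a minimal sketch that splits already-tagged tokens into nouns (candidate subjects) and adjectives (candidate polarity cues) and compares two texts on each. It assumes the Penn Treebank tags that nltk.pos_tag returns by default (NN*, JJ*); the function names content_profile and jaccard are made up for this example, and Jaccard overlap is just one simple choice of comparison, not something the sources above prescribe.

```python
from collections import Counter

# Assumption: Penn Treebank tag prefixes, as produced by nltk.pos_tag.
NOUN_PREFIX = "NN"  # NN, NNS, NNP, NNPS
ADJ_PREFIX = "JJ"   # JJ, JJR, JJS


def content_profile(tagged_tokens):
    """Count nouns and adjectives in a list of (word, tag) pairs."""
    nouns = Counter()
    adjectives = Counter()
    for word, tag in tagged_tokens:
        if tag.startswith(NOUN_PREFIX):
            nouns[word.lower()] += 1
        elif tag.startswith(ADJ_PREFIX):
            adjectives[word.lower()] += 1
    return nouns, adjectives


def jaccard(a, b):
    """Jaccard overlap of the word sets of two counters (0.0 if both are empty)."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0
```

You would call content_profile(nltk.pos_tag(tokens)) for each text and then compare the noun counters and the adjective counters separately with jaccard. Note that this only works because the words themselves are kept; the tags are used as a filter, not as the thing being compared.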

user723220
0

Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.
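One feature you could try, sketched here under my own assumptions rather than taken from the book: instead of counting single tags, count POS tag n-grams (sequences of tags), which capture a little of how sentences are put together. The sketch below builds a feature dict of relative POS bigram frequencies, which is the dict shape NLTK classifiers accept as a featureset, and compares two texts by cosine similarity. The names pos_ngram_features and cosine are hypothetical, and the input is assumed to be the list of tags from nltk.pos_tag.

```python
import math
from collections import Counter


def pos_ngram_features(tags, n=2):
    """Relative frequency of each POS n-gram, as a feature dict."""
    ngrams = Counter(zip(*(tags[i:] for i in range(n))))
    total = sum(ngrams.values())
    if total == 0:
        return {}
    return {"pos_ngram(%s)" % " ".join(gram): count / total
            for gram, count in ngrams.items()}


def cosine(f1, f2):
    """Cosine similarity of two sparse feature dicts (0.0 if either is empty)."""
    dot = sum(value * f2.get(key, 0.0) for key, value in f1.items())
    norm1 = math.sqrt(sum(v * v for v in f1.values()))
    norm2 = math.sqrt(sum(v * v for v in f2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

For example, with tags_a = [tag for word, tag in nltk.pos_tag(tokens_a)] and likewise tags_b, cosine(pos_ngram_features(tags_a), pos_ngram_features(tags_b)) gives a score between 0 and 1. Keep in mind this measures similarity of sentence construction, not of meaning, so it is probably best used as one feature alongside lexical ones.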

Bram Vanroy
alexis