How to use Parts-of-Speech to evaluate semantic text similarity?

Question

I'm trying to write a program to evaluate semantic similarity between texts. I have already compared n-gram frequencies between texts (a lexical measure). I wanted something a bit less shallow than this, and I figured that looking at similarity in sentence construction would be one way to evaluate text similarity.

However, all I can figure out how to do is count the POS tags (for example, 4 nouns per text, 2 verbs, etc.). This is then similar to just counting n-grams (and actually works less well than the n-grams).

import nltk
from collections import Counter

postags = nltk.pos_tag(tokens)
self.pos_freq_dist = Counter(tag for word, tag in postags)
for pos, freq in self.pos_freq_dist.items():
    # normalise POS frequency by token count (float() avoids integer division on Python 2)
    self.pos_freq_dist_relative[pos] = freq / float(self.token_count)

Lots of people (Pearson, ETS Research, IBM, academics, etc.) use Parts-of-Speech for deeper measures, but no one says how they do it. How can Parts-of-Speech be used for a 'deeper' measure of semantic text similarity?

They can't, not on their own, anyway. Part-of-speech tags generally tell you something about syntax, not about semantics, so they're not going to be useful in comparing meaning. Think about what information you take away from "cat" vs. NOUN. Does knowing that two texts contain verbs tell you anything about whether they are semantically similar? — aab, Jul 12 '12 at 19:50
I agree with @aab as well. Perhaps POS can be used as a heuristic for detecting a lack of similarity (false entailment) rather than similarity. But the recall of such an approach would probably have to be very low so as not to drag precision down. — Kenston Choi, Jul 13 '12 at 14:10

score 1 · Answer 1 · answered Jul 29 '12 at 19:39

A more sophisticated tagger is required, such as the one described at http://phpir.com/part-of-speech-tagging/. You will need to write algorithms and create word banks to determine the meaning or intention of sentences. Semantic analysis is an artificial intelligence problem.

Nouns and capitalized nouns will be the subjects of the content. Adjectives will give some hint as to the polarity of the content. Other signals include vagueness, clarity, power, weakness, and the types of words used. The possibilities are endless.
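
A minimal sketch of the noun idea, assuming Penn Treebank tags like the ones `nltk.pos_tag` returns in the question. The helper names `content_words` and `jaccard` are made up for illustration, and Jaccard overlap is only one of many ways to compare the resulting word sets:

```python
def content_words(tagged, prefixes=("NN",)):
    """Return the lowercased words whose tag starts with one of `prefixes`."""
    return {word.lower() for word, tag in tagged if tag.startswith(prefixes)}


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hand-tagged toy input; in practice this would come from nltk.pos_tag(tokens).
text_a = [("The", "DT"), ("cat", "NN"), ("chased", "VBD"), ("a", "DT"), ("mouse", "NN")]
text_b = [("A", "DT"), ("cat", "NN"), ("ate", "VBD"), ("the", "DT"), ("fish", "NN")]

# Compare only the nouns of the two texts.
noun_overlap = jaccard(content_words(text_a), content_words(text_b))
```

Passing `prefixes=("JJ",)` instead would select adjectives. Here the POS tags act as a filter that decides which words to compare; the comparison itself is still done on the words.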

score 0 · Answer 2 · edited Oct 23 '18 at 12:17

Take a look at chapter 6 of the NLTK Book. It should give you plenty of ideas for features you can use to classify text.
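
As one possible feature of that kind, here is a rough sketch that counts POS-tag n-grams and compares two texts by cosine similarity. It assumes input already tagged as `(word, tag)` pairs, as `nltk.pos_tag` produces; the function names `pos_ngram_features` and `cosine` are illustrative and not part of NLTK. Note that this captures sentence construction rather than meaning:

```python
from collections import Counter
from math import sqrt


def pos_ngram_features(tagged, n=2):
    """Count n-grams over the tag sequence, e.g. 'DT_NN' for a bigram."""
    tags = [tag for _, tag in tagged]
    grams = zip(*(tags[i:] for i in range(n)))
    return Counter("_".join(gram) for gram in grams)


def cosine(f1, f2):
    """Cosine similarity of two count dictionaries; 0.0 if either is empty."""
    dot = sum(f1[k] * f2[k] for k in f1.keys() & f2.keys())
    norm = sqrt(sum(v * v for v in f1.values())) * sqrt(sum(v * v for v in f2.values()))
    return dot / norm if norm else 0.0


# Hand-tagged toy input; in practice this would come from nltk.pos_tag(tokens).
sent_a = [("The", "DT"), ("cat", "NN"), ("sat", "VBD")]
sent_b = [("A", "DT"), ("dog", "NN"), ("ran", "VBD")]
sent_c = [("Run", "VB"), ("fast", "RB")]

same_structure = cosine(pos_ngram_features(sent_a), pos_ngram_features(sent_b))
different_structure = cosine(pos_ngram_features(sent_a), pos_ngram_features(sent_c))
```

The two sentences with the same tag sequence score highest even though they share no content words, which is why a feature like this is probably best combined with lexical ones.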

answered Jul 12 '12 at 18:08 by alexis · edited by Bram Vanroy
