When creating a similarity feature for ham vs. spam classification, should I include a spam text's similarity with itself in its average spam similarity?

Asked Apr 08 '20 at 23:43

Active Apr 09 '20 at 02:03

Viewed 31 times

I want to improve my model by adding a new feature column to my data, which consists of ham and spam texts. I have already computed the square cosine similarity matrix between all the texts; its diagonal entries are all 1 = cos(0).

I extract the indices of all spam texts in the training data and create a similarity column: for each text, the value is the average of the similarities between that text and every spam text.
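To make the construction concrete, here is a minimal sketch of the column as described above, assuming NumPy. The 4x4 matrix `sim`, the label array `is_spam`, and the name `spam_sim_feature` are made-up placeholders for illustration, not my real data. Note that in this version each spam row still includes its own diagonal entry of 1.

```python
import numpy as np

# Hypothetical toy data: symmetric cosine similarity matrix for 4 texts,
# with 1s on the diagonal (each text compared with itself).
sim = np.array([
    [1.0, 0.2, 0.6, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.6, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
])

# Hypothetical training labels: texts 1 and 3 are spam.
is_spam = np.array([False, True, False, True])

# Indices of all spam texts in the training data.
spam_idx = np.flatnonzero(is_spam)

# For every text, average its similarity to all spam texts.
# For spam rows this average includes the self-similarity of 1.
spam_sim_feature = sim[:, spam_idx].mean(axis=1)
```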

My question: for a ham text, the above makes sense. But for a spam text, should I exclude its similarity with itself when computing the average? Would including it cause data leakage?

If there are n spam texts, I represent the similarity value of ham_1 as average(ham_1~spam_1, ham_1~spam_2, ..., ham_1~spam_n)

For a spam text such as spam_5, which of these should I use?

Including itself: similarity value = average(spam_5~spam_1, spam_5~spam_2, ..., spam_5~spam_5, ..., spam_5~spam_n)

Excluding itself: similarity value = average(spam_5~spam_1, spam_5~spam_2, ..., ~~spam_5~spam_5~~, ..., spam_5~spam_n)
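The two candidates can be sketched side by side as follows, assuming NumPy. The helper `mean_spam_similarity`, its `exclude_self` flag, and the toy matrix are hypothetical names and data used only for illustration. The sketch shows how the two values differ; it does not settle which one is appropriate.

```python
import numpy as np

def mean_spam_similarity(sim, is_spam, exclude_self=False):
    """Average similarity of each text to the spam texts.

    Hypothetical helper: with exclude_self=True, a spam text's own
    diagonal entry is dropped from both the sum and the count.
    If there is only one spam text, its excluded average is undefined (NaN).
    """
    sim = np.asarray(sim, dtype=float)
    is_spam = np.asarray(is_spam, dtype=bool)

    # Sum of similarities to all spam texts, and how many terms were summed.
    totals = sim[:, is_spam].sum(axis=1)
    counts = np.full(sim.shape[0], is_spam.sum(), dtype=float)

    if exclude_self:
        # Remove the self-similarity term for spam rows only.
        totals = totals - np.where(is_spam, np.diag(sim), 0.0)
        counts = counts - is_spam.astype(float)

    return totals / counts

# Hypothetical toy data: 4 texts, texts 1 and 3 are spam.
sim = np.array([
    [1.0, 0.2, 0.6, 0.1],
    [0.2, 1.0, 0.3, 0.8],
    [0.6, 0.3, 1.0, 0.4],
    [0.1, 0.8, 0.4, 1.0],
])
is_spam = np.array([False, True, False, True])

with_self = mean_spam_similarity(sim, is_spam)
without_self = mean_spam_similarity(sim, is_spam, exclude_self=True)
```

In this toy example the ham rows are identical in both versions; only the spam rows change, because the diagonal term of 1 is either kept in or dropped from their averages.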

edited Apr 09 '20 at 02:03


yshi50


0 Answers