
I want to improve my model by adding a new feature column to my data, which consists of ham and spam texts. I have already computed the square cosine similarity matrix between all the texts; its diagonal entries are all 1 = cos(0).

I extract the indices of all spam texts in the training data and create a similarity column: for each text, the cell holds the average of its individual similarities to all the spam texts.
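A minimal sketch of the column described above, assuming the similarity matrix is a NumPy array `S` and the spam labels are a boolean mask `is_spam` (both names, the helper `mean_similarity_to_spam`, and the toy values are hypothetical, not taken from my actual data):

```python
import numpy as np

def mean_similarity_to_spam(S, is_spam):
    """Average each row's similarity over the spam columns (self-term included)."""
    S = np.asarray(S, dtype=float)
    is_spam = np.asarray(is_spam, dtype=bool)
    # Keep only the columns that belong to spam texts, then average per row.
    return S[:, is_spam].mean(axis=1)

# Toy 3x3 cosine similarity matrix: text 0 is ham, texts 1 and 2 are spam.
S = np.array([
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
])
is_spam = np.array([False, True, True])

feature = mean_similarity_to_spam(S, is_spam)
print(feature)
```

In this version every spam row also averages in its own diagonal entry of 1, while ham rows never do.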

My question: for a ham text, the above makes sense. But for a spam text, should I exclude its similarity with itself when computing the average? Would including it cause data leakage?

If we have a sample of n texts, I represent the similarity value of ham_1 as average(ham_1~spam_1, ham_1~spam_2, ..., ham_1~spam_n)

My question is:

For spam text spam_5, similarity value = average(spam_5~spam_1, spam_5~spam_2, ..., spam_5~spam_5, ..., spam_5~spam_n), which includes spam_5~spam_5 = 1

Or

For spam text spam_5, similarity value = average(spam_5~spam_1, spam_5~spam_2, ..., spam_5~spam_4, spam_5~spam_6, ..., spam_5~spam_n), which excludes spam_5~spam_5
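The two options can be compared side by side with a small sketch. It assumes a NumPy similarity matrix `S` and a boolean spam mask `is_spam`; those names, the helper `spam_similarity_feature`, its `exclude_self` flag, and the toy values are all hypothetical and only illustrate the arithmetic:

```python
import numpy as np

def spam_similarity_feature(S, is_spam, exclude_self=True):
    """Mean similarity of each text to the spam texts.

    With exclude_self=True, a spam text's similarity to itself (the
    diagonal entry) is left out of its own average.
    """
    S = np.asarray(S, dtype=float)
    is_spam = np.asarray(is_spam, dtype=bool)

    sums = S[:, is_spam].sum(axis=1)
    counts = np.full(S.shape[0], float(is_spam.sum()))

    if exclude_self:
        # Only spam rows contain their own self-similarity among the spam columns.
        sums = sums - np.where(is_spam, np.diag(S), 0.0)
        counts = counts - is_spam.astype(float)

    # Rows with nothing left to average (e.g. a single spam text) become NaN.
    out = np.full(S.shape[0], np.nan)
    return np.divide(sums, counts, out=out, where=counts > 0)

# Toy 3x3 cosine similarity matrix: text 0 is ham, texts 1 and 2 are spam.
S = np.array([
    [1.0, 0.2, 0.4],
    [0.2, 1.0, 0.6],
    [0.4, 0.6, 1.0],
])
is_spam = np.array([False, True, True])

with_self = spam_similarity_feature(S, is_spam, exclude_self=False)
without_self = spam_similarity_feature(S, is_spam, exclude_self=True)
print(with_self)
print(without_self)
```

In the first option the self-term adds 1 to the sum for every spam row and nothing for any ham row, so the two variants differ only on the spam rows.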

yshi50
