My understanding of the pairwise_similarity function from the widyr package in R was that it compares every item with every other item.
So for example, if you had 3 text items:
Item 1 would be compared to item 2 and 3
Item 2 would be compared to item 1 and 3
Item 3 would be compared to item 1 and 2
However, this does not seem to happen here.
Here is my data:
d <- data.frame(column_id = 1:3,
                description = c("red and yellow", "yellow and blue", "green and black"))
d
column_id description
        1 red and yellow
        2 yellow and blue
        3 green and black  # notice how item 3 has no common words with the other two
# unnest the words and remove stop words
library(dplyr)  # provides the %>% pipe used below
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words, by = "word") %>%  # join on the word column explicitly
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
# complete pairwise similarity
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 2 × 3
  item1 item2 similarity
      2     1      0.120
      1     2      0.120
Notice how item 3 is not compared to items 1 and 2. Why is this? If I add a word to item 3 that it shares with item 2 ("blue"), a few more comparisons appear, but still not all of them:
d <- data.frame(column_id = 1:3,
                description = c("red and yellow", "yellow and blue", "blue and black"))
d
column_id description
        1 red and yellow
        2 yellow and blue
        3 blue and black
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words, by = "word") %>%  # join on the word column explicitly
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)
d_similarity
# A tibble: 4 × 3
  item1 item2 similarity
      2     1      0.245
      1     2      0.245
      3     2      0.245  # 3 not compared to 1 at any point - why?
      2     3      0.245
Is my understanding of pairwise similarity lacking? Or is it that, by default, when two text chunks have no words in common (so their similarity is zero), the row is omitted? Does anyone know if this could be the answer?
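To test that guess, here is a minimal base R sketch that recomputes the first example by hand. It is only my assumption of what the calculation amounts to (tf-idf weights followed by cosine similarity), not widyr's actual code, and the names docs, vocab, counts, cosine and all_pairs are made up for illustration. The stop word "and" is dropped by hand.

```r
# First example, with the stop word "and" already removed (hypothetical setup).
docs <- list(
  `1` = c("red", "yellow"),
  `2` = c("yellow", "blue"),
  `3` = c("green", "black")
)
vocab <- sort(unique(unlist(docs)))

# Document-term count matrix: one row per document, one column per word.
counts <- t(sapply(docs, function(w) tabulate(match(w, vocab), nbins = length(vocab))))
colnames(counts) <- vocab

# tf-idf: term frequency within each document times log inverse document frequency.
tf <- counts / rowSums(counts)
idf <- log(nrow(counts) / colSums(counts > 0))
tf_idf <- sweep(tf, 2, idf, `*`)

# Cosine similarity between every pair of documents.
norms <- sqrt(rowSums(tf_idf^2))
cosine <- tcrossprod(tf_idf) / outer(norms, norms)

# All six ordered pairs, keeping the ones whose similarity is zero.
all_pairs <- subset(
  expand.grid(item1 = names(docs), item2 = names(docs), stringsAsFactors = FALSE),
  item1 != item2
)
all_pairs$similarity <- cosine[cbind(all_pairs$item1, all_pairs$item2)]
all_pairs
```

Under these assumptions, the pair (1, 2) comes out at about 0.120, matching the tibble above, and every pair involving item 3 comes out at exactly 0. That is consistent with the zero-similarity rows simply being left out of the result rather than never being compared.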