pairwise_similarity in R: not all rows compared to all others

Question

My understanding of the pairwise_similarity function in R was that it compared every item to every other.

So for example, if you had 3 text items:

Item 1 would be compared to item 2 and 3
Item 2 would be compared to item 1 and 3
Item 3 would be compared to item 1 and 2

However, this does not seem to happen here:

Here is my data:

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))

d

column_id     description
        1     red and yellow
        2     yellow and blue
        3     green and black   # notice how item 3 has no common words with the other two


library(dplyr)  # provides the %>% pipe used below

# unnest the words and remove stop words

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

# compute pairwise similarity

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 2 × 3

  item1 item2 similarity
    2     1      0.120
    1     2      0.120

Notice how item 3 is not compared to items 1 and 2? Why is this? If I add a word to item 3 that it shares with item 2 ("blue"), a few more comparisons appear, but still not all of them:

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))

d


column_id     description
        1     red and yellow
        2     yellow and blue
        3     blue and black

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)



d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 4 × 3
  item1 item2 similarity
    2     1      0.245
    1     2      0.245
    3     2      0.245   # 3 not compared to 1 at any point - why?
    2     3      0.245

Is my understanding of pairwise similarity lacking? Or is it that, by default, when two text chunks have no words in common (so their similarity is zero), the row is omitted? Does anyone know if this could be the answer?

Could you add the content of `d` in the second code chunk. That would clarify the question a little bit more. — KoenV, Jun 22 '22 at 14:01

KoenV · Answer 1 · 2022-06-22T13:48:11.607

I was unable to find documentation for this.

It is not "similarity == 0" that makes the rows disappear. A word that is present in every item has idf = 0, so its tf-idf is zero as well, yet sharing such a word is enough for a pair to be reported. So, if we add a "common" word, e.g. pink, to all three items:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black pink"))   ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 6 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.120
2     3     1      0    
3     1     2      0.120
4     3     2      0    
5     1     3      0    
6     2     3      0
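The zero weight for pink can be checked by hand. The following is a minimal base R sketch, assuming idf is defined as the natural log of (number of items / number of items containing the word), which is my understanding of what bind_tf_idf computes. The `docs` list is a hand-written stand-in for the tokenized data, not output from tidytext.

```r
# Words per item after stop-word removal, mirroring the "pink" example above.
docs <- list(
  "1" = c("red", "yellow", "pink"),
  "2" = c("yellow", "blue", "pink"),
  "3" = c("green", "black", "pink")
)

n_docs <- length(docs)

# Count how many items contain each word (each word counted once per item).
words <- unlist(lapply(docs, unique))
doc_freq <- vapply(unique(words), function(w) sum(words == w), numeric(1))

# Assumed definition: idf = ln(number of items / number of items containing the word).
idf <- log(n_docs / doc_freq)

idf[["pink"]]    # 0, since log(3 / 3) = 0
idf[["yellow"]]  # log(3 / 2), about 0.405
```

Because pink contributes a weight of zero to every item, it cannot change any similarity value. The only thing it changes is that every pair of items now shares a word, which seems to be what makes the zero-similarity rows appear.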

If we replace the "common" pink with the "unique" brown, such that the 3rd item has no common words with item 1 or item 2:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black brown")) ### here

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 2 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.214
2     1     2      0.214
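If every pair is needed, including those with similarity 0, one option is to compute cosine similarity on a dense matrix. This is a base R sketch under assumptions, not widyr's implementation: it uses the three descriptions from the question (stop words already removed by hand), the tf-idf definition tf = count / words in item and idf = ln(items / items containing the word), and hypothetical variable names throughout.

```r
# Words per item from the question's first example, stop words removed.
docs <- list(
  "1" = c("red", "yellow"),
  "2" = c("yellow", "blue"),
  "3" = c("green", "black")
)
vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: one row per item, one column per word.
tf <- t(vapply(
  docs,
  function(w) as.numeric(table(factor(w, levels = vocab))) / length(w),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Weight each column by its idf to get tf-idf.
idf <- log(length(docs) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# Cosine similarity: scale each row to unit length, then take dot products.
normed <- tfidf / sqrt(rowSums(tfidf^2))
sim <- normed %*% t(normed)

# Long format with every ordered pair of distinct items, zeros included.
pairs <- expand.grid(item1 = rownames(sim), item2 = rownames(sim),
                     stringsAsFactors = FALSE)
pairs <- pairs[pairs$item1 != pairs$item2, ]
pairs$similarity <- sim[cbind(pairs$item1, pairs$item2)]
pairs
```

This returns all 3 x 2 = 6 ordered pairs. The value for items 1 and 2 rounds to 0.120, matching the output in the question, and every pair involving item 3 has similarity 0.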

Yes, I have also been struggling to find any documentation. So are you thinking that if a text chunk has no words in common with the others, it is not compared? — fe108, Jun 22 '22 at 13:40
Yes, that is what I think now, but I am not sure. It would be nice to have this behavior documented. — KoenV, Jun 22 '22 at 13:42
My original data has 92 rows of text. Using paste0, I put a common word at the start of each text chunk (my data is about risk, so I just used the word "risk"). When I ran the code again, I ended up with 92 x 91 = 8,372 rows, which is what I expected originally. This is further evidence that the function drops comparisons between items that have no words in common. — fe108, Jun 22 '22 at 13:53
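An alternative to prepending a dummy word is to fill in the missing pairs after the fact. The sketch below assumes the dropped pairs are exactly those with no shared words, so their cosine similarity is 0. The `d_similarity` data frame here is a hand-typed stand-in for the widyr result shown in the question, and `ids`, `all_pairs` and `full` are hypothetical names.

```r
# Hand-typed stand-in for the pairwise_similarity() result in the question.
d_similarity <- data.frame(
  item1 = c(2L, 1L),
  item2 = c(1L, 2L),
  similarity = c(0.120, 0.120)
)

# All item ids that should appear, taken from the original data.
ids <- 1:3

# Every ordered pair of distinct items.
all_pairs <- expand.grid(item1 = ids, item2 = ids)
all_pairs <- all_pairs[all_pairs$item1 != all_pairs$item2, ]

# Left join onto the full grid, then treat missing pairs as similarity 0.
full <- merge(all_pairs, d_similarity, by = c("item1", "item2"), all.x = TRUE)
full$similarity[is.na(full$similarity)] <- 0
full
```

With 3 items this gives 6 rows, and with 92 items it would give the expected 92 x 91 rows, without altering the text that goes into the similarity calculation.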
