My understanding of the pairwise_similarity function from the widyr package in R was that it compares every item to every other item.

So for example, if you had 3 text items:

  • Item 1 would be compared to item 2 and 3

  • Item 2 would be compared to item 1 and 3

  • Item 3 would be compared to item 1 and 2

However this does not seem to happen here:

Here is my data:

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "green and black"))

d

column_id     description
        1     red and yellow
        2     yellow and blue
        3     green and black   # notice how item 3 has no common words with the other two


# unnest the words and remove stop words

library(dplyr)  # attaches the %>% pipe used below

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

# complete pairwise similarity

d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 2 × 3

  item1 item2 similarity
    2     1      0.120
    1     2      0.120

Notice how item 3 is not compared to items 1 and 2? Why is this? If I add a word to item 3 that it shares with item 2, a few more comparisons appear, but still not all of them:

d <- data.frame(column_id=1:3, description= c("red and yellow", "yellow and blue", "blue and black"))

d


column_id     description
        1     red and yellow
        2     yellow and blue
        3     blue and black

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)



d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf)


d_similarity

# A tibble: 4 × 3
  item1 item2 similarity
    2     1      0.245
    1     2      0.245
    3     2      0.245   # 3 not compared to 1 at any point - why?
    2     3      0.245

Is my understanding of pairwise similarity lacking? Or is it the default behaviour that when two text chunks have no words in common, so that their similarity is zero, the row is omitted? Does anyone know if this could be the answer?
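
One way to check the hypothesis that the missing pairs simply have a similarity of zero is to compute the full cosine-similarity matrix by hand. The following is a minimal base-R sketch, not widyr's implementation. It assumes that tf is a word's count divided by the number of words in its item, that idf is the natural log of (number of items / number of items containing the word), and it uses a hand-built `docs` list (a hypothetical stand-in for the tokenised first example with the stop word "and" removed):

```r
# Hypothetical hand-built input: the first example after stop-word removal.
docs <- list(
  `1` = c("red", "yellow"),
  `2` = c("yellow", "blue"),
  `3` = c("green", "black")
)

vocab <- sort(unique(unlist(docs)))

# Term-frequency matrix: rows are items, columns are words.
tf <- t(vapply(
  docs,
  function(words) tabulate(match(words, vocab), nbins = length(vocab)) / length(words),
  numeric(length(vocab))
))
colnames(tf) <- vocab

# Assumed idf: ln(number of items / number of items containing the word).
idf <- log(length(docs) / colSums(tf > 0))

# tf-idf weights, then cosine similarity between every pair of rows.
tf_idf <- sweep(tf, 2, idf, `*`)
normed <- tf_idf / sqrt(rowSums(tf_idf^2))
cos_sim <- normed %*% t(normed)

round(cos_sim, 3)
```

Under these assumptions the entry for items 1 and 2 comes out at about 0.120, matching the output above, while every off-diagonal entry involving item 3 is exactly zero. These are the pairs that are missing from the tibble.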

fe108
  • Could you add the content of `d` to the second code chunk? That would clarify the question a little bit more. – KoenV Jun 22 '22 at 14:01

1 Answer

I was unable to find documentation for this behaviour.

It is not "similarity == 0" that makes the rows disappear. Words that are present in all items have idf = 0, hence their tf-idf is zero as well, but the items still share that word. So, if we add a "common" word, e.g. pink, to all three items:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black pink"))   ### here
d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)
(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 6 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.120
2     3     1      0    
3     1     2      0.120
4     3     2      0    
5     1     3      0    
6     2     3      0   
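
The claim that a word present in every item gets idf = 0 can be checked with a few lines of base R. This is only a sketch: it assumes idf is the natural log of (number of items / number of items containing the word), and `tokens` is a hand-built, hypothetical stand-in for the unnested data with stop words already removed.

```r
# Hypothetical stand-in for the tokenised data: one row per (item, word).
tokens <- data.frame(
  column_id = rep(1:3, each = 3),
  word = c("red", "yellow", "pink",
           "yellow", "blue", "pink",
           "green", "black", "pink")
)

n_docs <- length(unique(tokens$column_id))

# Number of distinct items each word appears in.
doc_freq <- tapply(tokens$column_id, tokens$word,
                   function(ids) length(unique(ids)))

# Assumed idf: ln(number of items / number of items containing the word).
idf <- log(n_docs / doc_freq)

round(idf, 3)
```

Here pink appears in all three items, so its idf is log(3 / 3) = 0 and it adds nothing to any similarity score. All three items still have a word in common, though, which is consistent with all six rows showing up above.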

If we replace the "common" pink with the "unique" brown in the third item only, so that item 3 again has no words in common with item 1 or item 2:

######################################################
######################################################
d <- data.frame(column_id = 1:3, 
                description = c("red and yellow pink", 
                                "yellow and blue pink", 
                                "green and black brown")) ### here

d_un_nest <- d %>%
  tidytext::unnest_tokens(output = "word",
                          input = "description",
                          token = "words") %>%
  dplyr::anti_join(tidytext::stop_words) %>%
  dplyr::count(column_id, word, sort = TRUE) %>%
  tidytext::bind_tf_idf(word, column_id, n)

(d_similarity <- widyr::pairwise_similarity(d_un_nest, column_id, word, tf_idf))

Gives:

# A tibble: 2 × 3
  item1 item2 similarity
  <int> <int>      <dbl>
1     2     1      0.214
2     1     2      0.214
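
If the full set of pairs is needed, one possible workaround is to build every ordered pair yourself and fill the missing ones with zero. The sketch below uses base R only and rests on the assumption, suggested by the experiments above but not confirmed by any documentation, that a missing pair means "no shared words" and therefore a cosine similarity of 0. The `d_similarity` values are copied from the output above; `all_pairs` and `full_similarity` are names introduced just for this illustration.

```r
# Similarity rows copied from the output above.
d_similarity <- data.frame(
  item1 = c(2L, 1L),
  item2 = c(1L, 2L),
  similarity = c(0.214, 0.214)
)

ids <- 1:3  # all column_id values in the original data

# Every ordered pair of distinct items.
all_pairs <- expand.grid(item1 = ids, item2 = ids)
all_pairs <- all_pairs[all_pairs$item1 != all_pairs$item2, ]

# Left join on item1 and item2: pairs absent from d_similarity get NA,
# which is then replaced by 0 (assumption: absent means no shared words).
full_similarity <- merge(all_pairs, d_similarity, all.x = TRUE)
full_similarity$similarity[is.na(full_similarity$similarity)] <- 0

full_similarity
```

This gives 3 x 2 = 6 rows, two with similarity 0.214 and four with similarity 0.
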
KoenV
  • Yes, I have also been struggling to find any documentation. So are you thinking that if a text chunk has no common words with the others, it is not compared? – fe108 Jun 22 '22 at 13:40
  • Yes, that is what I think now, but I am not sure. It would be nice to have this behavior documented. – KoenV Jun 22 '22 at 13:42
  • Yes, I agree. :) – fe108 Jun 22 '22 at 13:44
  • My original data has 92 rows of text. Using paste0, I put a common word at the start of each text chunk (my data is about risk, so I just used the word "risk"). When I ran the code again, I ended up with 92 x 91 = 8,372 rows, which is what I originally expected. This is further evidence that the function drops comparisons between items that have no common words. – fe108 Jun 22 '22 at 13:53
  • I agree. Nice experiment :-) – KoenV Jun 22 '22 at 13:59