I am analyzing the pairwise correlation of words that appear in user reviews, with the goal of plotting the result as a correlation network graph.
My sample data is as follows:
review_corwords
     Label Rating     word
1        1      1  connect
1.1      1      1      gps
1.2      1      1      app
1.3      1      1  connect
1.4      1      1      gps
1.5      1      1   matter
1.6      1      1     long
1.7      1      1      gps
1.8      1      1      set
1.9      1      1     high
1.10     1      1 accuracy
1.11     1      1  setting
1.12     1      1   appear
1.13     1      1      set
1.14     1      1      app
1.15     1      1  useless
1.16     1      1     cant
1.17     1      1    track
1.18     1      1  workout
2        1      5     wish
2.1      1      5    would
2.2      1      5 interest
2.3      1      5   google
2.4      1      5  provide
2.5      1      5   weekly
2.6      1      5  monthly
2.7      1      5  summary
3        1      1  useless
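For anyone who wants to reproduce this, the sample above can be rebuilt roughly as follows. This is a base R sketch: the row names (1, 1.1, 1.2, ...) are not recreated and are assumed to be cosmetic, and the column types are a guess based on the printed output. Note that Label is 1 on every row of this sample.

```r
# Rebuild the 28 sample rows shown above (default row names 1..28 are used
# instead of the original 1, 1.1, 1.2, ... row names).
review_corwords <- data.frame(
  Label  = rep(1, 28),
  Rating = c(rep(1, 19), rep(5, 8), 1),
  word   = c("connect", "gps", "app", "connect", "gps", "matter", "long",
             "gps", "set", "high", "accuracy", "setting", "appear", "set",
             "app", "useless", "cant", "track", "workout",
             "wish", "would", "interest", "google", "provide", "weekly",
             "monthly", "summary",
             "useless"),
  stringsAsFactors = FALSE
)

# Quick checks on the shape of the sample
print(dim(review_corwords))
print(unique(review_corwords$Label))
```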
I then count the word pairs that co-occur within a Label:
library(dplyr)
library(widyr)
# count words co-occurring within a label
word_pairs <- review_corwords %>%
  pairwise_count(word, Label, sort = TRUE)
which produces the following output:
# A tibble: 16,333,722 x 3
item1 item2 n
<chr> <chr> <dbl>
1 gps connect 1
2 app connect 1
3 matter connect 1
4 long connect 1
5 set connect 1
However, when I compute pairwise correlations on the same data, every correlation comes back as NaN:
word_cors <- review_corwords %>%
  group_by(word) %>%
  pairwise_cor(word, Label, sort = TRUE)
# A tibble: 16,333,722 x 3
item1 item2 correlation
<chr> <chr> <dbl>
1 gps connect NaN
2 app connect NaN
3 matter connect NaN
4 long connect NaN
5 set connect NaN
6 high connect NaN
How can I get valid correlation values for these word pairs? Any help is appreciated.
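One possible explanation, offered as a sketch and not a confirmed diagnosis: in the sample shown, Label is 1 on every row, so each word's presence/absence vector across Labels has only one observation. A Pearson correlation computed from a single observation divides zero by zero. The base R below illustrates that arithmetic on a tiny presence/absence matrix. It is not widyr's actual code, and the names `m`, `covmat`, and `cormat` are illustrative only.

```r
# Presence/absence matrix: one row per feature (Label), one column per word.
# With a single Label value there is only one row, and every word is present.
words <- c("connect", "gps", "app")
m <- matrix(1, nrow = 1, ncol = length(words), dimnames = list("1", words))

n <- nrow(m)  # number of distinct Labels; here 1

# Sample covariance between word columns: the numerator is 0 and the
# denominator (n - 1) is also 0, so every entry is 0 / 0.
covmat <- (crossprod(m) - n * tcrossprod(colMeans(m))) / (n - 1)

# Pearson correlation: covariance scaled by the product of standard deviations.
cormat <- covmat / tcrossprod(sqrt(diag(covmat)))

print(cormat)
```

If this is indeed the cause, passing a column that identifies each individual review (a hypothetical `review_id`, which does not exist in the data shown) as the feature argument in place of Label may give defined values. The `group_by(word)` step also looks unnecessary unless it is followed by a `filter()` on word frequency.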