I'm just beginning to use R for text mining and have come across a problem.
I have successfully charted tf_idf for single words in my dataset, which includes 3 different categories (positive, negative, and bank) stored in a column named 'Box'.
I am trying to do the same for bigrams and trigrams, using the same code:
Trigram_tibble %>%
  arrange(desc(tf_idf)) %>%
  mutate(trigram = factor(trigram, levels = rev(unique(trigram)))) %>%
  group_by(Box) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(trigram, tf_idf, fill = Box)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Box, ncol = 2, scales = "free") +
  coord_flip()
I have discovered (I think) that the 'top_n' function returns the top-ranked trigrams, and that by default it ranks by the last variable in the tibble (in my case this is tf_idf, and I've chosen n = 10). However, when running this for bigrams, I can only produce charts that seem to have several hundred (thousand??) bigrams along the y axis.
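To check my understanding of which column 'top_n' ranks by, here is a minimal sketch on made-up data (the `toy` tibble and its values are hypothetical, not my survey data). As far as I can tell, leaving out the weighting column makes 'top_n' fall back to the last column and print a "Selecting by ..." message, while naming the column (as my code above does) ranks by that column.

```r
library(dplyr)

# Hypothetical toy data: tf_idf is the last column, n is another numeric column.
toy <- tibble(
  id     = c("a", "b", "c", "d"),
  n      = c(4L, 3L, 2L, 1L),
  tf_idf = c(0.1, 0.4, 0.2, 0.3)
)

# No weighting column given: top_n() ranks by the last column (tf_idf).
by_last_col <- toy %>% top_n(2)

# Weighting column named explicitly: same result as above.
by_tf_idf <- toy %>% top_n(2, tf_idf)

# A different weighting column gives a different selection.
by_n <- toy %>% top_n(2, n)
```

So in my pipeline the ranking column should not be the issue, since tf_idf is passed explicitly.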
In the picture you can see that the negative variable seems fine (I've redacted it for data protection), but the other two are not!
I took this code from the tidy text mining book originally.
EDIT - ADDING A SAMPLE OF THE DATA
My best guess now is that many of the tf_idf scores picked up by 'top_n' are exactly the same, so the ties are all kept. If so, I'm no longer sure this is a useful calculation, and I'm wondering why it worked so well in the tidy text book but not for my data.
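A small sketch of the tie behaviour I suspect, again on invented data (the `toy` tibble, its group sizes, and its scores are assumptions for illustration only). It compares 'top_n' with dplyr's `slice_max(with_ties = FALSE)`, which I understand is available from dplyr 1.0.0 and returns at most n rows per group.

```r
library(dplyr)

# Hypothetical toy data: in "Negative", 25 trigrams share the top tf_idf value;
# in "Bank", every tf_idf value is distinct.
toy <- tibble(
  Box     = rep(c("Negative", "Bank"), each = 30),
  trigram = paste("trigram", 1:60),
  tf_idf  = c(rep(0.0008, 25), rep(0.0004, 5),
              seq(0.003, 0.0001, length.out = 30))
)

# top_n() keeps every row tied with the 10th-ranked value,
# so a group can end up with far more than 10 rows.
kept_top_n <- toy %>%
  group_by(Box) %>%
  top_n(10, tf_idf) %>%
  ungroup()

# slice_max(with_ties = FALSE) breaks ties by row order
# and keeps exactly 10 rows per group here.
kept_slice <- toy %>%
  group_by(Box) %>%
  slice_max(tf_idf, n = 10, with_ties = FALSE) %>%
  ungroup()

count(kept_top_n, Box)
count(kept_slice, Box)
```

With this toy data the tied group keeps all 25 rows under 'top_n', which would match the overcrowded axis in my charts. Dropping ties fixes the row count, but which tied trigrams survive is then arbitrary.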
EDIT 2
I cut down the Trigram_tibble to 50 observations and this is the output of dput(Trigram_tibble) (I've obscured the survey response text trigrams)
a<-Trigram_tibble [1:50, 1:8] dput(a) structure(list(Respondent = c(1294L, 2693L, 42L, 463L, 463L, 1481L, 1706L, 1891L, 1917L, 2442L, 2693L, 3590L, 3590L, 3916L, 4454L, 4682L, 5996L, 6283L, 6283L, 6568L, 9101L, 2L, 3L, 4L, 4L, 4L, 8L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 18L, 18L, 18L, 18L, 20L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L ), Box = c("Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Positive", "Negative", "Negative", "Negative", "Bank", "Bank", "Negative", "Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Bank", "Negative", "Negative", "Negative", "Negative", "Negative", "Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Bank", "Bank", "Negative", "Negative", "Negative", "Negative", "Negative"), trigram = c("xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx"), n = c(4L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), total = c(1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 1714L, 2899L, 2899L, 2899L, 836L, 836L, 
2899L, 1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 836L, 2899L, 2899L, 2899L, 2899L, 2899L, 1714L, 1714L, 1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 836L, 836L, 2899L, 2899L, 2899L, 2899L, 2899L), tf = c(0.00233372228704784, 0.00103483959986202, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00239234449760766, 0.00116686114352392, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00239234449760766, 0.00239234449760766, 0.000689893066574681, 0.00116686114352392, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00119617224880383, 0.00119617224880383, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00058343057176196, 0.00058343057176196, 0.00058343057176196, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00119617224880383, 0.00119617224880383, 0.00119617224880383, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734), idf = c(-2.07944154167984, 0.405465108108164, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 0, 1.09861228866811, 1.09861228866811, -2.07944154167984, 1.09861228866811, 0.405465108108164, 1.09861228866811, -0.693147180559945, 1.09861228866811, 1.09861228866811, 0, 1.09861228866811, 1.09861228866811, -1.29928298413026, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 
1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811), tf_idf = c(-0.00485283907043135, 0.000419591350232664, 0.000757925000805871, 0.000757925000805871, 0.000757925000805871, 0.000757925000805871, 0, 0.0012819279914447, 0.000757925000805871, -0.00143459230195228, 0.000757925000805871, 0.00097001222035446, 0.00262825906379931, -0.000478197433984095, 0.0012819279914447, 0.000757925000805871, 0, 0.000757925000805871, 0.000757925000805871, -0.000896366322269928, 0.000757925000805871, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.00131412953189965, 0.00131412953189965, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000640963995722351, 0.000640963995722351, 0.000640963995722351, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.00131412953189965, 0.00131412953189965, 0.00131412953189965, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))