
I'm just beginning to use R for text mining and have come across a problem.

I have successfully charted tf_idf for single words in my dataset, which is split into 3 categories (Positive, Negative, and Bank) held in a column named 'Box'.

I am trying to do the same for bigrams and trigrams, using the same code:

Trigram_tibble %>%
  arrange(desc(tf_idf)) %>%
  mutate(trigram = factor(trigram, levels = rev(unique(trigram)))) %>% 
  group_by(Box) %>% 
  top_n(10, tf_idf) %>% 
  ungroup %>%
  ggplot(aes(trigram, tf_idf, fill = Box)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~Box, ncol = 2, scales = "free") +
  coord_flip()

I have discovered (I think) that the 'top_n' function returns the trigrams with the top ranking, and that by default it uses the last variable in the tibble (in my case this is tf_idf, and I've chosen n = 10). However, when running this for bigrams I am only able to produce charts that seem to have several hundred (or even several thousand) bigrams along the y axis.

tf-idf grouped by variable

In the picture you can see that the negative variable seems fine (I've redacted it for data protection), but the other two are not!

I took this code from the tidy text mining book originally.

EDIT - ADDING A SAMPLE OF THE DATA

sample data

My best guess now is that the 'top_n' tf_idf scores happen to have many that are exactly the same. In which case I'm now not sure this is a useful calculation and I'm wondering why it happened to work so well in the tidy text book, but not for my data.
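A minimal sketch of that tie behaviour, using made-up data (the `toy` tibble and its values are hypothetical, not taken from my survey). `top_n()` keeps every row tied with the n-th value, so a group can come back with far more than n rows. `slice_max()` with `with_ties = FALSE` is one possible way to cap each group at exactly n rows, assuming dplyr 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical toy data: within each Box, most rows share the same tf_idf value
toy <- tibble(
  Box     = rep(c("Positive", "Negative"), each = 6),
  trigram = paste("trigram", 1:12),
  tf_idf  = c(0.003, rep(0.001, 5), 0.002, rep(0.0004, 5))
)

# top_n() keeps all rows tied with the n-th ranked value,
# so asking for 3 rows per group returns all 6 rows per group here
with_ties <- toy %>%
  group_by(Box) %>%
  top_n(3, tf_idf) %>%
  ungroup()

# slice_max(with_ties = FALSE) breaks ties by row order
# and returns exactly 3 rows per group
exactly_n <- toy %>%
  group_by(Box) %>%
  slice_max(tf_idf, n = 3, with_ties = FALSE) %>%
  ungroup()

count(with_ties, Box)
count(exactly_n, Box)
```

Which tied trigrams survive with `with_ties = FALSE` depends only on row order, so the resulting chart would show an arbitrary selection among equally scored trigrams.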

EDIT 2

I cut down the Trigram_tibble to 50 observations and this is the output of dput(Trigram_tibble) (I've obscured the survey response text trigrams)

a<-Trigram_tibble [1:50, 1:8] dput(a) structure(list(Respondent = c(1294L, 2693L, 42L, 463L, 463L, 1481L, 1706L, 1891L, 1917L, 2442L, 2693L, 3590L, 3590L, 3916L, 4454L, 4682L, 5996L, 6283L, 6283L, 6568L, 9101L, 2L, 3L, 4L, 4L, 4L, 8L, 12L, 12L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 18L, 18L, 18L, 18L, 20L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L ), Box = c("Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Positive", "Negative", "Negative", "Negative", "Bank", "Bank", "Negative", "Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Bank", "Negative", "Negative", "Negative", "Negative", "Negative", "Positive", "Positive", "Positive", "Negative", "Negative", "Negative", "Negative", "Negative", "Bank", "Bank", "Bank", "Negative", "Negative", "Negative", "Negative", "Negative"), trigram = c("xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx", "xxx xxx xxx"), n = c(4L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), total = c(1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 1714L, 2899L, 2899L, 2899L, 836L, 836L, 
2899L, 1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 836L, 2899L, 2899L, 2899L, 2899L, 2899L, 1714L, 1714L, 1714L, 2899L, 2899L, 2899L, 2899L, 2899L, 836L, 836L, 836L, 2899L, 2899L, 2899L, 2899L, 2899L), tf = c(0.00233372228704784, 0.00103483959986202, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00239234449760766, 0.00116686114352392, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00239234449760766, 0.00239234449760766, 0.000689893066574681, 0.00116686114352392, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.000689893066574681, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00119617224880383, 0.00119617224880383, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00058343057176196, 0.00058343057176196, 0.00058343057176196, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00119617224880383, 0.00119617224880383, 0.00119617224880383, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734, 0.00034494653328734), idf = c(-2.07944154167984, 0.405465108108164, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 0, 1.09861228866811, 1.09861228866811, -2.07944154167984, 1.09861228866811, 0.405465108108164, 1.09861228866811, -0.693147180559945, 1.09861228866811, 1.09861228866811, 0, 1.09861228866811, 1.09861228866811, -1.29928298413026, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 
1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811, 1.09861228866811), tf_idf = c(-0.00485283907043135, 0.000419591350232664, 0.000757925000805871, 0.000757925000805871, 0.000757925000805871, 0.000757925000805871, 0, 0.0012819279914447, 0.000757925000805871, -0.00143459230195228, 0.000757925000805871, 0.00097001222035446, 0.00262825906379931, -0.000478197433984095, 0.0012819279914447, 0.000757925000805871, 0, 0.000757925000805871, 0.000757925000805871, -0.000896366322269928, 0.000757925000805871, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.00131412953189965, 0.00131412953189965, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000640963995722351, 0.000640963995722351, 0.000640963995722351, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.00131412953189965, 0.00131412953189965, 0.00131412953189965, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935, 0.000378962500402935)), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
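One more thing visible in this sample: the largest idf value, 1.0986, equals log(3), which would fit three documents (the three Box values), yet some idf values are negative. A negative idf can only arise if a trigram is counted in more rows than there are documents, which suggests (this is an assumption, I have not confirmed it) that the tibble still has one row per Respondent rather than one row per Box and trigram. The sketch below uses hypothetical toy data and computes idf by hand as log(documents / rows containing the term) to show the effect. It is not the tidytext implementation itself.

```r
library(dplyr)

# Hypothetical toy data with a Respondent column, mimicking the structure above
toy <- tibble(
  Respondent = c(1L, 2L, 3L, 4L, 5L),
  Box        = c("Negative", "Negative", "Negative", "Bank", "Positive"),
  trigram    = c("a b c", "a b c", "d e f", "a b c", "a b c")
)

# Treat each distinct Box value as one document
n_docs <- n_distinct(toy$Box)

# One row per respondent: "a b c" appears in 4 rows but there are only
# 3 documents, so log(3 / 4) is negative
idf_dup <- toy %>%
  group_by(trigram) %>%
  mutate(idf = log(n_docs / n())) %>%
  ungroup()

# One row per Box-trigram pair: a term can appear in at most n_docs rows,
# so idf can no longer drop below zero
idf_ok <- toy %>%
  distinct(Box, trigram) %>%
  group_by(trigram) %>%
  mutate(idf = log(n_docs / n())) %>%
  ungroup()

idf_dup
idf_ok
```

If that is what is happening, counting by Box and trigram only (dropping Respondent) before calculating tf-idf would also change which scores are tied.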

  • It would be very helpful if you could include a small amount of sample data. – WaltS May 22 '18 at 12:28
  • Hi, I'm trying to add a sample; however, the characters won't fit. The data frame has 8 columns: Respondent, Box, trigram, n, total, tf, idf, tf_idf – Jennimh May 22 '18 at 13:49
  • Can you modify your data to remove any personal information then post the output of `dput(Trigram_tibble)`? – Tung May 22 '18 at 20:28
  • ok i've added it. had to add as a picture because the formatting didn't show properly in text. – Jennimh May 23 '18 at 10:54
  • i have edited to add the output of dput(Trigram_tibble), using only 50 observations as otherwise it was incredibly long – Jennimh May 23 '18 at 12:21
