0

I want to create a Document-feature matrix with tf_idf as weights. If I calculate the tf_idf like in https://quanteda.io/reference/dfm_tfidf.html, I get only zeros. The same if I try to get tf_idf with tidytext from the same token dataset. Looks to me like the information about the number of documents in the corpus cannot be calculated. If i usetidy text from scratch, it works.

harry_token <-harry_data %>% corpus() %>%  
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>% 
  tokens_remove( c(stopwords("english"))) %>%
  tokens_ngrams( n = 1) 


harry_token <-   tokens_replace(tokens(harry_token), pattern = lemma_en$token, replacement = lemma_en$lemma)



harry_token[1]
[1] "Boy"       "live"      "Mr"        "Mrs"       "Dursley"   "number"    "four"      "Privet"    "drive"    
[10] "proud"     "say"       "perfectly"
[ ... and 40,770 more ]



harry_token  %>%
    +   dfm(verbose = F)

Document-feature matrix of: 7 documents, 13,528 features (52.85% sparse) and 1 docvar.
       features
docs    boy live  mr mrs dursley number four privet drive proud
  text1  99   25  81  46     104     19   32     16    31     7
  text2  60   23 135  96      39      8   25      7    15     2
  text3  61   23  60  24      29     18   14      9    21     6
  text4 121   34 521 155      41     35   54     16    26     6
  text5 102   41 240 276      43     69   65     24    42    11
  text6 105   38 102 154      23     34   34      7    15    10
[ reached max_ndoc ... 1 more document, reached max_nfeat ... 13,518 more features ]

harry_token %>%
+   dfm(verbose = F)  %>% 
+   dfm_tfidf()
Document-feature matrix of: 7 documents, 13,528 features (52.85% sparse) and 1 docvar.
       features
docs    boy live mr mrs dursley number four privet drive proud
  text1   0    0  0   0       0      0    0      0     0     0
  text2   0    0  0   0       0      0    0      0     0     0
  text3   0    0  0   0       0      0    0      0     0     0
  text4   0    0  0   0       0      0    0      0     0     0
  text5   0    0  0   0       0      0    0      0     0     0
  text6   0    0  0   0       0      0    0      0     0     0
[ reached max_ndoc ... 1 more document, reached max_nfeat ... 13,518 more features ]




harry_dfm <- harry_token  %>%
              dfm(verbose = FALSE)

tidy(harry_dfm) %>% 
    bind_tf_idf(term = term, document = document, n = count)

# A tibble: 44,646 x 6
   document term    count       tf   idf tf_idf
   <chr>    <chr>   <dbl>    <dbl> <dbl>  <dbl>
 1 text1    boy        99 0.00243      0      0
 2 text1    live       25 0.000613     0      0
 3 text1    mr         81 0.00199      0      0
 4 text1    mrs        46 0.00113      0      0
 5 text1    dursley   104 0.00255      0      0
 6 text1    number     19 0.000466     0      0
 7 text1    four       32 0.000785     0      0
 8 text1    privet     16 0.000392     0      0
 9 text1    drive      31 0.000760     0      0
10 text1    proud       7 0.000172     0      0
# ... with 44,636 more rows

If I calculate tf_idf from scratch with tidytext ist works.

 harry_data %>% unnest_tokens(word, text) %>% group_by(title) %>%
+     count(word) %>%
+     bind_tf_idf(word, title, n) 
# A tibble: 67,881 x 6
# Groups:   title [7]
   title              word      n        tf   idf     tf_idf
   <chr>              <chr> <int>     <dbl> <dbl>      <dbl>
 1 Chamber of Secrets 0         1 0.0000117 0.847 0.00000992
 2 Chamber of Secrets 1         6 0.0000703 0.154 0.0000108 
 3 Chamber of Secrets 1,5       1 0.0000117 1.95  0.0000228 
 4 Chamber of Secrets 1,520     1 0.0000117 1.95  0.0000228 
 5 Chamber of Secrets 100       1 0.0000117 1.95  0.0000228 
 6 Chamber of Secrets 101       1 0.0000117 1.95  0.0000228 
 7 Chamber of Secrets 102       1 0.0000117 1.95  0.0000228 
 8 Chamber of Secrets 104       1 0.0000117 1.95  0.0000228 
 9 Chamber of Secrets 105       1 0.0000117 1.95  0.0000228 
10 Chamber of Secrets 106       1 0.0000117 1.95  0.0000228 
padul
  • 134
  • 11
  • 2
    Are you sure you have _all_ zeroes, and that you don't also get zeroes for words like "boy" and "live" with tidytext? When all your documents contain some word, the idf is zero and then tf-idf is zero. [Read more here](https://www.tidytextmining.com/tfidf.html#the-bind_tf_idf-function). Also [see here](https://stackoverflow.com/questions/48806699/tidytext-quanteda-and-tm-returning-different-tf-idf-scores/48816238) for some differences between quanteda and tidytext. – Julia Silge Mar 29 '22 at 23:27
  • 2
    "52.85% sparse" in the DFM output means 48% of cells are not zero. Try `topfeatures()` to see what are in the DFM. – Kohei Watanabe Mar 30 '22 at 00:00

0 Answers0