
Using a document-term matrix (dtm) I can get the term frequencies.

Is there an easy way to calculate the entropy weighting? It gives higher weight to terms that are concentrated in fewer documents.

entropy = 1 + Σj (pij log2(pij)) / log2(n)

pij = tfij / Σj tfij

tfij is the number of times word i occurs in document j, and n is the number of documents.
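For intuition, the formula can be checked by hand on a toy matrix (the counts below are made up for illustration): a term spread evenly across all documents gets weight 0, while a term concentrated in a single document gets weight 1.

```r
# Hypothetical 2-term x 3-document count matrix (rows = terms, columns = documents)
tf <- matrix(c(1, 1, 1,   # term "a": uniform across documents
               3, 0, 0),  # term "b": concentrated in one document
             nrow = 2, byrow = TRUE,
             dimnames = list(c("a", "b"), c("d1", "d2", "d3")))

# p_ij = tf_ij / sum_j tf_ij: share of term i's occurrences falling in document j
p <- tf / rowSums(tf)

# entropy weight: 1 + sum_j p_ij * log2(p_ij) / log2(n);
# na.rm = TRUE treats the NaN terms from 0 * log2(0) as 0
n <- ncol(tf)
w <- 1 + rowSums(p * log2(p) / log2(n), na.rm = TRUE)
w
# a b
# 0 1
```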


1 Answer


Here's a function for doing that, although it could be improved by maintaining sparsity in the p_ij and log computations (this is how dfm_tfidf() is implemented, for instance). Note that I changed the formula slightly, since (according to https://en.wikipedia.org/wiki/Latent_semantic_analysis#Mathematics_of_LSI among other sources) there should be a minus in front of the sum.

library("quanteda")
textstat_entropy <- function(x, base = exp(1), k = 1) {
    # p_ij: share of each feature's total count occurring in each document;
    # this works because of R's recycling and column-major order, but requires t()
    p_ij <- t(t(x) / colSums(x))

    # entropy weight per feature: k - sum over documents of p_ij * log(p_ij) / log(ndoc);
    # na.rm = TRUE drops the NaN terms arising from 0 * log(0)
    log_p_ij <- log(p_ij, base = base)
    k - colSums(p_ij * log_p_ij / log(ndoc(x), base = base), na.rm = TRUE)
}

textstat_entropy(data_dfm_lbgexample, base = 2)
#        A        B        C        D        E        F        G        H        I        J        K 
# 1.000000 1.000000 1.000000 1.000000 1.000000 1.045226 1.045825 1.117210 1.173655 1.277210 1.378934 
#        L        M        N        O        P        Q        R        S        T        U        V 
# 1.420161 1.428939 1.419813 1.423840 1.436201 1.440159 1.429964 1.417279 1.410566 1.401663 1.366412 
#        W        X        Y        Z       ZA       ZB       ZC       ZD       ZE       ZF       ZG 
# 1.302785 1.279927 1.277210 1.287621 1.280435 1.211205 1.143650 1.092113 1.045825 1.045226 1.000000 
#        ZH       ZI       ZJ       ZK 
# 1.000000 1.000000 1.000000 1.000000 

This matches the weight function in the lsa package, when the base is e:

library("lsa")
all.equal(
    gw_entropy(as.matrix(t(data_dfm_lbgexample))),
    textstat_entropy(data_dfm_lbgexample, base = exp(1))
)
# [1] TRUE