
The Tsallis entropy of a discrete variable is defined by:

H[p, q] = 1/(q - 1) * (1 - sum(p_i^q))

The Tsallis entropy of a continuous variable is defined by:

H[p, q] = 1/(q - 1) * (1 - int(p(x)^q dx))

where p(x) is the probability density function of the data and int denotes the integral.

I'm trying to implement the Tsallis entropy in R.

Suppose I have the following data (generated from a beta distribution, but let's treat the distribution as unknown):

set.seed(567)
mystring <- round(rbeta(500, 2, 4), 2)  # 500 draws from Beta(2, 4), rounded to 2 decimal places

The Tsallis entropy for a discrete variable would be:

freqs <- table(mystring) / 500  # relative frequency of each observed value
q <- 3
H1 <- 1/(q - 1) * (1 - sum(freqs^q))
H1
# [1] 0.4998426

I now want to compute the Tsallis entropy for a continuous variable:

library(sfsmisc)
PDF <- density(mystring)  # kernel density estimate of the unknown PDF
xPDF <- PDF$x
yPDF <- PDF$y
H1 <- 1/(q - 1) * (1 - integrate.xy(xPDF, yPDF^q))  # numerical integration over the KDE grid
H1
# [1] -0.6997353

As I would expect, the two results are different. But why so different? And my primary question: is the code for calculating the Tsallis entropy of a continuous variable correct? Please remember that I'm assuming the distribution is unknown.

Tommaso

1 Answer


So first of all, this is a statistics question. I encourage you to ask it on stats.stackexchange.com, where you are likely to get a much better answer.

Having said that, why do you assume the values should be the same? You are taking a random sample of size n (n = 500) from a beta distribution and discretizing it by calculating the fraction of observations in each of k bins of width dx (here, dx = 0.01 and k ~ 100). In general, the fraction in each bin will depend on k, as

p_i = p_i^0 / k

where p_i^0 is the vector of probabilities at some baseline k = k_0. In other words, the more (smaller) bins you have, the fewer observations per bin. You can see that by plotting histograms with varying k (using breaks = k):

par(mfrow = c(1, 3))
hist(mystring, breaks = 10,  ylim = c(0, 100))
hist(mystring, breaks = 50,  ylim = c(0, 100))
hist(mystring, breaks = 100, ylim = c(0, 100))

Your freqs vector is frequency/500, but the effect of k is the same. The number of bins is of course equal to k, so

sum(p_i) = 1

independent of k. But in the calculation of Tsallis entropy you are not summing p_i, you are summing p_i^q (in your case q = 3). So

sum(p_i^q) ~ sum([p_i^0 / k]^q) ~ (1 / k^q) * sum([p_i^0]^q)

Since you are summing k terms, when q = 1 the result will not depend on k, but for any other q the sum will depend on k. In other words, the Tsallis entropy calculated from a discretized continuous distribution will depend on the bin size used to discretize.
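To see this directly with your sample, here is a minimal sketch (re-binning with cut() is my choice, not part of your code) that recomputes the discrete Tsallis entropy at several bin counts:

set.seed(567)
x <- rbeta(500, 2, 4)
q <- 3
# discrete Tsallis entropy of the same sample for several bin counts k
for (k in c(10, 50, 100)) {
  p <- as.vector(table(cut(x, breaks = k))) / length(x)
  cat("k =", k, " sum(p^q) =", signif(sum(p^q), 3),
      " H =", signif(1/(q - 1) * (1 - sum(p^q)), 4), "\n")
}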

To make this concrete, consider a discretized U[0,1] with 10 bins. This is a vector of length 10 with all elements equal to 0.1. Using q = 3 as in your example:

q <- 3
k <- 10
p <- rep(1/k, k)
sum(p^q)
# [1] 0.01

Now consider the same thing with 100 bins. Here p is a vector of length 100 with all elements = 0.01.

k <- 100
p <- rep(1/k, k)
sum(p^q)
# [1] 1e-04

Finally, consider the continuous distribution. The pdf of U[0,1] is 1 on (0,1) and 0 elsewhere, so the integral is int(1^3 dx) = 1.

f <- function(x) dunif(x)^q
integrate(f, 0, 1)$value
# [1] 1
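For completeness (this arithmetic is mine, not in the original answer), plugging the three sums into the Tsallis formula shows the pattern: the discretized values creep toward 1/(q-1) = 0.5 as k grows, while the continuous answer for U[0,1] is exactly 0. This is also why your discrete estimate (0.4998) sits so close to 0.5.

q <- 3
1/(q - 1) * (1 - 0.01)    # k = 10 bins:  0.495
1/(q - 1) * (1 - 1e-04)   # k = 100 bins: 0.49995
1/(q - 1) * (1 - 1)       # continuous U[0,1]: 0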

We can also show that integrating your empirical density estimate gives about the same answer as directly integrating the true density function (dbeta):

library(sfsmisc)
PDF <- density(mystring)
H2 <- 1/(q - 1) * (1 - integrate.xy(PDF$x, PDF$y^q))  # KDE-based estimate
H2
# [1] -0.6997353
g <- function(x) dbeta(x, 2, 4)^q
H3 <- 1/(q - 1) * (1 - integrate(g, -Inf, Inf)$value)  # true-density value
H3
# [1] -0.8986014
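The remaining gap between H2 and H3 reflects sampling error, the rounding to 2 decimal places, and the kernel bandwidth. As a hedged sketch (the sample size 5e4 and grid size n = 2048 are my choices, not tuned values), one could check that a larger, un-rounded sample moves the KDE-based estimate toward H3:

set.seed(567)
big <- rbeta(5e4, 2, 4)            # larger sample, no rounding
PDFbig <- density(big, n = 2048)   # KDE evaluated on a finer grid
H2big <- 1/(q - 1) * (1 - integrate.xy(PDFbig$x, PDFbig$y^q))
H2big                              # expected to land closer to H3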
jlhoward