
I've tried for several hours to calculate the entropy and I know I'm missing something. Hopefully someone here can give me an idea!

EDIT: I think my formula is wrong!

CODE:

    info <- function(CLASS.FREQ){
      freq.class <- CLASS.FREQ
      info <- 0
      for(i in 1:length(freq.class)){
        if(freq.class[[i]] != 0){ # zero check in class
          entropy <- -sum(freq.class[[i]] * log2(freq.class[[i]])) # I calculate the entropy for each class i here
        } else {
          entropy <- 0
        }
        info <- info + entropy # sum up entropy from all classes
      }
      return(info)
    }

I hope my post is clear, since it's the first time I actually post here.

This is my dataset:

buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")

credit <- c("fair", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "fair", "fair", "excellent", "excellent", "fair", "excellent")

student <- c("no", "no", "no","no", "yes", "yes", "yes", "no", "yes", "yes", "yes", "no", "yes", "no")

income <- c("high", "high", "high", "medium", "low", "low", "low", "medium", "low", "medium", "medium", "medium", "high", "medium")

age <- c(25, 27, 35, 41, 48, 42, 36, 29, 26, 45, 23, 33, 37, 44) # we change the age from categorical to numeric
Léo Léopold Hertz 준영
Codex
  • Ironically of course, the worse the calculation, the closer the answer. – Strawberry Dec 02 '14 at 16:58
  • It would be good to post (a) the formula you think is right, and (b) a sample of the type of data you will feed to this function. Using `dput()` is a great way to share data. – Gregor Thomas Dec 02 '14 at 17:01
  • And the answer should be: 0.940286. – Codex Dec 02 '14 at 17:38
  • @Codex, which object is that answer referring to (e.g. age? income?) or are you trying to combine all objects? Quickly running through some, it appears to refer to 'buys'? – cdeterman Dec 02 '14 at 17:43
  • @cdeterman It is not specified in the answer what exactly it refers to, which is why I'm so frustrated. But my thoughts are either "buys", since it's the class label, or alternatively all objects combined. – Codex Dec 02 '14 at 17:47
  • @cdeterman, how are you actually running through them? I think I'm failing/misunderstanding that part. – Codex Dec 02 '14 at 17:51

2 Answers


Ultimately I find no error in your code: it runs and returns a value. The part you are missing is the calculation of the class frequencies to feed into it; add that and you will get your answer. Quickly running through the different objects you provide, I suspect you are looking at buys.

buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")
freqs <- table(buys)/length(buys)
info(freqs)
[1] 0.940286
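As a sanity check, buys contains 9 "yes" and 5 "no" out of 14 observations, so you can write the two terms of the entropy sum out by hand:

```r
# Hand-computed entropy of buys: 9 "yes" and 5 "no" out of 14
p_yes <- 9/14
p_no  <- 5/14
-(p_yes * log2(p_yes) + p_no * log2(p_no))
# [1] 0.940286
```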

As a matter of improving your code, you can simplify it dramatically: no loop is needed when you are given a vector of class frequencies.

For example:

# calculate Shannon entropy
-sum(freqs * log2(freqs))
[1] 0.940286
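One caveat with the one-liner: if any class frequency is 0, then freqs * log2(freqs) evaluates to 0 * -Inf, which is NaN in R, whereas your loop's zero check returns 0 for that class. A small zero-safe sketch (the name shannon is just illustrative):

```r
# Zero-safe Shannon entropy: drop zero frequencies before taking logs,
# because 0 * log2(0) is NaN in R rather than the limit value 0
shannon <- function(freqs) {
  freqs <- freqs[freqs > 0]
  -sum(freqs * log2(freqs))
}

shannon(c(9/14, 5/14))   # [1] 0.940286
shannon(c(0.5, 0.5, 0))  # a zero class no longer produces NaN
```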

As a side note, the entropy package provides the function entropy.empirical, where you can set the unit to log2, giving you some more flexibility. Example:

entropy.empirical(freqs, unit="log2")
[1] 0.940286
cdeterman

There is another way, similar to the answer above, but using a different function.

> buys <- c("no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")

> probabilities <- prop.table(table(buys))

> probabilities
buys
       no       yes 
0.3571429 0.6428571 

> -sum(probabilities*log2(probabilities))

[1] 0.940286

Also, entropy.empirical(probabilities, unit = "log2") from the entropy package gives the same result.

mcgusty