
I have a corpus consisting of thousands of lines. For the sake of simplicity, let's consider the corpus to be:

Today is a good day
I hope the day is good today
It's going to rain today
Today I have to study

How do I calculate the entropy using the corpus above? The formula for the entropy is given as:

$$H = -\sum_{i} p_i \log_2 p_i$$

This is my understanding so far: $p_i$ refers to the probability of an individual symbol, calculated as frequency(symbol) / (total number of characters). What I fail to understand is the summation: how does the summation work in this specific formula?
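If my understanding is correct, then for a toy string such as `aab` the sum would expand over the distinct symbols like this (this is just my guess at how the terms are formed):

$$H = -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918$$

Is that the right way to apply the summation to a whole corpus as well?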

I am using Python 3.5.2 for the statistical data analysis. It would be really great if anybody could help me out with a code snippet for entropy calculation.


1 Answer


You can use scipy.stats.entropy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html) to calculate the entropy.
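For example (a minimal sketch; it assumes SciPy is installed and uses collections.Counter to get the symbol counts, which scipy.stats.entropy will normalize to probabilities for you):

    from collections import Counter
    from scipy.stats import entropy

    corpus = (
        "Today is a good day\n"
        "I hope the day is good today\n"
        "It's going to rain today\n"
        "Today I have to study"
    )

    # count how often each character occurs in the corpus
    counts = Counter(corpus)

    # entropy() normalizes the raw counts itself; base=2 gives bits
    print(entropy(list(counts.values()), base=2))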

Or write something like this:

import math

def Entropy(string, base=2.0):
    # collect the distinct symbols that occur in the string
    symbols = set(string)

    # relative frequency of each distinct symbol
    pkvec = [string.count(c) / len(string) for c in symbols]

    # Shannon entropy: H = -sum(p * log_base(p))
    return -sum(pk * math.log(pk, base) for pk in pkvec)


print(Entropy("Python is not so easy"))

It prints approximately 3.2728.
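To apply it to your corpus, join the lines into a single string first (whether you include the newline characters between lines is up to you and will change the result slightly):

    corpus = "\n".join([
        "Today is a good day",
        "I hope the day is good today",
        "It's going to rain today",
        "Today I have to study",
    ])

    # character-level entropy of the whole corpus, in bits per character
    print(Entropy(corpus))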
