I have a corpus consisting of thousands of lines. For the sake of simplicity, let's consider the corpus to be:
Today is a good day
I hope the day is good today
It's going to rain today
Today I have to study
How do I calculate the entropy using the corpus above? The formula for the entropy is given as:

$$H = -\sum_{i} P_i \log P_i$$
This is my understanding so far: P_i refers to the probability of an individual sign, which is calculated as (frequency of that sign) / (total number of characters). What I fail to understand is the summation: I am not sure how it works in this specific formula.
I am using Python 3.5.2 for the statistical data analysis. It would be really great if anybody could help me out with a code snippet for the entropy calculation.
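To make the request concrete, here is a minimal sketch of the kind of snippet I have in mind. It assumes the formula meant is the Shannon entropy, H = -sum(P_i * log2(P_i)), taken over the distinct characters of the corpus. The function name `char_entropy`, the base-2 logarithm, and the choice to join the lines with a single space and to count spaces as signs are all my own assumptions, not something given by the formula.

```python
import math
from collections import Counter


def char_entropy(text, base=2):
    """Shannon entropy of the character distribution of `text` (assumed formula)."""
    counts = Counter(text)        # frequency of each distinct character
    total = sum(counts.values())  # total number of characters
    # The summation has one term per distinct character i:
    #   P_i = counts[i] / total, and the term is P_i * log(P_i)
    return -sum((n / total) * math.log(n / total, base)
                for n in counts.values())


corpus = [
    "Today is a good day",
    "I hope the day is good today",
    "It's going to rain today",
    "Today I have to study",
]

# Assumption: treat the whole corpus as one string, lines joined by a space.
text = " ".join(corpus)
print("Entropy: {:.4f} bits per character".format(char_entropy(text)))
```

If this reading is right, the summation is just a loop over the distinct characters that adds up one `P_i * log(P_i)` term for each of them and negates the result. Lowercasing the text first or dropping spaces and punctuation would change the result, so those choices would need to be fixed before comparing numbers.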