8

I know the formula for calculating entropy:

H(Y) = - ∑ (p(yj) * log2(p(yj)))

In words: select an attribute and, for each of its values, check the target attribute's value ... so p(yj) is the fraction of patterns at node N that are in category yj: one for true in the target value and one for false.
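As a sanity check, the discrete formula above can be computed directly. Here is a minimal Python sketch; the 9-true/5-false split is just an illustrative example, not from my data:

```python
import math

def shannon_entropy(labels):
    """Shannon entropy H(Y) = -sum(p(y) * log2(p(y))) over class fractions."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# 9 true / 5 false, as in the classic play-tennis example
print(shannon_entropy(["t"] * 9 + ["f"] * 5))  # ≈ 0.940
```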

But I have a dataset in which the target attribute is price, hence a continuous range. How do I calculate entropy for this kind of dataset?

(Referred: http://decisiontrees.net/decision-trees-tutorial/tutorial-5-exercise-2/)

Andy Hayden
code muncher

2 Answers

7

You first need to discretise the data set in some way, e.g. by sorting it numerically into a number of buckets. Many methods for discretisation exist, some supervised (i.e. taking into account the value of your target function) and some not. This paper outlines various techniques in fairly general terms. For more specifics, there are plenty of discretisation algorithms in machine learning libraries like Weka.
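A minimal sketch of the unsupervised route, assuming simple equal-width binning (the prices are made up for illustration):

```python
import math

def equal_width_bins(values, k):
    """Unsupervised discretisation: assign each value to one of k
    equal-width buckets spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against all-equal values
    return [min(int((v - lo) / width), k - 1) for v in values]

def entropy(labels):
    """Shannon entropy of the discrete label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(x) for x in set(labels)))

prices = [12, 15, 14, 80, 85, 90, 41, 44, 39, 13]
bins = equal_width_bins(prices, 3)
print(bins)           # → [0, 0, 0, 2, 2, 2, 1, 1, 1, 0]
print(entropy(bins))  # ≈ 1.571
```

Supervised methods (e.g. MDL-based discretisation, as implemented in Weka) choose the cut points using the target instead of fixed widths.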

The entropy of continuous distributions is called differential entropy. It can also be estimated by assuming your data follows some distribution (normally distributed, for example), estimating the underlying distribution's parameters in the usual way, and using these to calculate an entropy value.

Vic Smith
  • but how can I decide ranges? suppose I sorted the data, how to decide range ... just guessing, if I want binary then avg of this data? – code muncher Jan 16 '13 at 17:06
  • There are many methods used for this, I will add more information to the answer, give me a sec... – Vic Smith Jan 16 '13 at 17:11
  • oops this doesn't make sense ... if attributes have two values then binary ... thanks @Vic Smith! – code muncher Jan 16 '13 at 17:11
  • You have a decision tree whose output is continuous. So split the dataset on the basis of range: take an attribute, say price, and split it into ranges r1, r2, ... Now figure out which range each of your dataset's values lies in. For all values in range Ri, the probability is ri / total number of prices (instances). Now put those values into the entropy formula. – V SAI MAHIDHAR Sep 20 '18 at 11:36
1

I concur with Vic Smith: discretization is generally a good way to go. In my experience, most seemingly continuous data are actually "lumpy", and little is lost by binning them.

However, if discretization is undesirable for other reasons, entropy is also defined for continuous distributions (see Wikipedia on your favorite distribution, e.g. http://en.wikipedia.org/wiki/Normal_distribution).

One approach would be to assume a form of distribution, e.g. normal, lognormal, etc., and calculate entropy from the estimated parameters. I don't think the continuous (differential) entropy and the discrete Shannon entropy are on the same scale, so I wouldn't mix them.
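For the lognormal case mentioned above, the same parametric trick works on the log of the data; the closed form for the lognormal's differential entropy is mu + 0.5*ln(2*pi*e*sigma^2) (in nats), where mu and sigma are fitted to log(X). A sketch, with made-up prices:

```python
import math

def lognormal_differential_entropy(samples):
    """Estimate mu/sigma of log(X) by MLE, then apply the closed form
    for the lognormal's differential entropy (in nats):
        h = mu + 0.5 * ln(2 * pi * e * sigma^2)"""
    logs = [math.log(x) for x in samples]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n
    return mu + 0.5 * math.log(2 * math.pi * math.e * var)

prices = [12.0, 15.0, 14.0, 80.0, 85.0, 90.0]
print(lognormal_differential_entropy(prices))
```

A lognormal is often a more natural assumption than a normal for strictly positive quantities like price.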

prototype