
In data pre-processing, data binning is a technique for converting the continuous values of a feature into categorical ones. For example, the values of an age feature are sometimes replaced with one of several intervals, such as:

[10,20),
[20,30),
[30,40].
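For illustration, intervals like these can be produced with `pandas.cut`. This is a minimal sketch; the ages are made up, and the last interval is treated as half-open here for uniformity:

```python
import pandas as pd

# Hypothetical ages to be binned.
ages = pd.Series([12, 25, 31, 39, 18])

# right=False makes every bin left-closed/right-open: [10,20), [20,30), [30,40).
bins = pd.cut(ages, bins=[10, 20, 30, 40], right=False,
              labels=["[10,20)", "[20,30)", "[30,40)"])
print(bins.tolist())  # → ['[10,20)', '[20,30)', '[30,40)', '[30,40)', '[10,20)']
```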

When is the best time to use data binning? Does it (always) lead to a better result in a prediction system, or is it a matter of trial and error?

Javad.Rad

1 Answer


Trial and error, mostly. When you apply binning to a continuous variable, you automatically throw away some information. Many algorithms prefer a continuous input for making predictions, and many effectively bin the continuous input themselves. Binning is wise to apply if your continuous variable is noisy, meaning its values were not recorded very accurately; binning can then reduce this noise. Common binning strategies include equal-width binning and equal-frequency binning. I would recommend avoiding equal-width binning when your continuous variable is unevenly distributed.
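To see why equal-width binning struggles on unevenly distributed data, compare it with equal-frequency binning on a skewed sample. This is a sketch using pandas, where `pd.cut` gives equal-width bins and `pd.qcut` gives equal-frequency (quantile) bins; the data and bin count are illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Skewed (unevenly distributed) data: most values are small, a few are large.
values = pd.Series(rng.exponential(scale=10, size=1000))

equal_width = pd.cut(values, bins=4)   # 4 bins spanning equal ranges
equal_freq = pd.qcut(values, q=4)      # 4 bins holding roughly equal counts

# Equal-width bins end up badly unbalanced: most samples crowd into the
# first bin, while the upper bins are nearly empty. Equal-frequency bins
# hold ~250 samples each by construction.
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```

With skewed data like this, the nearly empty equal-width bins carry almost no samples to learn from, which is why equal-frequency binning is usually the safer default in that situation.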

Gaussian Prior
  • Regardless of noise reduction, is it true to say that binning reduces the complexity of the feature space in a good or useful way? That is, samples whose continuous values fall within a range along a dimension (in our example, `age`) will be stacked onto particular discrete points; so it either has no effect or a good effect? – Javad.Rad Dec 28 '20 at 14:37
  • Well, besides noise reduction, I can't say for sure that stacking onto particular discrete points would be better. If age is kept continuous, your algorithm may attribute desired characteristics to values that fall between two binned age groups. For instance, if you were classifying into "creditworthiness good" or "bad", then an input of age 25 would probably contribute more meaningfully to the final output than Group A, which consists of people aged 18-25. – Gaussian Prior Dec 28 '20 at 15:17