In response to @j.jerrod.taylor's answer, let me rephrase my question to clear any misunderstanding.

I'm new to data mining and am learning how to handle noisy data by smoothing it with equal-width (distance) binning and "bin boundaries". Assume the dataset 1,2,2,3,5,6,6,7,7,8,9. I want to:

  1. perform distance binning with 3 bins, and
  2. smooth the values by bin boundaries, based on the bins from #1.

Based on the definition in (Han, Kamber, Pei, 2012, Data Mining: Concepts and Techniques, Section 3.2.2 Noisy Data):

In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.

  • Interval width = (max-min)/k = (9-1)/3 ≈ 2.7
  • Bin intervals = [1,3.7),[3.7,6.4),[6.4,9.1]
  • original Bin1: 1,2,2,3 | Bin boundaries: (1,3) | Smooth values by Bin Boundaries: 1,1,1,3
  • original Bin2: 5,6,6 | Bin boundaries: (5,6) | Smooth values by Bin Boundaries: 5,6,6
  • original Bin3: 7,7,8,9 | Bin boundaries: (7,9) | Smooth values by Bin Boundaries: 7,7,8,9
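For reference, the binning step above can be reproduced with a short Python sketch. The function name `equal_width_bins` is hypothetical, and the sketch uses the exact width 8/3 rather than the rounded 2.7, which gives the same bin assignment for this dataset. It assumes the data contain at least two distinct values, so the width is non-zero.

```python
def equal_width_bins(values, k):
    """Split values into k equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would otherwise index bin k, so clamp it
        # into the last bin (the closed right end of the final interval).
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins


data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9]
bins = equal_width_bins(data, 3)
print(bins)
```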

Question: where does 8 belong in Bin3 when smoothing by bin boundaries, since it is +1 from 7 and -1 from 9?

user2771721

2 Answers

If this is a problem, then you are calculating your bin widths incorrectly. Creating a histogram, for instance, is an example of data binning.

You can read this response on Cross Validated. But in general, if you're trying to bin integers, your boundaries will be doubles (non-integer values).

For example, if you want everything between 2 and 6 to be in one bin, your actual boundaries will be 1.5 and 6.5. Since all of your data are integers, no value can fall exactly on a boundary, so every value is classified unambiguously.
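A minimal sketch of that idea, assuming half-integer edges at 1.5 and 6.5 (the `edges` list and the `bin_index` helper are illustrative names, not from any library). It uses `bisect_right` from the standard library to find which interval a value falls into:

```python
from bisect import bisect_right

# Half-integer edges: integer data can never land exactly on one.
edges = [1.5, 6.5]


def bin_index(value, edges):
    """Return 0 for values below the first edge, 1 for the next interval, and so on."""
    return bisect_right(edges, value)


data = [1, 2, 2, 3, 5, 6, 6, 7, 7, 8, 9]
labels = [bin_index(v, edges) for v in data]
print(labels)
```

Here everything from 2 through 6 lands in bin 1, with 1 below it in bin 0 and 7 through 9 above it in bin 2.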

Edit: I also have the same book, though it seems I have a different edition, because the section on Data Discretization is in chapter 2 instead of chapter 3 as you pointed out. Based on your question, it seems like you don't really understand the concept yet.

The following is an excerpt from page 88, Chapter 2 on Data Preprocessing. I'm using the second edition of the text.

For example, attribute values can be discretized by applying equal-width or equal-frequency binning, and then replacing each bin value by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively.

8 doesn't belong anywhere other than in bin 3. This gives you two options. You can either take the mean/median of all of the numbers that fall in bin 3, or you can use bin 3 as a category.

Building on your example, we can take the mean of the 4 numbers in bin 3. This gives us 7.75. We would now use 7.75 for the four numbers that are in that bin instead of 7, 7, 8 and 9.

The second option would be to use the bin number. For example, everything in bin 3 would get a category label of 3, everything in bin 2 would get a label of 2, etc.
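Both options can be sketched in a few lines of Python. The bins are hard-coded from the question, and the variable names are only illustrative:

```python
# Bins taken from the question's equal-width binning with k = 3.
bins = [[1, 2, 2, 3], [5, 6, 6], [7, 7, 8, 9]]

# Option 1: smoothing by bin means, every value becomes its bin's mean.
smoothed_by_mean = [[sum(b) / len(b)] * len(b) for b in bins]

# Option 2: replace every value with its 1-based bin number (a category label).
bin_labels = [[i + 1] * len(b) for i, b in enumerate(bins)]

print(smoothed_by_mean[2])  # the four values of bin 3 after mean smoothing
print(bin_labels[2])        # the four values of bin 3 as category labels
```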

j.jerrod.taylor

UPDATE WITH CORRECT ANSWER:

My class finally covered this topic, and the answer to my own question is that 8 can be assigned to either 7 or 9. This scenario calls for "tie-breaking", because the value is equidistant from both boundaries. It is acceptable to tie all such values consistently to the same boundary.

Here is a real example of an NIH analysis paper that explains using "tie-breaking" when equidistant values are encountered: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3807594/
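As a rough illustration of smoothing by bin boundaries with a consistent tie-breaking rule, here is a minimal Python sketch. The function name `smooth_by_boundaries` and its `tie` parameter are my own hypothetical choices, not something defined in the textbook or the paper:

```python
def smooth_by_boundaries(bin_values, tie="low"):
    """Replace each value with the closer of the bin's min and max.

    Values equidistant from both boundaries go to the lower boundary
    when tie == "low", otherwise to the upper boundary.
    """
    lo, hi = min(bin_values), max(bin_values)
    smoothed = []
    for v in bin_values:
        dist_lo, dist_hi = v - lo, hi - v
        if dist_lo < dist_hi:
            smoothed.append(lo)
        elif dist_hi < dist_lo:
            smoothed.append(hi)
        else:
            # Tie: apply the same rule to every equidistant value.
            smoothed.append(lo if tie == "low" else hi)
    return smoothed


bin3 = [7, 7, 8, 9]
print(smooth_by_boundaries(bin3, tie="low"))   # 8 is tied to 7
print(smooth_by_boundaries(bin3, tie="high"))  # 8 is tied to 9
```

Note that the value 2 in Bin1 (1,2,2,3) is also equidistant from its boundaries, so the result 1,1,1,3 in the question corresponds to the "low" rule.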

user2771721