
I have a data set with x attributes and y records. Given an input record which has up to x-1 missing values, how would I reasonably approximate one of the remaining missing values?

So in the example below, the input record has two values (for attribute 2 and 6, with the rest missing) and I would like to approximate a value for attribute 8.

[Image: data table with the input record I want to classify]

I know missing values are dealt with through 'imputation', but the examples I find generally deal with pre-processing whole datasets. I'm looking for a solution that uses regression to determine the missing value and, ideally, one that uses a model built once, so I don't have to generate a new model for each input.

Z-Mehn
    It would be helpful if you provided your sample data as text instead of an image. We cannot cut and paste an image. – G5W Jan 13 '17 at 18:02

1 Answer


The number of possible patterns of present and absent attributes makes it seem impractical to maintain a collection of models (say, one linear regression per pattern) covering every case. The one approach that seems practical to me is the one where you don't really build a model at all: nearest-neighbor regression. My suggestion would be to use whatever attributes are available to compute the distance from the input to each of your training points, then take the value from the nearest neighbor or a (possibly weighted) average of several nearest neighbors. In your example, we would use only attributes 2 and 6 to compute distance. The nearest point is the last one, (3.966469, 8.911591). That point has the value 6.014256 for attribute 8, so that is your estimate of attribute 8 for the new point.
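A minimal sketch of this idea in Python (the training records below are made up, since the asker's table is only available as an image): distance is computed over only the attributes the query record actually has, so the same training data serves every pattern of missing values without building a per-pattern model.

```python
import math

def knn_predict(train, targets, query, k=1):
    """k-nearest-neighbor regression over only the query's observed attributes.

    train   : list of dicts mapping attribute index -> value (complete records)
    targets : the value of the attribute being predicted, per training record
    query   : dict holding only the attributes observed on the input record
    """
    observed = list(query)

    def dist(row):
        # Euclidean distance restricted to the observed attributes
        return math.sqrt(sum((row[a] - query[a]) ** 2 for a in observed))

    ranked = sorted(zip(train, targets), key=lambda pair: dist(pair[0]))
    return sum(t for _, t in ranked[:k]) / k

# Illustrative training records keyed by attribute number (not the asker's data),
# with their attribute-8 values as targets.
train = [{2: 1.0, 6: 2.0}, {2: 4.0, 6: 9.0}, {2: 7.0, 6: 1.0}]
targets = [3.5, 6.0, 8.2]

# Input record with only attributes 2 and 6 observed, as in the question.
print(knn_predict(train, targets, {2: 3.9, 6: 8.9}))  # prints 6.0
```

The same `knn_predict` call works unchanged if a different input record arrives with, say, attributes 1 and 4 observed instead; only the `query` dict changes.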

Alternatively, you could use the three nearest neighbors. Those are points 17, 8, and 12, so you could take the average of their values for attribute 8, or a weighted average; people often use the weights 1/dist. Of course, three neighbors is just an example, and you could pick any other k.
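The 1/dist weighting can be sketched like this (the distances and attribute-8 values below are illustrative, not taken from the asker's table):

```python
def inverse_distance_average(distances, values):
    """Weighted average of neighbor values with weights 1/dist."""
    # If the query coincides with a training point, return its value exactly
    # rather than dividing by zero.
    for d, v in zip(distances, values):
        if d == 0:
            return v
    weights = [1.0 / d for d in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Three neighbors at distances 2, 3, 3 with attribute-8 values 4, 5, 5:
# the closest neighbor pulls the estimate toward 4.
print(inverse_distance_average([2, 3, 3], [4, 5, 5]))
```

Note that the result is always a convex combination of the neighbors' values, a point that matters for the extrapolation discussion in the comments below.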

This is probably better than using the global average (8.4) for all missing values of attribute 8.

G5W
  • Thanks for your answer. A k-nearest-neighbors approach wouldn't work outside the bounds of the training set, though, would it? Suppose there were a 1:1 correlation between two attributes, e.g. (4,4) (5,5) (5,5) (6,6) (7,7) (7,7) (8,8), and I had the input (2,x). Using the 3 nearest neighbors would give a predicted x value of 4.67 (or 4.57 if weighted by 1/dist). – Z-Mehn Jan 15 '17 at 04:22
  • You are right that the situation you describe would produce doubtful results, but that is true of any method. In your example, I think you are assuming a model of the data (a line). If you know that model and only have to estimate its parameters, you may (as in your example) be able to do better. But suppose your function were quadratic with error in the measurements: extrapolation will do poorly. And if you did not know the underlying form of the function and just used something that fit the training data, again, extrapolation is dangerous. – G5W Jan 15 '17 at 13:59
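The limitation discussed in these comments can be checked numerically with the 1:1 data from the first comment: a k-NN prediction is an average of training targets, so it can never leave the range of the observed values, no matter how far outside the training set the query falls.

```python
xs = [4, 5, 5, 6, 7, 7, 8]  # attribute used for distance
ys = [4, 5, 5, 6, 7, 7, 8]  # attribute to predict (perfect 1:1 relation)

query = 2  # outside the training range; the "true" value would be 2

# Rank training points by distance to the query and average the 3 nearest targets.
order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))
pred = sum(ys[i] for i in order[:3]) / 3

print(round(pred, 2))  # prints 4.67, not 2: k-NN cannot extrapolate past the data
```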