
I'm trying to use the SVD imputation from the bcv package, but within each column all the imputed values are identical.

This is the dataset with missing data: http://pastebin.com/YS9qaUPs

# load the bcv package and the data
library(bcv)
dataMiss = read.csv('dataMiss.csv')
# impute the data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
# find the positions of the missing values
bool = apply(X = dataMiss, 2, is.na)
# keep only the imputed values, column by column
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)

head(SVDImpNA)
        V1   V2   V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01

Where am I wrong?

Sojers

2 Answers


The impute.svd algorithm works as follows:

  1. Replace all missing values with the corresponding column means.

  2. Compute a rank-k approximation to the imputed matrix.

  3. Replace the values in the imputed positions with the corresponding values from the rank-k approximation computed in Step 2.

  4. Repeat Steps 2 and 3 until convergence.
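The four steps above can be sketched in base R. This is only an illustration of the procedure as described here, not the bcv source code: the function name svd_impute_sketch, the tol argument, and the stopping rule (largest change in the imputed entries) are assumptions made for the example.

```r
# Illustrative sketch of the steps above; NOT the bcv implementation.
# svd_impute_sketch, tol, and the stopping rule are hypothetical choices.
svd_impute_sketch <- function(x, k, tol = 1e-8, maxiter = 100) {
  x <- as.matrix(x)
  miss <- is.na(x)
  if (!any(miss)) return(x)  # nothing to impute

  # Step 1: replace each missing value with its column mean
  col_means <- colMeans(x, na.rm = TRUE)
  x[miss] <- col_means[col(x)[miss]]

  for (iter in seq_len(maxiter)) {
    # Step 2: rank-k approximation from a truncated SVD
    s <- svd(x, nu = k, nv = k)
    approx <- s$u %*% diag(s$d[1:k], nrow = k) %*% t(s$v)

    # Step 3: overwrite only the originally missing positions
    delta <- max(abs(x[miss] - approx[miss]))
    x[miss] <- approx[miss]

    # Step 4: stop once the imputed values no longer change
    if (delta < tol) break
  }
  x
}

# Toy data (made up for this example): 20 x 3 matrix with 10 missing cells
set.seed(1)
m <- matrix(rnorm(60), nrow = 20, ncol = 3)
m[sample(length(m), 10)] <- NA

full <- svd_impute_sketch(m, k = 3)  # k = min(n, p)
low  <- svd_impute_sketch(m, k = 1)  # low-rank approximation
```

With k = min(n, p), the sketch leaves the imputed cells at the column means, which is the behaviour described in the question; with a smaller k, the imputed cells are updated from the low-rank fit.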

In your example code, you are using k = min(n, p) (the default). Then, in Step 2, the rank-k approximation is exactly equal to the imputed matrix, so the algorithm converges after 0 iterations. That is, the algorithm sets all imputed entries to the column means (or something extremely close to them if there is numerical error).
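A quick way to convince yourself of this is to reconstruct a complete matrix from its SVD at full rank and at a lower rank. The snippet below uses only base R and a made-up 10 x 3 matrix (not your data); the variable names are arbitrary.

```r
# Toy complete matrix (hypothetical data, for illustration only)
set.seed(42)
a <- matrix(rnorm(30), nrow = 10, ncol = 3)

s <- svd(a)

# Rank min(n, p) = 3 reconstruction: uses every singular value
full_rank <- s$u %*% diag(s$d) %*% t(s$v)

# Rank-2 reconstruction: drops the smallest singular value
rank_two <- s$u[, 1:2] %*% diag(s$d[1:2]) %*% t(s$v[, 1:2])

# Largest absolute reconstruction error in each case
err_full <- max(abs(a - full_rank))  # zero up to floating-point error
err_two  <- max(abs(a - rank_two))   # clearly non-zero
```

Because the full-rank reconstruction reproduces the matrix, Step 3 has nothing new to write into the missing positions, and the mean-filled values are returned unchanged.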

If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k. The following code demonstrates this with your sample data:

> library("bcv")
> dataMiss = read.csv('dataMiss.csv')

k=3

> SVDimputation = impute.svd(dataMiss, k = 3,  maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))

-0.01  0.01 
  531  1062 

k=2

> SVDimputation = impute.svd(dataMiss, k = 2,  maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))

-11.31  -6.94  -2.59  -2.52  -2.19  -2.02  -1.67  -1.63 
    25     23     61      2     54     23      5     44 
 -1.61   -1.2  -0.83   -0.8  -0.78  -0.43  -0.31  -0.15 
    14     10     13     19     39      1     14     19 
 -0.14  -0.02      0   0.01   0.02   0.03   0.06   0.17 
    83     96     94     77     30     96     82     28 
  0.46   0.53   0.55   0.56   0.83   0.91   1.26   1.53 
     1    209     83     23     28    111     16      8 
  1.77   5.63   9.99  14.34 
   112     12     33      5 

Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000.

Patrick Perry

The problem that you describe likely occurs because impute.svd initially sets all of the NA values to be equal to the column means, and then doesn't change these values upon convergence.

It depends on why you are using SVD imputation in the first place, but if you are flexible, a good solution might be to change the rank of the SVD by setting k to, e.g., 1. Currently, k defaults to min(n, p), where n = nrow and p = ncol, which for your data means k = 3. If you set it to 1 (as in the example in the impute.svd function documentation), the problem does not occur:

library(bcv) 
dataMiss = read.csv("dataMiss.csv") 
SVDimputation = round(impute.svd(dataMiss, k = 1)$x, 2)

head(SVDimputation) 
      [,1]  [,2]  [,3]
[1,]  0.96 -0.23  0.52
[2,]  0.02 -0.23 -1.92
[3,] -1.87 -0.23  0.52
[4,] -0.92 -0.23  0.52
[5,]  0.49 -0.46  0.52
[6,] -1.87 -0.23  0.52
Andy McKenzie
  • Thank you. I already tried that, but k is the compression parameter for the matrix, so I guess a small value will speed things up by computing a low-rank approximation of the SVD for the imputation, at the cost of poor imputation results. – Sojers Feb 28 '16 at 00:59