I am trying to apply NMF to my dataset using Python's scikit-learn. My dataset contains both genuine zeros and missing values, but scikit-learn does not allow NaN values in the data matrix. Some posts suggest replacing the missing values with zeros.

My questions are:

  • If I replace missing values with zeros, how can the algorithm tell the missing values apart from real zeros?

  • Are there any other NMF implementations that can deal with missing values?

  • Or are there any other matrix factorization algorithms that can predict the missing values?

Zhaojie Tao
  • Replacing missing values with zeros (or the column mean, row mean, ...) is invisible to the factorization algorithm: it treats the filled-in numbers like any others. That might be acceptable, since these methods always assume that a low-rank model exists. In general, I would say that missing-value prediction is a harder problem, needing stronger assumptions, than finding a low-rank factorization of a matrix without missing values. As an alternative, write an SGD-based optimizer for a common NMF formulation and sample from the known values only. – sascha Sep 07 '16 at 10:52
  • Thanks, it seems that ignoring missing values when applying SGD is the solution. – Zhaojie Tao Sep 19 '16 at 06:02
  • Facing the same problem. Have you written your own SGD implementation? If so, how is it performing? So far I have not been able to achieve anything that performs similarly to NMF. – silentser Feb 02 '17 at 16:39
  • @silentser Yes, I have tried my own SGD implementation. It performs similarly to the sklearn implementation but is much slower. – Zhaojie Tao Mar 31 '17 at 03:38
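
The approach discussed in the comments above (SGD that samples only from the known values) can be sketched roughly as follows. This is a minimal illustration under assumptions, not the implementation anyone in the thread actually wrote: the function name `masked_nmf_sgd`, its parameters, and the use of projected SGD (clipping negative factors to zero after each step) are all choices made for this example. Missing entries are marked with NaN and never enter the loss, while genuine zeros stay in the fit like any other observed value.

```python
import numpy as np

def masked_nmf_sgd(X, rank=2, lr=0.01, reg=0.0, epochs=200, seed=0):
    """Hypothetical helper: fit X ~ W @ H with W, H >= 0 on non-NaN entries only."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, rank))
    H = rng.random((rank, m))
    # Indices of observed (non-missing) entries; real zeros are included here.
    rows, cols = np.nonzero(~np.isnan(X))
    for _ in range(epochs):
        # Visit the observed entries in a fresh random order each epoch.
        for k in rng.permutation(len(rows)):
            i, j = rows[k], cols[k]
            err = X[i, j] - W[i] @ H[:, j]
            w_old = W[i].copy()
            # Gradient step, then projection onto the non-negative orthant.
            W[i] = np.maximum(W[i] + lr * (err * H[:, j] - reg * W[i]), 0.0)
            H[:, j] = np.maximum(H[:, j] + lr * (err * w_old - reg * H[:, j]), 0.0)
    return W, H

# Toy data: a low-rank non-negative matrix with two missing cells and one real zero.
rng = np.random.default_rng(1)
X = rng.random((6, 2)) @ rng.random((2, 5))
X[0, 1] = X[3, 4] = np.nan  # missing values, ignored during fitting
X[2, 2] = 0.0               # genuine zero, treated as an observation

W, H = masked_nmf_sgd(X, rank=2, epochs=500)
X_hat = W @ H  # X_hat[0, 1] and X_hat[3, 4] are the predictions for the missing cells
```

Because the inner loop runs in pure Python over individual entries, a sketch like this will be far slower than scikit-learn's solver on large matrices, which matches the experience reported in the comments.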

2 Answers

There is a thread about this on the scikit-learn GitHub, and a version seems to be available but is not yet committed to the main code.

https://github.com/scikit-learn/scikit-learn/pull/8474

Cristiana SP

SGD will do the job here, but scikit-learn does not have an implementation that can be applied to this task. Writing your own will work, but it will be really slow, since matrix factorization SGD cannot be parallelised directly. Check the Distributed SGD algorithm described here. It is not so hard to implement, and it speeds things up significantly.
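
To illustrate why parallelisation is not direct and how a block scheme gets around it: two SGD updates conflict when their entries share a row or a column, because they then touch the same factor vector. If the matrix is cut into a grid of blocks, blocks that share no row range and no column range can be processed independently. The sketch below only generates such a schedule; the function name `strata` and the cyclic-shift layout are assumptions for this example, and it is not necessarily the exact algorithm the answer links to.

```python
def strata(num_blocks):
    """Hypothetical helper: yield groups of (row_block, col_block) pairs.

    Blocks within one group share no row block and no column block,
    so their SGD updates touch disjoint parts of W and H.
    """
    for shift in range(num_blocks):
        yield [(b, (b + shift) % num_blocks) for b in range(num_blocks)]

# For a 3 x 3 grid of blocks there are 3 groups of 3 independent blocks each.
schedule = list(strata(3))
# schedule[0] == [(0, 0), (1, 1), (2, 2)]
# schedule[1] == [(0, 1), (1, 2), (2, 0)]
```

Each group could be handed to separate workers, one block per worker, with the groups themselves processed one after another.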

silentser
  • The link seems broken. Is this the same one as your original? http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.230.7682&rep=rep1&type=pdf – Kostas Mouratidis Mar 03 '19 at 11:04