I am trying out orthogonal matching pursuit (OMP) in scikit-learn as a regularized regression, as an alternative to lasso and elastic net. I have 40k data points and 40 features. Lasso selects 5 features, while OMP selects only 1.

What could be causing this? Am I using OMP the wrong way? Perhaps it is not meant to be used for regression. Please let me know if you can think of anything else I may be doing wrong.

andrechalom
Baron Yugovich
  • I can't help, as I don't know a thing about scikit-learn, but you need to provide us with some more details. What is the code that you're running? Can you provide a small dataset that reproduces your problem? – andrechalom Apr 04 '16 at 17:41
  • Your question is a much better fit for http://stats.stackexchange.com/. Good luck! – Framester Apr 04 '16 at 18:54
  • Please post complete, runnable code and a dataset. The question is impossible to answer otherwise. – Alex I Apr 04 '16 at 19:23
  • What are you setting as the target sparsity when you create the OMP object? Do you get an error/warning when you run the code for the first time? – obachtos Feb 17 '17 at 11:09

1 Answer

Orthogonal Matching Pursuit, as implemented in scikit-learn, seems a bit broken, or at least very sensitive to the input data.

Example:

import sklearn.linear_model
import sklearn.datasets
import numpy

X, y, w = sklearn.datasets.make_regression(n_samples=40000, n_features=40, n_informative=10, coef=True, random_state=0)

# Written against a 2016-era scikit-learn; newer releases no longer accept
# `normalize`, so drop that argument there.
clf1 = sklearn.linear_model.LassoLarsCV(fit_intercept=True, normalize=False, max_n_alphas=1000000)
clf1.fit(X, y)

clf2 = sklearn.linear_model.OrthogonalMatchingPursuitCV(fit_intercept=True, normalize=False)
clf2.fit(X, y)

# this is 1e-10, LassoLars is basically exact on this data
print(numpy.linalg.norm(y - clf1.predict(X)))

# this is 7e+8, OMP is broken
print(numpy.linalg.norm(y - clf2.predict(X)))
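
To tie this back to the original question (how many features each model keeps), one can count the nonzero entries of `coef_`. The following is only a minimal sketch, assuming a recent scikit-learn release. The smaller sample size, the explicit `n_nonzero_coefs=10`, and the helper name `count_selected` are illustrative choices, not part of the original post. Note that when `n_nonzero_coefs` is left unset (and no `tol` is given), the plain `OrthogonalMatchingPursuit` estimator is documented to default to about 10% of the number of features, which by itself can make it look much sparser than lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsCV, OrthogonalMatchingPursuit

# Smaller than the 40000 x 40 problem above, only to keep the sketch fast (assumption).
X, y, w = make_regression(n_samples=2000, n_features=40, n_informative=10,
                          coef=True, random_state=0)


def count_selected(model, tol=1e-10):
    """Number of coefficients whose magnitude exceeds tol (hypothetical helper)."""
    return int(np.sum(np.abs(model.coef_) > tol))


lasso = LassoLarsCV().fit(X, y)
# Ask OMP for an explicit target sparsity instead of relying on the default.
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=10).fit(X, y)

n_lasso = count_selected(lasso)
n_omp = count_selected(omp)
print("LassoLarsCV selected:", n_lasso)
print("OMP selected:", n_omp)
print("OMP residual norm:", np.linalg.norm(y - omp.predict(X)))
```

If OMP still selects far fewer features than requested, or leaves a large residual on noise-free data like this, that points at the implementation or its settings rather than at the method itself.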

Fun experiments:

  • There are a bunch of canned datasets in sklearn.datasets. Does OMP fail on all of them? Apparently, it works okay on the diabetes dataset...

  • Is there any combination of parameters to make_regression that would generate data that OMP works for? Still looking for that one... 100 x 100 and 100 x 10 fail in the same way.
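
One way to separate "bug" from "how OMP is supposed to work" is to compare against a bare-bones reference. Below is a rough sketch of textbook OMP in plain NumPy: greedily pick the column most correlated with the current residual, then refit least squares on all selected columns. It is not scikit-learn's implementation, and the function name `omp_reference`, the synthetic data, and the chosen coefficients are all assumptions made for illustration.

```python
import numpy as np


def omp_reference(X, y, n_nonzero):
    """Textbook OMP sketch (hypothetical helper, not scikit-learn's code)."""
    coef = np.zeros(X.shape[1])
    col_norms = np.linalg.norm(X, axis=0)
    support = []
    residual = y.astype(float).copy()
    for _ in range(n_nonzero):
        # Score each column by its normalized correlation with the residual.
        scores = np.abs(X.T @ residual) / col_norms
        if support:
            scores[support] = -np.inf  # never pick the same column twice
        support.append(int(np.argmax(scores)))
        # Refit least squares on every column selected so far.
        sol, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        residual = y - X[:, support] @ sol
    coef[support] = sol
    return coef, sorted(support)


# Noise-free synthetic problem with a known sparse solution (assumed setup).
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 40))
true_support = [3, 7, 11, 19, 23]
true_coef = np.zeros(40)
true_coef[true_support] = [5.0, -4.0, 3.0, 6.0, -7.0]
y = X @ true_coef

coef, support = omp_reference(X, y, n_nonzero=5)
residual_norm = np.linalg.norm(y - X @ coef)
print("selected columns:", support)
print("residual norm:", residual_norm)
```

Running the same data through `OrthogonalMatchingPursuit(n_nonzero_coefs=5)` and comparing the selected columns and residual against this reference would show whether the library result deviates from the basic algorithm.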

Alex I
  • Maybe then this should be posted as an Issue at the [scikit-learn github](https://github.com/scikit-learn/scikit-learn) – João Almeida Apr 05 '16 at 20:45
  • @JoãoAlmeida: Yes, probably. I want to be sure it is a bug and not how OMP is supposed to work though. I'd try some more simple synthetic data first. – Alex I Apr 05 '16 at 20:54
  • @BaronYugovich: Do you have any more questions about this? If I've answered your question, please remember to accept and award the bounty. – Alex I Apr 11 '16 at 11:13