3

I am using Scikit-learn to convert my training data to polynomial features and then fit a linear model on them.

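from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
model = Pipeline([('poly', PolynomialFeatures(degree=3)),
                  ('linear', LinearRegression(fit_intercept=False))])
model.fit(X, y)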

But it throws an error:

TypeError: A sparse matrix was passed, but dense data is required

I know my data is in sparse matrix format. When I try to convert it to a dense matrix, I get a memory error because the data is huge (~50k), so converting it to dense is not an option.

I also found GitHub issues where this feature was requested, but it is still not implemented.

So can someone please tell me how to use sparse data with PolynomialFeatures in Scikit-learn without converting it to dense format?

Niyamat Ullah
  • 1
    As you have already discussed this on the GitHub issue, there's little we can do here on Stack Overflow. Maybe you can try implementing your own version and come back here if you run into difficulties. – Vivek Kumar Jan 11 '18 at 07:11
  • 1
    It seems the developers have opened a PR for [the same here](https://github.com/scikit-learn/scikit-learn/pull/10452). Please have a look at it. – Vivek Kumar Jan 11 '18 at 07:12
  • 1
    Fair points. Try creating polynomial features from only a subset of your columns. – geompalik Jan 11 '18 at 14:42

3 Answers

2

This is a new feature in the upcoming 0.20 version of sklearn. See Release History - V0.20 - Enhancements. If you really want to test it out, you can install the development version by following the instructions in Sklearn - Advanced Installation - Install Bleeding Edge.

Grr
2

Since version 0.21.0, the PolynomialFeatures class accepts CSR matrices for degrees 2 and 3. The method laid out here is used, and the computation is much faster than with CSC or dense input, provided the data is sparse to any reasonable degree (even slightly).
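As a rough illustration, the pipeline from the question can be fed a CSR matrix directly. This is a minimal sketch assuming scikit-learn 0.21 or later; the random `X` and `y` are synthetic stand-ins for the real data, not the asker's dataset.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption): 1,000 rows, 20 columns, ~5% non-zero.
rng = np.random.RandomState(0)
X = sparse.random(1000, 20, density=0.05, format="csr", random_state=rng)
y = rng.rand(X.shape[0])

# Same pipeline as in the question; X is never converted to dense.
model = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("linear", LinearRegression(fit_intercept=False)),
])
model.fit(X, y)

# The expanded feature matrix stays sparse as well.
X_poly = model.named_steps["poly"].transform(X)
predictions = model.predict(X)
```

Note that the number of output columns still grows quickly with the degree, so keeping the input sparse reduces memory use but does not remove the cost of a very wide expansion.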

Andrew
1

While we wait for the next Sklearn release, you can find an implementation of sparse interactions here:

https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/features/SparseInteractions.py
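For readers who only need pairwise (degree-2) interaction terms, the general idea can be sketched with SciPy alone. This is not the code from the linked file; `sparse_pairwise_interactions` is a hypothetical helper written for illustration, and it loops over column pairs in Python, so it is only practical for a moderate number of columns.

```python
from itertools import combinations

from scipy import sparse


def sparse_pairwise_interactions(X):
    """Hypothetical helper: append the product of every pair of columns to X."""
    # CSC makes column slicing cheap.
    X = sparse.csc_matrix(X)
    blocks = [X]
    for i, j in combinations(range(X.shape[1]), 2):
        # Element-wise product of two sparse columns stays sparse.
        blocks.append(X[:, i].multiply(X[:, j]))
    return sparse.hstack(blocks, format="csr")


# Usage on a tiny example matrix.
X_small = sparse.csr_matrix([[1, 2, 0],
                             [0, 3, 4]])
X_inter = sparse_pairwise_interactions(X_small)
```

The output keeps the original columns first, followed by one column per pair in the order (0, 1), (0, 2), (1, 2).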