Best way to handle sparse + non-sparse data to create a model

Question

I'm wondering what the best way is to handle sparse and non-sparse data together in, e.g., a Ridge regression using scikit-learn.

Ridge can handle both sparse and non-sparse data.

Imagine something as simple as a description (text) field that gets Count/Tfidf vectorized (sparse), plus a continuous income variable.

Now imagine that we have several other text fields and several other continuous variables.
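To make the setup concrete, here is a minimal sketch of one common way to feed both kinds of features to a single Ridge model: keep the text features sparse, convert the scaled continuous column to a sparse matrix, and stack them side by side. The toy descriptions, incomes, and targets are made up for illustration, and `alpha=1.0` is an arbitrary choice, not a recommendation.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text field, one continuous field, one target.
descriptions = [
    "small flat near the station",
    "large house with garden",
    "studio in the city centre",
    "house with garage and garden",
]
income = np.array([[32000.0], [85000.0], [27000.0], [61000.0]])
y = np.array([1.2, 3.4, 0.9, 2.8])

# Sparse TF-IDF features for the text field.
X_text = TfidfVectorizer().fit_transform(descriptions)

# Scale the continuous column so its magnitude is comparable to TF-IDF values.
X_num = StandardScaler().fit_transform(income)

# Stack column-wise without ever densifying the text features.
X = sparse.hstack([X_text, sparse.csr_matrix(X_num)], format="csr")

model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(X)
```

With more text and numeric columns, the same idea extends by adding more blocks to the `hstack` list.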

What is the best way to model some continuous y variable?

I've considered building two separate models (one on the sparse data, one on the non-sparse data) and then somehow combining them.
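One way the "two models" idea could look is a simple stacking arrangement: fit a model on the sparse text features, take its out-of-fold predictions, and use them as one extra column next to the continuous variables in a second model. This is only a sketch under that assumption; the data, the `cv=3` split, and the `alpha` values are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Hypothetical toy data.
descriptions = [
    "small flat near the station",
    "large house with garden",
    "studio in the city centre",
    "house with garage and garden",
    "flat with balcony",
    "large studio near the park",
]
income_k = np.array([32.0, 85.0, 27.0, 61.0, 40.0, 35.0])  # income in thousands
y = np.array([1.2, 3.4, 0.9, 2.8, 1.5, 1.1])

X_text = TfidfVectorizer().fit_transform(descriptions)

# Out-of-fold predictions from the text-only model, so the second model
# never sees a prediction made on a row the text model was trained on.
text_pred = cross_val_predict(Ridge(alpha=1.0), X_text, y, cv=3)

# Second-stage model: text prediction plus the continuous variable.
X_stack = np.column_stack([text_pred, income_k])
final_model = Ridge(alpha=1.0).fit(X_stack, y)
final_pred = final_model.predict(X_stack)
```

To predict on new rows, the text model would also need to be fit on all of the training text so it can produce the first-stage column for them.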

I've also considered using PCA to reduce the sparse data to a "handleable" number of continuous features.
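A sketch of the dimensionality-reduction idea follows. Note that it swaps in scikit-learn's `TruncatedSVD` rather than PCA, because `TruncatedSVD` accepts sparse input directly and does not center the data. The toy data and `n_components=2` are assumptions for illustration; a real text field would call for many more components.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data.
descriptions = [
    "small flat near the station",
    "large house with garden",
    "studio in the city centre",
    "house with garage and garden",
]
income = np.array([[32000.0], [85000.0], [27000.0], [61000.0]])
y = np.array([1.2, 3.4, 0.9, 2.8])

X_text = TfidfVectorizer().fit_transform(descriptions)

# Reduce the sparse text matrix to a few dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_text_reduced = svd.fit_transform(X_text)

# Everything is dense and small now, so a plain column-wise stack works.
X = np.hstack([X_text_reduced, StandardScaler().fit_transform(income)])

model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(X)
```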

How do you usually solve this issue?

Note: the continuous variables would have many unique values (and you'd lose power anyway by converting them to bins), and the text fields might end up with something like a million features, so they cannot be stored as a dense matrix.

score -1 · Answer 1 · answered Oct 23 '15 at 08:02

This reply may be a little off-topic, but I want to understand what you mean by "Ridge can handle both sparse and non-sparse data". I am trying to run a logistic regression model in R in which all the predictors are text fields; however, my dependent variable is very sparse (only 0.9%). Do you think Ridge would be a good algorithm to use?
