I'm wondering what the best way is to handle a mix of sparse and non-sparse (dense) data in, e.g., a Ridge regression using scikit-learn.
Ridge can handle both sparse and non-sparse input.
Imagine something as simple as a description (text) field that gets Count/Tfidf vectorized (sparse), and a continuous income variable.
Now imagine that we have several other text fields and several other continuous variables.
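For concreteness, here is a minimal sketch of the kind of setup I mean, reduced to one text field and one continuous column. The data, the variable names (`descriptions`, `income`, `y`) and the choice to simply stack the two blocks with `scipy.sparse.hstack` are all made up for illustration; I am not claiming this is the right approach.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Made-up toy data, purely illustrative.
descriptions = [
    "senior engineer in a large city",
    "part time student",
    "retired teacher living in the countryside",
    "junior engineer and part time student",
]
income = np.array([[85000.0], [12000.0], [30000.0], [40000.0]])
y = np.array([3.2, 1.1, 2.0, 2.4])

# Text field -> sparse TF-IDF matrix (one column per vocabulary term).
X_text = TfidfVectorizer().fit_transform(descriptions)

# Standardize the continuous column so its scale is comparable to the
# TF-IDF values, then wrap it as sparse so the blocks can be stacked.
income_scaled = (income - income.mean()) / income.std()
X = hstack([X_text, csr_matrix(income_scaled)], format="csr")

# Ridge accepts the combined sparse matrix directly.
model = Ridge(alpha=1.0).fit(X, y)
pred = model.predict(X)
```

The combined matrix stays sparse: only the single income column is fully populated, so memory use is still dominated by the text features.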
What is the best way to model some continuous y variable?
I've considered building two separate models (one on the sparse data, one on the non-sparse data) and then somehow combining their predictions.
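A rough sketch of what I have in mind for the two-model idea, assuming the combination is done by feeding the text model's out-of-fold predictions into a second model alongside the continuous variable. The toy data, the variable names and the use of `cross_val_predict` with 3 folds are my own assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

# Made-up toy data, purely illustrative.
descriptions = [
    "senior engineer in a large city",
    "part time student",
    "retired teacher living in the countryside",
    "junior engineer and part time student",
    "city bus driver",
    "student teacher in a small town",
]
income = np.array([85000.0, 12000.0, 30000.0, 40000.0, 35000.0, 15000.0])
y = np.array([3.2, 1.1, 2.0, 2.4, 2.1, 1.3])

# Model 1: sparse text features only.
X_text = TfidfVectorizer().fit_transform(descriptions)

# Out-of-fold predictions, so the second model never sees a prediction
# made for a row that the text model was trained on.
text_pred = cross_val_predict(Ridge(alpha=1.0), X_text, y, cv=3)

# Model 2: dense features plus the text model's prediction as one column.
income_scaled = (income - income.mean()) / income.std()
X_stage2 = np.column_stack([text_pred, income_scaled])
final_model = Ridge(alpha=1.0).fit(X_stage2, y)
final_pred = final_model.predict(X_stage2)
```

Note that the vectorizer here is fitted on all rows before the cross-validation split, which is a simplification.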
I've also considered using PCA to reduce the sparse data to a "handleable" number of continuous features.
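A sketch of the dimensionality-reduction idea, using `TruncatedSVD` in place of PCA because it accepts sparse input directly without densifying it first. The toy data and `n_components=2` are arbitrary choices for illustration; a real text field would presumably need far more components.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy data, purely illustrative.
descriptions = [
    "senior engineer in a large city",
    "part time student",
    "retired teacher living in the countryside",
    "junior engineer and part time student",
]

# Sparse TF-IDF matrix, one column per vocabulary term.
X_text = TfidfVectorizer().fit_transform(descriptions)

# Project the sparse matrix onto a small number of dense components.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X_text)  # dense array, shape (n_samples, 2)
```

The resulting `X_reduced` is an ordinary dense array, so it could be placed next to the continuous variables in a single feature matrix.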
How do you usually solve this issue?
Note: the continuous variables would have many unique values (and you'd lose power anyway when converting continuous variables to bins), and the text fields might end up with something like a million features, so they cannot be stored as a dense matrix.