
I was trying to train a logistic regression (LR) classifier on a text dataset. Unlike the common scenario where the raw text is fed directly to a TF-IDF vectorizer, each original text line was first transformed into a dictionary like `{a: 0.1, phrase: 0.5, in: 0.3, line: 0.8}`, in which the weights were computed according to some specific rules and some words were omitted. So, in order to feed these dictionaries to the LR classifier, I chose `FeatureHasher` to do the hashing trick. However, I found that the LR classifier became extremely slow when the `n_features` parameter of `FeatureHasher` grew large, say 10^8.
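For reference, a minimal sketch of my setup (the sample dicts, labels, and the `n_features` value here are made up for illustration; my real weights come from my own rules):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each sample is a dict of {token: weight} produced by my own rules
# (tokens and weights below are placeholders).
samples = [
    {"a": 0.1, "phrase": 0.5, "in": 0.3, "line": 0.8},
    {"another": 0.2, "line": 0.4},
]
labels = [0, 1]

# Hash the dicts into a sparse matrix; in my case training slows down
# badly when n_features is raised to around 10**8.
hasher = FeatureHasher(n_features=2 ** 20)
X = hasher.transform(samples)  # scipy.sparse CSR matrix

clf = LogisticRegression()
clf.fit(X, labels)
```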

But as far as I know, both the memory cost and the computation cost of a sparse matrix should not grow with its dimension as long as the number of non-zero elements is fixed. For example, suppose we have a two-element sparse vector `[coordinates: (1, 2), values: (3, 4)]` whose original dimension is 10. If we change the hash range to 20, we might get `[(3, 7), (3, 4)]`: there is no difference in the cost of storing these two vectors, and if we compute the distance between this vector and another sparse vector, we only need to traverse two lists with a fixed number of elements, so the computation cost is fixed as well.
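A quick way to check the storage half of that claim, as a sketch using scipy's CSR format (the shapes and values are the ones from the toy example above):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The same two non-zero values, stored at dimension 10 and at dimension 20.
small = csr_matrix((np.array([3.0, 4.0]),
                    (np.array([0, 0]), np.array([1, 2]))), shape=(1, 10))
large = csr_matrix((np.array([3.0, 4.0]),
                    (np.array([0, 0]), np.array([3, 7]))), shape=(1, 20))

# CSR stores only data, indices, and indptr, so the footprint depends on
# the number of non-zeros, not on the declared dimension.
for m in (small, large):
    print(m.shape, m.nnz, m.data.nbytes + m.indices.nbytes + m.indptr.nbytes)
```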

I think there must be something wrong with my understanding, or I must be missing something about the LR classifier in sklearn. I hope someone can correct me, thanks!

kuixiong
  • I could be very off, but one thought is that it could be because of sparse dot performance (used in LR: [`safe_sparse_dot`](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py)). While `np.dot` uses BLAS and other optimizations, a sparse dot may not scale as well as the dimension grows (see the sketch after these comments). Also, take a look at this: http://stackoverflow.com/questions/18595981/improving-performance-of-multiplication-of-scipy-sparse-matrices – mkaran Feb 14 '17 at 13:10
  • How much slower, btw? – mkaran Feb 14 '17 at 13:16
  • @mkaran, I think this has nothing to do with the dot operation, because we can ensure the number of elements is the same and fixed in both scenarios. I tried 20M and 200M: the former finished training in minutes while the latter lasted for 3+ hours. – kuixiong Feb 20 '17 at 08:35
  • I may be missing something or misunderstanding your question, but: you mean the rows (the samples you want to predict) remain the same while the columns (features) change and increase, correct? Which means, as I understand it, the sparsity changes and the width of the sparse matrix increases; e.g. the shape of your `x_train` with 1000 samples would be something like (1000, 100) if you set `n_features=100` and (1000, 1000) with `n_features=1000`, since `n_features` is "the number of features (columns) in the output matrices". – mkaran Feb 20 '17 at 09:39
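A rough way to test the sparse-dot hypothesis from the first comment, as a sketch (the sizes are arbitrary, and the dense coefficient vector that LR maintains internally is simulated here with `np.zeros`):

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random

# Fixed number of non-zeros per row, growing number of columns: time one
# matrix-vector product, the core operation inside an LR gradient step.
n_samples, nnz_per_row = 1000, 50
for n_features in (10 ** 5, 10 ** 6, 10 ** 7):
    X = sparse_random(n_samples, n_features,
                      density=nnz_per_row / n_features, format="csr")
    w = np.zeros(n_features)  # LR keeps a dense coefficient vector this size
    t0 = time.time()
    X.dot(w)
    print(n_features, time.time() - t0)
```

Note that even if the product itself stays cheap, the dense vector `w` (and any dense gradient of the same length) is allocated at the full `n_features` size, which is one place where cost can grow with the hash range despite a fixed number of non-zeros.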

0 Answers