I was trying to train a logistic regression (LR) classifier on a text dataset. Unlike the common scenario where the raw text is fed directly to a tf-idf vectorizer, each original text line was first transformed into a dictionary like {a:0.1, phrase:0.5, in:0.3, line:0.8}, in which the weights were computed according to some specific rules and some words were omitted. So, in order to feed these dictionaries to the LR classifier, I chose FeatureHasher to apply the hashing trick. However, I found that the LR classifier became extremely slow when the n_features parameter of FeatureHasher grew large, say 10^8.
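For reference, here is a simplified sketch of the kind of pipeline I mean (the dicts and weights below are made up for illustration; the real ones come from my own rules):

```python
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression

# Each line has already been converted into a {token: weight} dict.
docs = [
    {"a": 0.1, "phrase": 0.5, "in": 0.3, "line": 0.8},
    {"another": 0.2, "line": 0.6},
]
labels = [0, 1]

# The hashing trick maps each dict into one sparse row of width n_features.
hasher = FeatureHasher(n_features=10**8, input_type="dict")
X = hasher.transform(docs)   # scipy sparse matrix, shape (2, 10**8)

clf = LogisticRegression()
clf.fit(X, labels)           # this step is where things get extremely slow
```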
But as far as I know, neither the memory cost nor the computation cost of a sparse matrix should grow with its dimension as long as the number of nonzero elements stays fixed. For example, take a two-element sparse vector [coordinates: (1, 2), values: (3, 4)] whose original dimension is 10. If we change the hash range to 20, we might get [(3, 7), (3, 4)] instead. There is no difference in storing these two vectors, and if we compute the distance between either of them and another sparse vector, we only need to traverse lists with a fixed number of elements, so the computation cost is also fixed.
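To make that concrete, here is a small check with scipy (my own illustration, storing the two vectors above in coordinate format):

```python
import numpy as np
from scipy.sparse import coo_matrix

# Same two nonzero values stored at different overall dimensions:
# indices (1, 2), values (3, 4) in a length-10 vector,
# and indices (3, 7), values (3, 4) in a length-20 vector.
row = np.zeros(2, dtype=int)
v10 = coo_matrix((np.array([3.0, 4.0]), (row, np.array([1, 2]))), shape=(1, 10))
v20 = coo_matrix((np.array([3.0, 4.0]), (row, np.array([3, 7]))), shape=(1, 20))

print(v10.nnz, v20.nnz)                   # 2 2 -- same number of stored entries
print(v10.data.nbytes + v10.col.nbytes)   # same number of bytes...
print(v20.data.nbytes + v20.col.nbytes)   # ...regardless of the declared dimension
```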
I think there must be something wrong with my understanding, or I must have missed something about sklearn's LR classifier. I hope someone can correct me, thanks!