
We are converting an online linear regression model from Vowpal Wabbit to Spark MLLib. Vowpal Wabbit allows arbitrary, sparse features by training the model on weights backed by a linked list, whereas Spark MLLib trains on an MLLib Vector of weights, which is backed by a fixed-length array.

The features we pass to the model are arbitrary strings and not categories. Vowpal Wabbit maps these features to weight values of 1.0 using a hash. We can do the same mapping in MLLib, but are limited to a fixed length array. Is it possible to train such a model in MLLib where the size of the feature space is not known?
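For concreteness, a minimal sketch of the binary hashing trick described above, assuming the RDD-based `org.apache.spark.mllib.linalg` API; the `hashFeatures` helper and its parameters are hypothetical names for illustration, not part of MLLib:

```scala
import scala.util.hashing.MurmurHash3
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hypothetical helper: hash each arbitrary string feature into a
// fixed-size vector and give it the value 1.0, as described above.
def hashFeatures(features: Seq[String], numFeatures: Int): Vector = {
  val indices = features
    .map { f =>
      val h = MurmurHash3.stringHash(f)                // MurmurHash3, as in VW
      ((h % numFeatures) + numFeatures) % numFeatures  // non-negative index
    }
    .distinct.sorted.toArray
  // Every present feature gets the value 1.0; absent features stay zero.
  Vectors.sparse(numFeatures, indices, Array.fill(indices.length)(1.0))
}

// Example: three arbitrary string features hashed into a 2^18 space.
val v = hashFeatures(Seq("user=123", "page=home", "geo=US"), 1 << 18)
```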

Bryan W. Wagner
  • Do you mean something like [`HashingTF`](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF)? – zero323 Feb 29 '16 at 23:42
  • I looked at HashingTF and I think it wasn't quite what I was looking for because it was calculating term frequency. I had considered using the Hashing Trick with a very large Vector but it's not quite equivalent to the linked list implementation and I'm not sure the large vector will keep up with our stream for training. Any insight will help, though. – Bryan W. Wagner Mar 01 '16 at 00:13
  • Input has to be a Vector and there is nothing you can do about it unless you want to rewrite most of MLLib. `HashingTF` uses a `SparseVector`, and using it with numFeatures equal to `Integer.MAX_VALUE` plus some post-processing (truncating values to 1.0) is the best you can get. I'm not convinced it's a sane way to create features, but that's a completely different story :). – zero323 Mar 01 '16 at 00:25
  • Thanks, yes that's exactly what I'm finding; I think I have an idea to use the single-bit output hash function as described [here](https://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick) with a relatively small vector size, since [some sources](https://en.wikipedia.org/wiki/Feature_hashing#Applications_and_practical_performance) say hashing collisions don't impact the model in a very negative way. This is a tradeoff to handle with care, though. – Bryan W. Wagner Mar 01 '16 at 15:34
  • [This question](http://stackoverflow.com/questions/27334694/apache-spark-mllib-how-to-build-labeled-points-for-string-features) is similar; the difference is that in my case the "frequency" would need an upper bound of 1. I may actually be able to use `HashingTF` with a second pass on the output Vector returning `min(1, v[i])` for each non-zero element i, as sketched below. – Bryan W. Wagner Mar 01 '16 at 15:50
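A hedged sketch of the workaround the comment thread converges on, assuming the RDD-based `org.apache.spark.mllib` API: hash with `HashingTF`, then clamp each count to 1.0 so the features are binary. The bucket count, the `binarize` helper, and the sample features are illustrative, not from the original discussion:

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.{SparseVector, Vector, Vectors}

// Hash raw string features into a fixed-size count vector.
// 2^20 buckets is an illustrative choice, not a required value.
val hashingTF = new HashingTF(1 << 20)

// Second pass from the comments: clamp every count to 1.0.
def binarize(v: Vector): Vector = v match {
  case sv: SparseVector =>
    Vectors.sparse(sv.size, sv.indices, sv.values.map(x => math.min(x, 1.0)))
  case dense =>
    Vectors.dense(dense.toArray.map(x => math.min(x, 1.0)))
}

val features = binarize(hashingTF.transform(Seq("user=123", "page=home")))
```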

1 Answer


`FeatureHasher` will do this, and it uses the same hash function as Vowpal Wabbit (MurmurHash3). Vowpal Wabbit and `FeatureHasher` both default to 2^18 features.

https://spark.apache.org/docs/2.3.1/api/java/org/apache/spark/ml/feature/FeatureHasher.html
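A minimal usage sketch, assuming Spark 2.3+ and the DataFrame-based `org.apache.spark.ml` API; the column names and sample rows are illustrative:

```scala
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("FeatureHasherSketch").getOrCreate()
import spark.implicits._

// Illustrative data: each column holds an arbitrary string feature.
val df = Seq(
  ("user=123", "page=home"),
  ("user=456", "page=search")
).toDF("userFeature", "pageFeature")

val hasher = new FeatureHasher()
  .setInputCols("userFeature", "pageFeature")
  .setOutputCol("features")  // numFeatures defaults to 2^18, as in Vowpal Wabbit

hasher.transform(df).select("features").show(truncate = false)
```

`FeatureHasher` hashes each string column as a `column=value` category and sets the corresponding vector entry to 1.0, which matches the Vowpal Wabbit behavior described in the question.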

bscan