I am trying to write a Java method that replicates scikit-learn's Python FeatureHasher.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html

Below is the Python code:

>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

I am using the Guava library (guava:29.0-jre) to mimic the transformation above with the code below. However, hashing with MurmurHash3 only gives me a byte array in Java. What I need is a sparse matrix like the one the Python code produces.

Here is the Java code:

byte[] bytes = Hashing.murmur3_128(16384).hashString("com.xyz.ad.demo", UTF_8).asBytes();

How do I generate a sparse matrix using the Guava library?

molbdnilo
Tuhin Subhra Mandal
  • Note: it's not "sparse metrics" but "sparse matrix". – Thomas Jul 22 '21 at 06:45
  • Btw, the array you've posted looks more like a table than a hash. The code looks like you'd basically have an array per object and `n_features=10` seems to indicate that those arrays have 10 elements. The numbers then seem to correspond to the values provided (why some are negative I can't tell) and the indices might just be the property name's hash. So instead of `asBytes()` why don't you use `asInt()` and do `%n_features` to get the index? – Thomas Jul 22 '21 at 07:01
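
The idea in the comment above (hash the feature name to an int, reduce it modulo n_features to get a column) can be sketched roughly as follows. This is a minimal sketch, not a verified port of FeatureHasher. It assumes, from my understanding of scikit-learn, that the hash is a signed 32-bit MurmurHash3 with seed 0, that the column is abs(hash) % n_features, and that the value's sign is flipped when the hash is negative (which would explain the negative entries). The class name `FeatureHasherSketch` and its methods are made up for illustration. The hash is written out by hand so the snippet runs without dependencies. With Guava on the classpath, `Hashing.murmur3_32().hashString(key, UTF_8).asInt()` should be usable in its place for plain ASCII keys. Note that a plain `asInt() % n_features` can be negative in Java, hence the absolute value.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class FeatureHasherSketch {

    /** MurmurHash3 (x86, 32-bit) over a byte array; returns a signed int. */
    static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int nBlocks = data.length / 4;
        // Body: process the input four bytes at a time, little-endian.
        for (int i = 0; i < nBlocks; i++) {
            int o = i * 4;
            int k = (data[o] & 0xff)
                  | ((data[o + 1] & 0xff) << 8)
                  | ((data[o + 2] & 0xff) << 16)
                  | ((data[o + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }
        // Tail: the remaining 1 to 3 bytes, if any.
        int tail = nBlocks * 4;
        int k = 0;
        switch (data.length & 3) {
            case 3: k ^= (data[tail + 2] & 0xff) << 16; // fall through
            case 2: k ^= (data[tail + 1] & 0xff) << 8;  // fall through
            case 1: k ^= (data[tail] & 0xff);
                    k *= c1;
                    k = Integer.rotateLeft(k, 15);
                    k *= c2;
                    h ^= k;
        }
        // Finalization mix.
        h ^= data.length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    /** Hashes one sample (feature name -> value) into a sparse row (column -> value). */
    static Map<Integer, Double> transform(Map<String, Double> sample, int nFeatures) {
        Map<Integer, Double> row = new TreeMap<>();
        for (Map.Entry<String, Double> e : sample.entrySet()) {
            int h = murmur3_32(e.getKey().getBytes(StandardCharsets.UTF_8), 0);
            // Widen to long before abs so Integer.MIN_VALUE cannot overflow.
            int column = (int) (Math.abs((long) h) % nFeatures);
            // Assumption: a negative hash flips the sign of the value.
            double signed = (h >= 0 ? 1.0 : -1.0) * e.getValue();
            // Features that collide on the same column are summed.
            row.merge(column, signed, Double::sum);
        }
        return row;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> d = List.of(
                Map.of("dog", 1.0, "cat", 2.0, "elephant", 4.0),
                Map.of("dog", 2.0, "run", 5.0));
        for (Map<String, Double> sample : d) {
            // Only non-zero columns are stored, so each map is one sparse row.
            System.out.println(transform(sample, 10));
        }
    }
}
```

Each row is kept as a map from column index to value, which is one simple way to hold a sparse row. If a real sparse matrix type is needed, the (row, column, value) triples can be fed into whatever matrix library is in use. Whether the output matches the Python array exactly depends on the assumptions above holding, so it is worth comparing against the scikit-learn result for a few inputs.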
