6

I am following this example from the Spark documentation for calculating TF-IDF over a set of documents. Spark uses the hashing trick for this calculation, so at the end you get a Vector containing the hashed words and their corresponding weights, but... how can I get the words back from the hashes?

Do I really have to hash all the words and save them in a map, then iterate through it later looking for the keywords? Is there no more efficient way built into Spark?

Thanks in advance

maasg

3 Answers

6

The transformation of a String to a hash in HashingTF results in a non-negative integer between 0 (inclusive) and numFeatures (exclusive; default 2^20), using org.apache.spark.util.Utils.nonNegativeMod(int, int) on the term's hash code.

The original string is lost; there is no way to convert from the resulting integer to the input string.
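To see why, here is a minimal plain-Java sketch of the bucketing step. The `nonNegativeMod` helper mirrors the logic of Spark's `Utils.nonNegativeMod` (this is an illustration, not Spark's actual source), and the term's Java `hashCode()` stands in for the hash HashingTF computes:

```java
public class HashingTrickDemo {
    // Same idea as Utils.nonNegativeMod: a modulo that is never negative,
    // even when the input hash code is negative.
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    public static void main(String[] args) {
        int numFeatures = 1 << 20; // HashingTF default, 2^20
        String term = "spark";
        int index = nonNegativeMod(term.hashCode(), numFeatures);
        System.out.println(term + " -> " + index);
        // The mapping is many-to-one: distinct strings can land on the same
        // index, so there is no inverse from index back to term.
    }
}
```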

Tim Hennekey
  • _"The original string is lost; there is no way to convert from the resulting integer to the input string."_ Note that this is true for any good hash function: they're deliberately one-way. – Matt Ball May 22 '15 at 14:55
  • 2
    Cryptographic hash functions are one-way hashes. Standard hash functions are not interested in the one-way property. For example, the standard Java String hash is pretty easy to invert for (very) short strings. The goal is usually to minimize collisions in the output space. – David Feb 28 '17 at 22:44
5

If you use CountVectorizer instead of HashingTF (TF-IDF is basically the HashingTF transform followed by an IDF fit), it is probably better suited to your need, because you can recover the indexed vocabulary:

String[] vocabulary = countVectorizerModel.vocabulary();

so you know where to find them.

For instance, given a resulting SparseVector like (11,[0,1,3],[1.0,..., where [0,1,3] represents the indices of the vocabulary terms encountered in the respective text, you can get the terms back by referring to:

vocabulary[index]

If you need to do that in the context of LDA topic terms, the solution is the same.
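As a self-contained sketch of that lookup (the vocabulary array below is a made-up stand-in; in Spark it would come from `countVectorizerModel.vocabulary()`):

```java
import java.util.ArrayList;
import java.util.List;

public class VocabularyLookup {
    // Map the indices of a SparseVector back to terms: index i in the
    // vector corresponds to entry i in the vocabulary array.
    static List<String> termsFor(int[] indices, String[] vocabulary) {
        List<String> terms = new ArrayList<>();
        for (int i : indices) {
            terms.add(vocabulary[i]);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Hypothetical vocabulary, as CountVectorizerModel would provide it.
        String[] vocabulary = {"spark", "hashing", "tfidf", "vector"};
        int[] indices = {0, 1, 3}; // indices from the SparseVector example
        System.out.println(termsFor(indices, vocabulary)); // [spark, hashing, vector]
    }
}
```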

marilena.oita
4

You need to build a dictionary that maps every token in your data set to its hash value. But since you are using the hashing trick, there may be hash collisions, so the mapping is not perfectly invertible.
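A minimal plain-Java sketch of that dictionary approach (the `nonNegativeMod` helper imitates what Spark applies to each term's hash code; this is an illustration, not Spark API). Each vector index ends up mapping to the *set* of known tokens that hashed to it, which is why collisions make the inversion approximate:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ReverseHashDictionary {
    // Imitates the non-negative modulo Spark uses for bucketing.
    static int nonNegativeMod(int x, int mod) {
        int rawMod = x % mod;
        return rawMod + (rawMod < 0 ? mod : 0);
    }

    // Hash every token once and keep the reverse mapping index -> tokens.
    static Map<Integer, Set<String>> buildReverseMap(Iterable<String> tokens, int numFeatures) {
        Map<Integer, Set<String>> reverse = new HashMap<>();
        for (String token : tokens) {
            int index = nonNegativeMod(token.hashCode(), numFeatures);
            reverse.computeIfAbsent(index, k -> new HashSet<>()).add(token);
        }
        return reverse;
    }

    public static void main(String[] args) {
        List<String> corpusTokens = Arrays.asList("spark", "hadoop", "spark", "flink");
        Map<Integer, Set<String>> reverse = buildReverseMap(corpusTokens, 1 << 20);
        // Look up which token(s) a given vector index could stand for.
        reverse.forEach((idx, toks) -> System.out.println(idx + " -> " + toks));
    }
}
```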

David