
On the one hand, I want to use Spark's capabilities to compute TF-IDF for a collection of documents; on the other hand, the typical definition of TF-IDF (which the Spark implementation is based on) does not fit my case. I want the TF to be the term frequency across all documents, whereas in the typical TF-IDF it is computed per (word, document) pair. My IDF definition is the same as the typical one.

I implemented my customized TF-IDF using Spark RDDs, but I was wondering if there is any way to customize the source of the Spark TF-IDF so that I can reuse its capabilities, like hashing.
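To make the variant concrete, here is a minimal plain-Java sketch (no Spark) of the weighting described above, as I understand it: TF counted over the whole corpus, combined with one common IDF form, `log(numDocs / docFreq)`. The class and method names are illustrative, not part of any Spark API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class GlobalTfIdf {
    // TF = a term's total frequency across the entire corpus (the custom variant),
    // IDF = log(numDocs / docFreq), the usual definition.
    public static Map<String, Double> compute(List<List<String>> docs) {
        Map<String, Double> tf = new HashMap<>();  // corpus-wide term counts
        Map<String, Double> df = new HashMap<>();  // number of docs containing each term
        for (List<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1.0, Double::sum);
            }
            for (String term : new HashSet<>(doc)) {  // count each doc at most once
                df.merge(term, 1.0, Double::sum);
            }
        }
        int n = docs.size();
        Map<String, Double> weights = new HashMap<>();
        tf.forEach((term, f) -> weights.put(term, f * Math.log(n / df.get(term))));
        return weights;
    }
}
```

For example, over the corpus `[["a", "b"], ["a"]]`, the term `a` gets weight `2 * log(2/2) = 0` and `b` gets `1 * log(2/1)`.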

Actually, I need something like:

public static class newHashingTF implements Something<String>

Thanks

Soheil Pourbafrani

1 Answer


It is pretty simple to implement different hashing strategies, as you can see by the simplicity of HashingTF:
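The referenced source excerpt appears to be missing here, so as a stand-in, here is a hedged Java sketch of the core idea behind `HashingTF`: hash each term to a bucket index and accumulate counts. The class name and the use of `String.hashCode` are assumptions for illustration (Spark's recent versions default to MurmurHash3); the point is that the hashing strategy is a single line you can swap out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleHashingTF {
    private final int numFeatures;

    public SimpleHashingTF(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Map each term to a bucket via a hash function and count occurrences.
    // String.hashCode stands in for whatever custom hashing strategy you
    // want to plug in (Spark's HashingTF defaults to MurmurHash3).
    public Map<Integer, Double> transform(List<String> terms) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String term : terms) {
            int idx = Math.floorMod(term.hashCode(), numFeatures); // non-negative bucket
            vector.merge(idx, 1.0, Double::sum);
        }
        return vector;
    }
}
```

Replacing `term.hashCode()` with your own function is all a different hashing strategy requires.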

This talk and its slides can help, and there are many others online.

Sim
  • Thanks a lot. I'm not familiar with Scala. Is there any Java example of that? – Soheil Pourbafrani Nov 04 '18 at 06:46
  • I don't know of any. BTW, you don't need to be a Scala guru to create an extension as you can use the structure from `HashingTF` and then just call your Java code at the right place. This should help: https://docs.scala-lang.org/tutorials/scala-for-java-programmers.html – Sim Nov 05 '18 at 06:38