On the one hand, I want to use Spark's capabilities to compute TF-IDF for a collection of documents; on the other hand, the typical definition of TF-IDF (which Spark's implementation is based on) does not fit my case. I want TF to be the term frequency across all documents, whereas in the typical TF-IDF it is computed per (word, document) pair. The IDF definition is the same as the standard one.
I have implemented my customized TF-IDF using Spark RDDs, but I was wondering whether there is any way to customize the source of Spark's TF-IDF so that I can reuse its capabilities, such as hashing.
Actually, I need something like:
public static class newHashingTF implements Something<String>
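For illustration, here is a minimal plain-Java sketch of the behavior I am after: a corpus-wide TF combined with Spark's smoothed IDF formula, log((N + 1) / (df + 1)), using the same hashing trick that HashingTF uses to map terms to vector indices. The class and method names (`GlobalHashingTF`, `globalTf`) are my own invention, not part of Spark's API:

```java
import java.util.*;

// Hypothetical sketch: corpus-level TF plus standard (Spark-style) IDF,
// mapping terms to buckets with the hashing trick like Spark's HashingTF.
// Names here are illustrative only, not Spark API.
public class GlobalHashingTF {
    private final int numFeatures;

    public GlobalHashingTF(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Map a term to a bucket index (non-negative hash mod numFeatures),
    // analogous to what HashingTF does internally.
    public int indexOf(String term) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Term frequency aggregated over ALL documents, not per document.
    public double[] globalTf(List<List<String>> docs) {
        double[] tf = new double[numFeatures];
        for (List<String> doc : docs)
            for (String term : doc)
                tf[indexOf(term)] += 1.0;
        return tf;
    }

    // Smoothed IDF: log((N + 1) / (df + 1)), where df counts the number
    // of documents in which the bucket appears at least once.
    public double[] idf(List<List<String>> docs) {
        double[] df = new double[numFeatures];
        for (List<String> doc : docs) {
            Set<Integer> seen = new HashSet<>();
            for (String term : doc)
                seen.add(indexOf(term));
            for (int i : seen)
                df[i] += 1.0;
        }
        double[] idf = new double[numFeatures];
        int n = docs.size();
        for (int i = 0; i < numFeatures; i++)
            idf[i] = Math.log((n + 1.0) / (df[i] + 1.0));
        return idf;
    }
}
```

This captures the semantics (global TF, per-document DF) without Spark; the question is how to plug the same idea into Spark's own TF-IDF machinery instead of reimplementing it over RDDs.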
Thanks