
On the one hand, I want to use Spark's capabilities to compute TF-IDF for a collection of documents; on the other hand, the typical definition of TF-IDF (which the Spark implementation is based on) does not fit my case. I want the TF to be the term frequency across all documents, whereas in the typical TF-IDF it is computed per (word, document) pair. My IDF definition is the same as the typical one.

I implemented my customized TF-IDF using Spark RDDs, but I was wondering if there is any way to customize the source of the Spark TF-IDF so that I can reuse its capabilities, like hashing.
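To make the variant concrete, here is a minimal plain-Java sketch (no Spark) of the weighting described above, as I understand it: TF counted over the whole corpus, combined with one common IDF form, `log(numDocs / docFreq)`. The class and method names are illustrative, not part of any Spark API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class GlobalTfIdf {
    // TF = a term's total frequency across the entire corpus (the custom variant),
    // IDF = log(numDocs / docFreq), the usual definition.
    public static Map<String, Double> compute(List<List<String>> docs) {
        Map<String, Double> tf = new HashMap<>();  // corpus-wide term counts
        Map<String, Double> df = new HashMap<>();  // number of docs containing each term
        for (List<String> doc : docs) {
            for (String term : doc) {
                tf.merge(term, 1.0, Double::sum);
            }
            for (String term : new HashSet<>(doc)) {  // count each doc at most once
                df.merge(term, 1.0, Double::sum);
            }
        }
        int n = docs.size();
        Map<String, Double> weights = new HashMap<>();
        tf.forEach((term, f) -> weights.put(term, f * Math.log(n / df.get(term))));
        return weights;
    }
}
```

For example, over the corpus `[["a", "b"], ["a"]]`, the term `a` gets weight `2 * log(2/2) = 0` and `b` gets `1 * log(2/1)`.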

Actually, I need something like:

public static class newHashingTF implements Something<String>

Thanks

Soheil Pourbafrani

1 Answer


It is pretty simple to implement different hashing strategies, as you can see by the simplicity of HashingTF:
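The referenced source excerpt appears to be missing here, so as a stand-in, here is a hedged Java sketch of the core idea behind `HashingTF`: hash each term to a bucket index and accumulate counts. The class name and the use of `String.hashCode` are assumptions for illustration (Spark's recent versions default to MurmurHash3); the point is that the hashing strategy is a single line you can swap out.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SimpleHashingTF {
    private final int numFeatures;

    public SimpleHashingTF(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    // Map each term to a bucket via a hash function and count occurrences.
    // String.hashCode stands in for whatever custom hashing strategy you
    // want to plug in (Spark's HashingTF defaults to MurmurHash3).
    public Map<Integer, Double> transform(List<String> terms) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String term : terms) {
            int idx = Math.floorMod(term.hashCode(), numFeatures); // non-negative bucket
            vector.merge(idx, 1.0, Double::sum);
        }
        return vector;
    }
}
```

Replacing `term.hashCode()` with your own function is all a different hashing strategy requires.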

This talk and its slides can help, and there are many others online.

Sim
  • Thanks a lot. I'm not familiar with Scala. Is there any Java example of that? – Soheil Pourbafrani Nov 04 '18 at 06:46
  • I don't know of any. BTW, you don't need to be a Scala guru to create an extension as you can use the structure from `HashingTF` and then just call your Java code at the right place. This should help: https://docs.scala-lang.org/tutorials/scala-for-java-programmers.html – Sim Nov 05 '18 at 06:38