I need to compare a large number of tweets that contain a particular hashtag and display the one with the most content. To do that, I need to compute the pairwise cosine similarity between all of them and output the tweet with the highest pairwise cosine similarity. I've been reading a lot about vector space models, tf-idf vectors, word2vec/doc2vec, etc., but haven't fully grasped any of them. I need to implement this in Java. Is there a Java alternative to scikit-learn's TfidfVectorizer or NLTK's synsets?
1 Answer
You can use Apache Mahout to vectorize all the text documents in a folder.
The first step is to convert the documents to sequence files; the second is to create vectors from those sequence files.
This page describes how to do it. Once you have the vectors, you can use the RowSimilarityJob class to compute the pairwise cosine similarities.
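If the number of tweets is small enough to fit in memory, the same idea can be tried without Mahout at all. What follows is only a minimal sketch in plain Java, not the Mahout pipeline: the class name `TweetSimilarity` and its methods are made up for illustration, the tokenizer is deliberately naive, the IDF uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, and "highest pairwise similarity" is assumed to mean the tweet whose summed cosine similarity to all other tweets is largest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of Mahout or any other library.
class TweetSimilarity {

    // Naive tokenizer: lowercase, split on anything that is not a-z, 0-9, '#' or '@'.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9#@]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Builds one sparse TF-IDF vector (term -> weight) per tweet.
    static List<Map<String, Double>> tfidf(List<String> docs) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : tokenize(doc)) tf.merge(term, 1, Integer::sum);
            for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
            counts.add(tf);
        }
        int n = docs.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> tf : counts) {
            Map<String, Double> vec = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                // Smoothed IDF (an assumption of this sketch); always positive.
                double idf = Math.log((1.0 + n) / (1.0 + docFreq.get(e.getKey()))) + 1.0;
                vec.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(vec);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) dot += e.getValue() * w;
        }
        for (double w : b.values()) normB += w * w;
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Index of the tweet with the largest summed similarity to all others,
    // or -1 for an empty list. This does O(n^2) comparisons.
    static int mostRepresentative(List<String> tweets) {
        List<Map<String, Double>> v = tfidf(tweets);
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < v.size(); i++) {
            double sum = 0.0;
            for (int j = 0; j < v.size(); j++) {
                if (i != j) sum += cosine(v.get(i), v.get(j));
            }
            if (sum > bestScore) {
                bestScore = sum;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList(
                "pizza for lunch #java",
                "java streams are great #java",
                "java streams and lambdas #java",
                "java streams are great and lambdas rock #java");
        System.out.println(tweets.get(mostRepresentative(tweets)));
    }
}
```

Because every pair is compared, this only scales to a few thousand tweets; beyond that, a distributed job such as Mahout's RowSimilarityJob is the more realistic route.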

Debasis