I have 150 text documents (a training set) that I would like to turn into "bag of words" representations with PySpark and the mllib "feature" package. I then have another 150 text documents (a testing set) that I would also like to convert to bags of words, the aim being to map each document in the test set to the training-set document with the highest cosine similarity. To do this I will use TF-IDF weightings: this requires term frequencies for each document in the test set and for the combined training set that I would like to match against.

I am using this guide:

https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html

Note that it has a comment "# Load documents (one per line)". Instead, however, for succinctness I load each text file in a loop from the same directory, as follows:

import os
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

sc = SparkContext()  # or reuse the sc that the pyspark shell provides

train = os.listdir("/home/ubuntu/TF-IDF/TrainingSet")  # create a list of training file names
hashingTF = HashingTF()

for i in range(0, len(train)):  # read each text file, split it into words and (attempt to) hash it
    documents = sc.textFile("/home/ubuntu/TF-IDF/TrainingSet/" + train[i]).map(lambda line: line.split(" "))
    tf = hashingTF.transform(documents)

tf.count()  # run a count to check whether tf has worked as expected

But with tf.count() as a check, I find it gives an answer of 26, which is certainly wrong. Nevertheless, from here:

idf = IDF(minDocFreq=1).fit(tf)

tfidf = idf.transform(tf)

My question is: from here, how do I use this tfidf and cosine similarity to match the text documents in the test set to those in the training set?


1 Answer


Well, I see two things in your question. First is the concept of what you are trying to do. Let me see if I understood correctly: you have two sets of documents, a 150-document training set and a 150-document test set, and you want to create vector (matrix) representations of them with TF-IDF. Then, for each document in the test set, you want to find the training document with the highest cosine similarity.

The first thing is that you have to be careful about how you do it. You can do it independently: create two matrices, one for the training set and another for the testing set. Then you have to check that every term in one is also present in the other, or add the missing terms to each, so that you can compute cosine similarity properly; you need vectors with the same number and order of columns, or you will get an error. It is also important to notice that there will be two IDF calculations, one for the training set and one for the testing set. If you introduce a bias when selecting the sets, the IDFs could differ a lot between them, so be careful here too.
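To make the column-alignment point concrete, here is a minimal sketch of this "two separate matrices" route. It assumes train_docs and test_docs already exist as RDDs of token lists (those names are illustrative, not from the question); reusing one HashingTF keeps both sets in the same hash space, while each set still gets its own IDF model, as described above.

from pyspark.mllib.feature import HashingTF, IDF

hashingTF = HashingTF()                      # one hash space shared by both sets

train_tf = hashingTF.transform(train_docs)   # RDD of term-frequency vectors
test_tf = hashingTF.transform(test_docs)
train_tf.cache()
test_tf.cache()

train_tfidf = IDF(minDocFreq=1).fit(train_tf).transform(train_tf)
test_tfidf = IDF(minDocFreq=1).fit(test_tf).transform(test_tf)    # separate IDF calculation per set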

Or you can put them together and build one matrix with TF-IDF, knowing that the first 150 rows are training and the second 150 are testing, or keeping some sort of index somewhere. This way you make sure both sets are built in the same space, with the same column vectors, and with the IDF calculated over the entire collection. Then you can compute cosine similarity.
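A minimal sketch of this combined route, assuming one document per file in the two directories below (the paths, the TestingSet directory and the variable names are illustrative, and sc is the usual SparkContext). It relies on the fact that these map-only steps preserve row order, so the file names can be lined up with the resulting vectors.

from pyspark.mllib.feature import HashingTF, IDF

train_files = sc.wholeTextFiles("/home/ubuntu/TF-IDF/TrainingSet").sortByKey()
test_files = sc.wholeTextFiles("/home/ubuntu/TF-IDF/TestingSet").sortByKey()

all_docs = train_files.union(test_files)                  # 300 (filename, text) pairs, training first
names = all_docs.keys().collect()                         # remember which row is which file
tokens = all_docs.values().map(lambda text: text.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(tokens)
tf.cache()                                                # IDF makes two passes over tf
tfidf = IDF(minDocFreq=1).fit(tf).transform(tf)           # one IDF over the whole 300-document set

vectors = tfidf.collect()                                 # names[i] labels vectors[i]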

Now, in your code, tf should be one vector per document. I guess your len(train) is bigger than 26, so there is probably an error there: the loop reassigns documents and tf on every iteration, so tf.count() only counts the lines of the last file it read. Calculating cosine is rather simple; here is an example about it. It is defined for a pair of vectors, so you will need to run it in a loop.

Spark Cosine Similarity (DIMSUM algorithm) sparse input file
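As a rough sketch of that loop, continuing from the combined-matrix example above (names and vectors are assumed to hold the 300 file names and their tf-idf vectors, training rows first); with only 150 x 150 pairs, a plain local loop is enough:

import math

def cosine(a, b):
    # cosine similarity between two mllib vectors via their dot products
    denom = math.sqrt(a.dot(a)) * math.sqrt(b.dot(b))
    return a.dot(b) / denom if denom else 0.0

train_vecs, test_vecs = vectors[:150], vectors[150:]
for j, t in enumerate(test_vecs):
    best = max(range(len(train_vecs)), key=lambda i: cosine(t, train_vecs[i]))
    print(names[150 + j] + " -> " + names[best])          # test file -> closest training file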

  • Hi VComas, thanks for helping out. I found it a hard concept to write down, but yes, you've understood correctly! I followed another question that I found on Stack Exchange too. My main concern is that when I use my test tf vectors, the test documents will contain words that are not in the training IDF, i.e. their weighting will be zero and there will be no place in the TF-IDF matrix for these extra features (like you say above). Moreover, I have got a hash-table TF-IDF by using the methods in the original link; how do I extract the vectors for cosine similarity? I will make a new question – Matt Aug 11 '15 at 15:17
  • You can put them together and build one matrix; this way you guarantee the vectors are the same. Keep the indexes of train and test so you can compare them later. It is important for cosine not only that the same terms are there, but also that they are in the same order. – Dr VComas Aug 11 '15 at 17:11
  • Okay, so how will I know which of the training documents the test documents match to? The problem lies in the fact that HashingTF() seems to eliminate the original index, so you can't know which document you're looking at. Hopefully I'm just overcomplicating this but it feels like a big issue to me.. http://stackoverflow.com/questions/31946507/sparse-vector-rdd-in-pyspark – Matt Aug 11 '15 at 18:51
  • I see, you are saying you lose the index. But it should be a vector per document in any case; can you process them as tuples, like (index, whatever you want to do with it)? – Dr VComas Aug 11 '15 at 18:58
  • I think I may have come up with a solution. When you run collect() on these RDDs you can see that they are just a list of sparse vectors, so I will just use .dot() to find the dot products I need. Cosine similarity is given by a · b / (|a| * |b|), so I will use a.dot(b) / (a.dot(a) * b.dot(b))^0.5 or something similar. https://spark.apache.org/docs/1.1.0/api/python/pyspark.mllib.linalg.SparseVector-class.html – Matt Aug 11 '15 at 20:35