I have 150 text documents (a training set) that I would like to convert to a "bag of words" representation using PySpark and the MLlib "feature" package. I then have another 150 text documents (a testing set) that I would also like to convert to bags of words, the aim being to map each test document to the training-set document with the highest cosine similarity. To do this I will use TF-IDF weightings, which require the term frequencies of each document as well as document frequencies over the combined training set that I want to match against.
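(By cosine similarity I mean sim(a, b) = (a · b) / (||a|| ||b||), computed between the TF-IDF vector of a test document and the TF-IDF vector of a training document.)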
I am using this guide:
https://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html
Note that the guide has the comment "# Load documents (one per line)". Instead, for brevity, I load each text file from the same directory in a loop as follows:
import os
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.feature import IDF

# sc below is the SparkContext already provided by the pyspark shell
train = os.listdir("/home/ubuntu/TF-IDF/TrainingSet")  # list of training file names
hashingTF = HashingTF()

for i in range(0, len(train)):
    # create an RDD from each text file, split lines into words and (attempt to) hash them
    documents = sc.textFile("/home/ubuntu/TF-IDF/TrainingSet/" + train[i]).map(lambda line: line.split(" "))
    tf = hashingTF.transform(documents)

tf.count()  # run a count to check whether tf has worked as expected
But using tf.count() as a check, it returns 26, which is certainly wrong (my guess at the cause, and a loading sketch, are below after the IDF step). Nevertheless, from here:
idf = IDF(minDocFreq=1).fit(tf)
tfidf = idf.transform(tf)
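I suspect the 26 comes from the loading loop: each sc.textFile call yields one record per line, and tf is overwritten on every iteration, so it only reflects the lines of the last file. A rough sketch of how I think the training set could instead be loaded as a single RDD, one record per file via sc.wholeTextFiles (I am not sure this is the idiomatic way), would be:

# wholeTextFiles gives (filename, content) pairs, one record per file
docs = sc.wholeTextFiles("/home/ubuntu/TF-IDF/TrainingSet").map(lambda pair: pair[1].split(" "))
tf = hashingTF.transform(docs)  # one term-frequency vector per document
tf.count()                      # should now be 150, one per training file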
My question is: from here, how do I use this tfidf RDD together with cosine similarity to match each document in the test set to the training-set document it is most similar to?
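For concreteness, my current idea is just a sketch: the name test_docs is a placeholder for the test documents loaded the same way as the training set, it assumes the training tf/tfidf have been rebuilt over the whole training set as above, and I am not sure it is the right or a scalable approach. It reuses the IDF model fitted on the training set, collects the 150 training vectors to the driver, and picks the training index with the highest cosine similarity for each test vector:

import math

# TF-IDF for the test set, using the same hashingTF and the IDF model fitted on the training set
# test_docs is assumed to be an RDD of word lists, one per test document
test_tfidf = idf.transform(hashingTF.transform(test_docs))

# 150 training vectors are small enough to collect to the driver
train_vectors = tfidf.collect()

def cosine(u, v):
    # cosine similarity between two MLlib vectors, guarding against zero vectors
    denom = math.sqrt(u.dot(u)) * math.sqrt(v.dot(v))
    return u.dot(v) / denom if denom else 0.0

# for each test vector, the index of the training document with the highest cosine similarity
best_match = test_tfidf.map(
    lambda t: max(range(len(train_vectors)), key=lambda j: cosine(t, train_vectors[j]))
).collect()

Is collecting the training vectors and looping like this reasonable, or is there a more MLlib-native way to do the matching?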