PySpark average TFIDF features by group

Question

I have a collection of documents, each belonging to a specific page. I've computed the TFIDF scores for each document, and now I want to average those scores per page, across all of the documents that belong to it.

The desired output is an N (pages) x M (vocabulary terms) matrix. How would I go about doing this in Spark/PySpark?

from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, StopWordsRemover
from pyspark.ml import Pipeline

tokenizer = Tokenizer(inputCol="message", outputCol="tokens")
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol="filtered")
countVec = CountVectorizer(inputCol=remover.getOutputCol(), outputCol="features", binary=True)
idf = IDF(inputCol=countVec.getOutputCol(), outputCol="idffeatures")

pipeline = Pipeline(stages=[tokenizer, remover, countVec, idf])

model = pipeline.fit(sample_results)
prediction = model.transform(sample_results)

The pipeline outputs one row per document. The idffeatures column holds a sparse vector of the form (size, [indices], [values]), as below.

(466,[10,19,24,37,46,61,62,63,66,67,68,86,89,105,107,129,168,217,219,289,310,325,377,381,396,398,411,420,423],[1.6486586255873816,1.6486586255873816,1.8718021769015913,1.8718021769015913,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.159484249353372,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367,2.5649493574615367])
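
For readers unfamiliar with the layout, here is a minimal sketch of how a (size, indices, values) triple maps onto a dense vector. The helper name sparse_to_dense and the toy numbers are made up for illustration; in PySpark itself the SparseVector already exposes .size, .indices, .values and .toArray().

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse vector into a dense list."""
    dense = [0.0] * size
    # Each index names the vocabulary slot that the matching value belongs to.
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Hypothetical toy vector, much smaller than the 466-term one above.
dense = sparse_to_dense(6, [1, 4], [1.5, 2.0])
```

Every slot that is not listed in the indices is implicitly zero, which is why averaging per page has to account for the full vocabulary size and not just the stored values.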

score 0 · Answer 1 · answered Jul 18 '17 at 03:37

I came up with the answer below. It works, but I'm not sure it's the most efficient. I based it on this post.

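import numpy as np
from scipy.sparse import csr_matrix, vstack

def as_matrix(vec):
    # Convert a Spark SparseVector into a 1 x vocabulary-size SciPy CSR row.
    data, indices = vec.values, vec.indices
    shape = 1, vec.size
    return csr_matrix((data, indices, np.array([0, vec.values.size])), shape)

def as_array(m):
    # Stack a page's document rows and take the column-wise mean (1 x M result).
    return vstack(list(m)).mean(axis=0)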


mats = prediction.rdd.map(lambda x: (x['page_name'], as_matrix(x['idffeatures'])))
final = mats.groupByKey().mapValues(as_array).cache()

I then collect the result and stack it into a single 86 x 10000 NumPy matrix, one row per page. Everything runs, but rather slowly.

rows = final.collect()  # an RDD is not iterable; bring the (page_name, mean) pairs to the driver
labels = [r[0] for r in rows]
tf_matrix = np.vstack([r[1] for r in rows])
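
One possible reason for the slowness is that groupByKey shuffles every per-document vector before anything is averaged. A common alternative is to keep a running (sum, count) per page and divide at the end. The sketch below shows that idea in plain Python, with sparse vectors modelled as index -> value dicts. The names add_vector, merge_partials and finalize are hypothetical, and I have not benchmarked this; functions of this shape could in principle be passed to RDD.aggregateByKey(zeroValue, seqFunc, combFunc), but treat that as an assumption to verify on your own data.

```python
def add_vector(acc, vec):
    """Fold one sparse vector (index -> value dict) into a (sums, count) pair."""
    sums, count = acc
    for idx, val in vec.items():
        sums[idx] = sums.get(idx, 0.0) + val
    return sums, count + 1

def merge_partials(acc_a, acc_b):
    """Combine two partial (sums, count) pairs for the same page."""
    sums_a, count_a = acc_a
    sums_b, count_b = acc_b
    for idx, val in sums_b.items():
        sums_a[idx] = sums_a.get(idx, 0.0) + val
    return sums_a, count_a + count_b

def finalize(acc, size):
    """Turn a (sums, count) pair into a dense mean vector of length `size`."""
    sums, count = acc
    return [sums.get(i, 0.0) / count for i in range(size)]

# Toy data (made up): (page_name, sparse vector) pairs over a 3-term vocabulary.
docs = [
    ("page_a", {0: 2.0, 2: 4.0}),
    ("page_a", {0: 4.0}),
    ("page_b", {1: 3.0}),
]

# Local stand-in for the per-key aggregation Spark would perform.
partials = {}
for page, vec in docs:
    partials[page] = add_vector(partials.get(page, ({}, 0)), vec)

averages = {page: finalize(acc, 3) for page, acc in partials.items()}
```

Because only one accumulator per page has to be kept and merged, far less data needs to move between partitions than when every document row is grouped first. The dense conversion happens once per page at the very end.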
