
I am training an LDA model in pyspark (Spark 2.1.1) on a dataset of customer reviews. Now, based on that model, I want to predict the topics in new, unseen text.

I am using the following code to build the model:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F


path = "D:/sparkdata/sample_text_LDA.txt"
spark = SparkSession.builder.master("local[*]").appName("review").getOrCreate()
sc = spark.sparkContext
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)

# index each review and split it into words
data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex() \
    .map(lambda words: Row(idd=words[1], words=words[0].split(" "))).collect()

docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words", outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
cv = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = cv.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
# convert the ml CountVectorizer output to mllib vectors, which mllib's LDA.train expects
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0], Vectors.fromML(x[1])]).cache()

# Cluster the documents topics using LDA
ldaModel = LDA.train(corpus, k=3, maxIterations=100, optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10  # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic=wordNumbers))
def topic_render(topic):  # map word indices back to the actual words
    terms = topic[0]
    result = []
    for i in range(wordNumbers):
        term = vocabArray[terms[i]]
        result.append(term)
    return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')

Now I have a dataframe with a column containing new customer reviews, and I want to predict which topic cluster each of them belongs to. I searched for answers, and the following approach is mostly recommended, as in Spark MLlib LDA, how to infer the topics distribution of a new unseen document?:

newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

However, I get the following error:

'LDAModel' object has no attribute 'toLocal'. Nor does it have a topicDistributions attribute.

Are these attributes not supported in Spark 2.1.1?

Is there any other way to infer topics from unseen data?
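
The closest alternative I can see is the DataFrame-based API, pyspark.ml.clustering.LDA, whose fitted model has a transform() method that adds a topicDistribution column. A rough, untested sketch reusing the names from my code above (new_docDF for the unseen reviews is hypothetical):

from pyspark.ml.clustering import LDA as MLLDA

# fit directly on the ml CountVectorizer output; no Vectors.fromML conversion needed here
ml_lda = MLLDA(k=3, maxIter=100, optimizer="online", featuresCol="vectors")
ml_lda_model = ml_lda.fit(result)

# score unseen reviews by running them through the same remover and CountVectorizer model first
# new_result = model.transform(remover.transform(new_docDF))
# ml_lda_model.transform(new_result).select("idd", "topicDistribution").show(truncate=False)

Would that be a valid way to do it?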

Usman Khan

1 Answer


You're going to need to pre-process the new data:

# import a new data set to be passed through the pre-trained LDA
import pandas as pd
import gensim

data_new = pd.read_csv('YourNew.csv', encoding="ISO-8859-1")
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index

documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])

# process the new data set through the lemmatization and stopword functions
processed_docs_new = documents_new['Your Target Column'].map(preprocess)

# create a dictionary of individual words and filter the dictionary
# (ideally, reuse the dictionary that was built at training time so the word ids
#  line up with the topics the LDA model learned)
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]
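
The preprocess function used above comes from the answerer's own pipeline and is not shown here. A minimal sketch of what such a function might look like, assuming gensim's simple_preprocess and STOPWORDS plus NLTK's WordNetLemmatizer (these specific choices are my assumption, not the original code):

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
from nltk.stem import WordNetLemmatizer  # needs the nltk 'wordnet' data downloaded

lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # lowercase and tokenize, drop stop words and very short tokens, then lemmatize
    tokens = simple_preprocess(text, deacc=True)
    return [lemmatizer.lemmatize(tok) for tok in tokens
            if tok not in STOPWORDS and len(tok) > 3]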

Then you can just pass it through the trained LDA model as a function; all you need is that bow_corpus:

ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
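
Each element of that result is the topic distribution for one document, as a list of (topic_id, probability) pairs, so you can sanity-check a single review like this (the printed numbers are only illustrative):

first_doc_topics = ldamodel[bow_corpus_new[0]]
print(first_doc_topics)  # e.g. [(0, 0.08), (1, 0.85), (2, 0.07)]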

If you want it out in a CSV, try this:

a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new

topic_0=[]
topic_1=[]
topic_2=[]

for i in a:
    # assumes every document comes back with all three topics, in topic-id order
    topic_0.append(i[0][1])
    topic_1.append(i[1][1])
    topic_2.append(i[2][1])
    
d = {'Your Target Column': b['Your Target Column'].tolist(),
     'topic_0': topic_0,
     'topic_1': topic_1,
     'topic_2': topic_2}
     
df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode = 'a')
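
One caveat with that loop: gensim only returns topics whose probability exceeds the model's minimum_probability, so some documents can come back with fewer than three (topic, probability) pairs and the i[1]/i[2] indexing will raise an IndexError. A small variant that forces the full distribution for every document, using get_document_topics from gensim's LdaModel:

a = [ldamodel.get_document_topics(doc, minimum_probability=0.0)
     for doc in bow_corpus_new]

Note also that the ldamodel[...] syntax is gensim-specific; calling it on a Spark LDAModel will fail (likely the "not subscriptable" error mentioned in the comments below).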

I hope this helps :)

Sara
  • I hope all this will work more or less the same in spark. Right? – Usman Khan Apr 24 '19 at 04:15
  • I was majorly asking whether the "topicDistributions()" and "toLocal" attribute are available in spark for returning the distribution over topics for the document, if not what are the alternatives? Your answer is great but I don't know if it would work with pyspark 2.1.1. – Usman Khan Apr 24 '19 at 07:49
  • @UsmanKhan Alternatives that work for your purposes include the MALLET wrapper and gensim's LDA. The later portion of the code uses pandas for topic allocation; Spark only trains the model, so you need to use the dataframe to allocate. – Sara Apr 24 '19 at 15:59
  • I'm getting `ldamodel is not subscriptable` error, spark 2.4.4 – Jaskaran Singh Puri Nov 17 '19 at 07:35
  • How does the corpus work with the new model versus the old model where it was trained? – Shubh Jul 05 '22 at 02:40