I am running a Jupyter notebook on a system with 64 GB of RAM, 32 cores, and 500 GB of disk space.
Around 700k documents are to be modeled into 600 topics. The vocabulary size is 48,000 words, and 100 iterations are used.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import LDA

spark = (
    SparkSession.builder
    .appName("LDA")
    .master("local[*]")
    .config("spark.local.dir", "/data/Data/allYears/tempAll")
    .config("spark.driver.memory", "50g")
    .config("spark.executor.memory", "50g")
    .getOrCreate()
)
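Since the error points at disk usage, here is a small helper I plan to use to watch how fast that temp directory grows during the run (the path is the spark.local.dir set above; the sampling count and interval are arbitrary):

import os
import time

def dir_size_gb(path):
    # Sum the sizes of all files under path, in GB.
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # shuffle files can vanish while Spark cleans up
    return total / 1e9

# Sample the temp dir a few times while the job runs.
for _ in range(5):
    print(f"{dir_size_gb('/data/Data/allYears/tempAll'):.1f} GB")
    time.sleep(60)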
dataset = spark.read.format("libsvm").load("libsm_file.txt")
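As a quick sanity check that the load produced what I expect (the column names below are the defaults the libsvm reader creates):

dataset.printSchema()   # expect: label (double), features (vector)
print(dataset.count())  # should be around 700k rows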
lda = LDA(k=600, maxIter=100, optimizer="em", seed=2)
lda.setDocConcentration([1.01])
lda.setTopicConcentration(1.001)
model = lda.fit(dataset)
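If the fit ever completes, this is how I intend to inspect the result (describeTopics is the standard LDAModel method):

# Top 10 weighted terms per topic; the term indices map into the 48,000-word vocabulary.
topics = model.describeTopics(maxTermsPerTopic=10)
topics.show(5, truncate=False)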
The job fails with a "Disk quota exceeded" error after about 10 hours of running. What is filling up the disk, and how can I avoid it?
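One thing I am considering, since the EM optimizer builds up a long RDD lineage over 100 iterations: setting an explicit checkpoint directory so intermediate state can be truncated and cleaned up. Would something like this help (the directory and interval below are untested guesses on my part)?

# Checkpointing truncates the lineage that EM LDA accumulates across iterations.
spark.sparkContext.setCheckpointDir("/data/Data/allYears/checkpoints")

lda = LDA(k=600, maxIter=100, optimizer="em", seed=2,
          checkpointInterval=10)  # checkpoint every 10 iterations (the default)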