I have a fairly simple use case, but a potentially very large result set. My code does the following (in the pyspark shell):

from pyspark.mllib.fpm import FPGrowth
data = sc.textFile("/Users/me/associationtestproject/data/sourcedata.txt")
transactions = data.map(lambda line: line.strip().split(' '))
model = FPGrowth.train(transactions, minSupport=0.000001, numPartitions=1000)
# Perform any RDD operation
for item in model.freqItemsets().toLocalIterator():
    pass  # do something with each FreqItemset (item.items, item.freq)

I find that whenever I kick off the actual processing by calling either count() or toLocalIterator(), the operation ultimately fails with an out-of-memory error. Is FPGrowth not partitioning my data? Is my result data so big that fetching even a single partition chokes my memory? If so, is there a way I can persist the RDD to disk in a "streaming" fashion without trying to hold it in memory?
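
To make the "streaming to disk" part of the question concrete, this is roughly what I have in mind (a sketch only; the output path is made up): writing the frequent itemsets straight from the executors instead of iterating over them on the driver.

from pyspark import StorageLevel

freq = model.freqItemsets()
# Keep the result partitions on executor disk instead of in driver memory
freq.persist(StorageLevel.DISK_ONLY)
# Each executor writes its own partitions; nothing is collected on the driver
# (the output path below is made up for illustration)
freq.map(lambda fi: (list(fi.items), fi.freq)).saveAsTextFile("/Users/me/associationtestproject/output")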

Thanks for any insights.

Edit: A fundamental limitation of FPGrowth is that the entire FP Tree has to fit in memory. So, the suggestions about raising the minimum support threshold are valid.
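
For example (the numbers here are purely illustrative), deriving the threshold from a minimum occurrence count instead of picking an absolute value keeps the result set bounded:

# Illustrative only: derive minSupport from a desired minimum occurrence count
num_transactions = transactions.count()              # ~177468 in my case
min_count = 100                                      # arbitrary example
min_support = min_count / float(num_transactions)    # ~5.6e-4

model = FPGrowth.train(transactions, minSupport=min_support, numPartitions=1000)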

-Raj

  • How much memory do you have, how many products do you have? – Alberto Bonsanto Jan 09 '16 at 14:19
  • Hi Alberto: When running locally, I am giving the driver 12G of memory. My input file is fairly large: 177468 rows with a fairly large number of items in each. – Raj Jan 11 '16 at 00:39

1 Answer

Well, the problem is most likely the support threshold. When you set a value as low as this one (I wouldn't call one-in-a-million frequent), you basically throw away all the benefits of the downward-closure property.

It means that the number of itemsets considered grows exponentially, and in the worst case it is equal to 2^N - 1, where N is the number of items. Unless you have toy data with a very small number of items, it is simply not feasible.
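
To see how quickly that blows up, a back-of-the-envelope illustration:

# Worst-case number of non-empty itemsets over N distinct items: 2^N - 1
for n in (10, 30, 100):
    print(n, 2 ** n - 1)

# 10  -> 1023
# 30  -> 1073741823
# 100 -> 1267650600228229401496703205375  (~1.3e30)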

Edit:

Note that with ~200K transactions (information taken from the comments) and a support threshold of 1e-6, every itemset that appears in your data at all has to be frequent. So basically what you're trying to do here is enumerate all observed itemsets.
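
A quick sanity check with the row count from the comments shows why: the minimum occurrence count implied by that threshold is below one, so appearing a single time is already enough.

num_transactions = 177468               # from the comments above
min_support = 1e-6

# Minimum number of occurrences an itemset needs to count as "frequent"
print(min_support * num_transactions)   # 0.177468 -> one occurrence is enough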

zero323
  • I agree with your assessment. However, my hope was that MLlib would scale to a large data set. Looking under the hood in FPGrowth.scala, I see the genFreqItems() method performing a collect(). I am wondering if there is a way this could be rewritten to avoid a full collect(), or to replace it with a local iterator. Thoughts? – Raj Jan 11 '16 at 01:02
  • It doesn't matter. You simply cannot win with an exponential complexity. Even if you ignore the math just think about it for a moment. With ~200K transactions and threshold 1e-6 every itemset of size one is frequent. Every itemset of size 2 which can be found in your data will be frequent as well. And so on... Even if you could handle the complexity it cannot provide any useful information – zero323 Jan 11 '16 at 01:21
  • Hi @zero323, I agree it is not going to be useful. But in some ways that is beside the point. The real issue is that this implementation of FPGrowth does not scale to that size. – Raj Jan 11 '16 at 07:06
  • You were right. This algorithm requires the entire FP tree to be resident in memory. I have to raise the threshold. – Raj Jan 13 '16 at 22:06