PySpark FP-Growth implementation running slow

Question

I am using the pyspark.ml.fpm (FP Growth) implementation of association rule mining on Spark v2.3.

The Spark UI shows that the tasks at the end run very slowly. This seems to be a common problem and might be related to data skew.

Is this the real reason? Is there any solution for this?

I don't want to change the minSupport or minConfidence thresholds because that would affect my results. Removing columns isn't a solution either.

score 1 · Answer 1 · answered Feb 18 '20 at 08:34

I was facing a similar issue. One solution you might try is setting a threshold on the number of products in a transaction. If a few transactions have far more products than the average, the tree computed by FP Growth blows up. This increases the runtime significantly and makes memory errors much more likely.

Hence, removing outlier transactions with a disproportionate number of products might do the trick.
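
To give a rough feel for why a few very long transactions hurt so much: a basket with n distinct items contains 2^n - 1 non-empty itemsets. FP Growth only expands itemsets that meet minSupport, so this is an upper bound and not what Spark actually computes. The plain-Python sketch below is only an illustration; the function names and sample baskets are made up for this example.

```python
from itertools import combinations


def count_itemsets(basket):
    """Count the non-empty itemsets contained in one transaction by enumeration."""
    items = sorted(set(basket))  # duplicate items do not create new itemsets
    return sum(
        1
        for k in range(1, len(items) + 1)
        for _ in combinations(items, k)
    )


def itemset_upper_bound(n_items):
    """Closed form for the same count: 2**n - 1."""
    return 2 ** n_items - 1


# Hypothetical baskets, for illustration only.
typical = ["bread", "milk", "eggs"]
outlier = [f"item_{i}" for i in range(20)]

print(count_itemsets(typical))             # 7
print(itemset_upper_bound(len(outlier)))   # 1048575
```

A 20-item basket can contribute on the order of a million candidate itemsets while a 3-item basket contributes at most seven, which is why capping the basket size can help.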

Hope this helps you out a bit :)

Jasper


Thanks for the answer, Jasper. Unfortunately, all the transactions are the same size. I understand that the algorithm is exponential, but I was able to run this with the R arules implementation, so I am wondering why it's taking so long here. – Dyex719 Mar 20 '20 at 14:47

maz · Answer 2 · 2022-07-20T13:22:18.010

Late answer, but I also had an issue with long FPGrowth run times, and the above answer really helped. I implemented it as follows, filtering out any basket whose size is more than one standard deviation above the mean (this runs after the transactions have been grouped into baskets):

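from pyspark.sql.functions import col, size, mean as _mean, stddev as _stddev

def clean_transactions(df):
    # Expects an array column named "basket" (one row per grouped transaction).
    transactions_init = df.withColumn("basket_size", size("basket"))
    print('---collecting stats')
    # Single aggregation: mean and sample standard deviation of the basket sizes.
    df_stats = transactions_init.select(
        _mean(col('basket_size')).alias('mean'),
        _stddev(col('basket_size')).alias('std')
    ).collect()
    mean = df_stats[0]['mean']
    std = df_stats[0]['std']
    # Cutoff: one standard deviation above the mean basket size.
    max_ct = mean + std
    print("--filtering out outliers")
    # Keep only baskets at or below the cutoff.
    transactions_cleaned = transactions_init.filter(transactions_init.basket_size <= max_ct)
    return transactions_cleaned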
