1

I wrote an Apriori data-mining algorithm. It works well on small test data, but I am having trouble running it on bigger data sets.

I am trying to generate rules for items that were frequently bought together.

My small test data is 5 transactions and 10 products.

My big test data is 11 million transactions and around 2700 products.

Problem: computing min-support and filtering out non-frequent items. Let's say we are interested in items whose frequency is 60% or more: frequency = 0.60;

When I compute min-support for the small data set with a 60% frequency, the algorithm removes all items that were bought fewer than 3 times. Min-support = numberOfTransactions * frequency;
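The threshold computation described above can be sketched as follows. This is only an illustration: `min_support_count` is a hypothetical helper name, and rounding up to a whole number of transactions is an assumption, not something taken from the original code.

```python
import math

def min_support_count(num_transactions: int, frequency: float) -> int:
    """Smallest number of transactions an item must appear in to be kept.

    Assumption: the product is rounded up to a whole transaction count.
    The inner round() only guards against floating-point noise.
    """
    return math.ceil(round(num_transactions * frequency, 9))

# Small data set: 5 transactions at 60%.
print(min_support_count(5, 0.60))            # 3
# Large data set: 11 million transactions at 60%.
print(min_support_count(11_000_000, 0.60))   # 6600000
# Same data set at a frequency of 0.0005.
print(min_support_count(11_000_000, 0.0005)) # 5500
```

At 60%, an item in the large data set would have to appear in 6.6 million transactions, which is a very different bar from 3 out of 5.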

But when I try the same thing on the large data set, the algorithm filters out almost every itemset after the first iteration; only a couple of items meet such a threshold.

So I started lowering that threshold step by step, running the algorithm many times. Not even 5% gives the desired results. I had to lower the frequency to 0.0005 before at least 50% of the items survived the first iteration.

What do you think about this situation? Could it be a data problem, since the data is generated artificially (a version of Microsoft AdventureWorks)? Or is it a problem with my code or my min-support computation?

Can you suggest another solution or a better way of doing this?

Thanks!

John Latham

2 Answers

0

Maybe that is just what your data is like.

If you have a lot of different items, and few items per transaction, the chances of items co-occurring are low.

Did you verify the result? Is it pruning incorrectly, or is the algorithm correct and your parameters bad?

Can you actually name an itemset that Apriori pruned but shouldn't have?
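One way to check a suspect itemset is to count its support by brute force and compare that with the threshold. The sketch below assumes transactions are available as plain Python collections; `support_count` and the sample data are hypothetical.

```python
def support_count(transactions, itemset):
    """Count the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted.issubset(t))

# Toy data, purely for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["bread", "milk", "eggs"],
    ["jam"],
]

min_count = 2
count = support_count(transactions, ["bread", "milk"])
# If count >= min_count but the implementation dropped the itemset,
# the pruning is wrong; otherwise the threshold is simply too high.
print(count, count >= min_count)  # 2 True
```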

The problem is that, yes, choosing the parameters is hard. And no, Apriori cannot use an adaptive threshold, because that wouldn't satisfy the monotonicity requirement. You must use the same threshold for all itemset sizes.
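To show where the single threshold enters, here is a minimal, unoptimized sketch of Apriori. It is not the asker's code; the function name and the toy data are assumptions. The prune step relies on every subset of a frequent itemset also being frequent, which only holds when all levels use the same `min_count`.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count the surviving candidates against the same threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result

data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
found = apriori(data, min_count=3)
print(len(found))  # 6: three single items and three pairs
```

In this toy data the triple `{a, b, c}` occurs only twice, so it is dropped even though all of its pairs are frequent.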

Has QUIT--Anony-Mousse
0

Actually, it all depends on your data. For some real datasets, I had to set the support threshold lower than 0.0002 to get any results. For other datasets, I used 0.9.
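A quick way to see what threshold a given dataset can bear is to look at the single-item supports before running the full algorithm. The following is a rough sketch with hypothetical helper names and toy data, not part of any library.

```python
from collections import Counter

def item_supports(transactions):
    """Relative support of each single item (fraction of transactions)."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item once per transaction
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}

def surviving_fraction(supports, threshold):
    """Fraction of distinct items whose support meets the threshold."""
    kept = sum(1 for s in supports.values() if s >= threshold)
    return kept / len(supports)

# Toy data, purely for illustration.
transactions = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["bread", "milk", "eggs"],
    ["jam"],
]
supports = item_supports(transactions)
for threshold in (0.6, 0.4, 0.2):
    print(threshold, surviving_fraction(supports, threshold))
```

Scanning a few candidate thresholds this way is a single pass over the data and shows how many items would survive the first iteration at each one.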

By the way, there are variations of Apriori and FPGrowth that can consider multiple minimum supports at the same time, using a different threshold for different items, for example CFP-Growth or MIS-Apriori. There are also algorithms specialized for mining rare itemsets or rare association rules. If you are interested in this topic, you could check my software, which offers some of these algorithms: http://www.philippe-fournier-viger.com/spmf/
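As a rough illustration of the multiple-minimum-supports idea, the sketch below assigns each item its own threshold as a fraction of its observed support, with a global floor, and treats an itemset's threshold as the lowest one among its items. This follows the scheme commonly described for this family of algorithms; the function names and the parameters `beta` and `lowest_support` are assumptions and do not reflect the API of SPMF or any other tool.

```python
def item_thresholds(supports, beta, lowest_support):
    """Per-item minimum support: beta * support, but never below the floor."""
    return {item: max(beta * s, lowest_support) for item, s in supports.items()}

def itemset_threshold(itemset, thresholds):
    """An itemset must meet the lowest threshold among its items."""
    return min(thresholds[item] for item in itemset)

# Hypothetical single-item supports: one common item, one rare item.
supports = {"bread": 0.5, "caviar": 0.01}
thresholds = item_thresholds(supports, beta=0.5, lowest_support=0.02)

print(thresholds["bread"])   # 0.25
print(thresholds["caviar"])  # 0.02
print(itemset_threshold({"bread", "caviar"}, thresholds))  # 0.02
```

With per-item thresholds, rare items can still take part in rules without lowering the bar for the common ones.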

Phil