1

I am trying to do association mining on version history. My transaction data is in MySQL. Weka's Apriori algorithm requires an ARFF or CSV file in a specific format: there must be one column per item, and each value is TRUE or FALSE depending on whether the item occurs in the transaction. I am looking for a way to create this file using Weka's InstanceQuery. Also, what are the options if the transaction data is huge?
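To make the target layout concrete, here is a minimal sketch in plain Java of the pivot described above: one column per item, one row per transaction, TRUE or FALSE in each cell. It does not use InstanceQuery or JDBC; it assumes the transactions have already been loaded into memory as sets of item names. The class name `BooleanMatrixCsv`, the method `toCsv`, and the sample file names are hypothetical, and item names are assumed to contain no commas or quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class BooleanMatrixCsv {

    /** Turns a list of transactions (sets of item names) into CSV lines. */
    public static List<String> toCsv(List<Set<String>> transactions) {
        // Collect every distinct item; TreeSet gives a stable, sorted column order.
        Set<String> items = new TreeSet<>();
        for (Set<String> transaction : transactions) {
            items.addAll(transaction);
        }

        List<String> lines = new ArrayList<>();
        lines.add(String.join(",", items)); // header row: one column per item

        for (Set<String> transaction : transactions) {
            List<String> cells = new ArrayList<>();
            for (String item : items) {
                cells.add(transaction.contains(item) ? "TRUE" : "FALSE");
            }
            lines.add(String.join(",", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: files changed together in two commits.
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(new LinkedHashSet<>(Arrays.asList("A.java", "B.java")));
        transactions.add(new LinkedHashSet<>(Arrays.asList("B.java", "C.java")));

        for (String line : toCsv(transactions)) {
            System.out.println(line);
        }
    }
}
```

Note that this matrix grows with (number of transactions) × (number of distinct items), which is part of why the format becomes a problem for large histories.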

user1239080
  • 61
  • 2
  • 6

2 Answers

1

I can answer the second part: the options if the transaction data is huge. Weka is good software, but its Apriori implementation is horribly slow. I recommend the implementations at http://fimi.ua.ac.be/src/ (I used the first one in the list, from Ferenc Bodon).

Bodon's implementation uses a trie data structure instead of the hash tables that Weka uses. Because of this, I found in my work that Weka would take 3 days to finish something that Bodon's implementation could finish in less than an hour (yes, the difference is that large!).

Plus, Bodon's implementation uses a simple input format: one line for each transaction, with items separated by spaces.
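As a rough sketch of producing that input format from Java: the code below writes one line per transaction with items separated by spaces. It assumes the tool expects integer item ids (common for the FIMI datasets, but check the documentation of the implementation you pick), so it first maps each item name to an id. The class name `TransactionFormat`, its methods, and the sample file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

public class TransactionFormat {

    /** Assigns each distinct item name an integer id (1, 2, 3, ...) in first-seen order. */
    public static Map<String, Integer> buildItemIds(List<List<String>> transactions) {
        Map<String, Integer> ids = new LinkedHashMap<>();
        for (List<String> transaction : transactions) {
            for (String item : transaction) {
                ids.putIfAbsent(item, ids.size() + 1);
            }
        }
        return ids;
    }

    /** Renders each transaction as one line of space-separated item ids. */
    public static List<String> toLines(List<List<String>> transactions,
                                       Map<String, Integer> ids) {
        List<String> lines = new ArrayList<>();
        for (List<String> transaction : transactions) {
            StringJoiner joiner = new StringJoiner(" ");
            for (String item : transaction) {
                joiner.add(String.valueOf(ids.get(item)));
            }
            lines.add(joiner.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Hypothetical sample data: files changed together in two commits.
        List<List<String>> transactions = new ArrayList<>();
        transactions.add(Arrays.asList("A.java", "B.java"));
        transactions.add(Arrays.asList("B.java", "C.java"));

        Map<String, Integer> ids = buildItemIds(transactions);
        for (String line : toLines(transactions, ids)) {
            System.out.println(line);
        }
    }
}
```

Keep the id map around so that the mined itemsets can be translated back to item names afterwards. Unlike the TRUE/FALSE matrix, each line here only lists the items that are present, so the file size grows with the total number of item occurrences.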

mvarshney
  • 364
  • 1
  • 8
  • Yes, I tried Weka Apriori as well as Weka FP-Growth. For data with around 120 attributes and just 8 transactions, it gives a heap error. Increasing the heap size does not help much, since in the real world my input file has much more data. I was looking at R. Has anybody tried Apriori in R? I need to call it from Java, and I have seen RCaller for that. If anybody can comment on the performance of R's Apriori and the number of input transactions it can handle, it would be of great help. – user1239080 Apr 05 '13 at 18:18
  • Not surprised by Weka's behavior! I don't know about Apriori in R, but if your objective is to call it from Java, you can consider calling the C++ executable via Runtime.exec(). – mvarshney Apr 05 '13 at 23:58
0

If you want a fast Java implementation of FPGrowth or Apriori, have a look at my project SPMF. The FPGrowth implementation in SPMF beats Weka's implementation by up to two orders of magnitude on some datasets. For example, see this performance comparison:

http://www.philippe-fournier-viger.com/spmf/performance/chess_fpgrowth_spmf_vs_weka.png

This is the main project webpage:

http://www.philippe-fournier-viger.com/spmf/index.php

Moreover, note that SPMF offers more than 50 algorithms for itemset mining, association rule mining, sequential pattern mining, etc. The GUI version of SPMF also supports the ARFF format used by Weka.

Phil
  • 3,375
  • 3
  • 30
  • 46