
I have a large number (100-150) of small (approx 1 kbyte) datasets. We will call these the 'good' datasets. I also have a similar number of 'bad' datasets.

Now I'm looking for software (or perhaps algorithm(s)) to find rules for what constitutes a 'good' dataset versus a 'bad' dataset.

The important thing here is the software's ability to deal with the multiple datasets rather than just one large one.

Help much appreciated.
Paul.


2 Answers


This looks like a classification problem. Since you have many datasets labelled as "good" or "bad", you can train a classifier to predict whether a new dataset is good or bad.

Algorithms such as decision trees, k-nearest neighbors, SVMs, and neural networks are potential tools you could use.

However, you first need to determine which attributes (features) you will use to represent each dataset when training the classifier.
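
For illustration, here is a minimal sketch using scikit-learn [an assumption; the question doesn't name a library], where each dataset has already been reduced to a fixed-length numeric feature vector. The feature values below are hypothetical placeholders:

```python
# Minimal sketch: classify datasets from hand-chosen features.
# Assumes scikit-learn; the feature values below are hypothetical.
from sklearn.svm import SVC

# One row per labelled dataset, e.g. [size_bytes, num_fields, fill_ratio].
X_train = [
    [120.0, 3, 0.75],   # features computed from a 'good' dataset
    [980.0, 7, 0.10],   # features computed from a 'bad' dataset
]
y_train = ["good", "bad"]  # one label per row; you have 100-150 of each

clf = SVC()                # any of the classifiers above would do here
clf.fit(X_train, y_train)

# Predict the label of a previously unseen dataset from its features.
print(clf.predict([[150.0, 2, 0.68]]))   # -> ['good'] or ['bad']
```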

Phil

One common way to do it is to use the k-nearest neighbor (kNN) algorithm.

Extract features from your dataset; for example, if your dataset is text, a common way to extract features is the bag-of-words model.

Store the "training set", and when a new dataset [which is not labeled] arrives, find the k nearest neighbors to it [according to the extracted features]. Label the new dataset with the majority label among its k nearest neighbors [from the training set].
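
For example, a minimal sketch with scikit-learn [an assumption; any kNN implementation would work], using bag-of-words features over text datasets:

```python
# Minimal sketch: bag-of-words features + k-nearest-neighbor classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["alpha beta gamma", "beta beta delta"]  # toy stand-ins for your datasets
train_labels = ["good", "bad"]

vectorizer = CountVectorizer()              # bag of words: one column per distinct word
X_train = vectorizer.fit_transform(train_texts)

knn = KNeighborsClassifier(n_neighbors=1)   # k=1 for this toy data; tune k in practice
knn.fit(X_train, train_labels)

# A new, unlabeled dataset gets the majority label of its k nearest neighbors.
X_new = vectorizer.transform(["alpha gamma gamma"])
print(knn.predict(X_new))                   # -> ['good']
```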

Another common method is a decision tree. The pitfall with decision trees is making the decisions too specific, i.e. overfitting the training data. An existing algorithm which can be used to build a good tree [heuristically] is ID3.
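
A sketch of the tree route [again assuming scikit-learn; note its DecisionTreeClassifier implements an optimized CART rather than ID3, though the idea is similar]. Capping the tree depth is one simple way to keep the decisions from getting too specific:

```python
# Minimal sketch: a depth-limited decision tree over hypothetical features.
from sklearn.tree import DecisionTreeClassifier

X_train = [[120.0, 3], [980.0, 7], [130.0, 2], [900.0, 9]]  # hypothetical feature vectors
y_train = ["good", "bad", "good", "bad"]

tree = DecisionTreeClassifier(max_depth=2)  # shallow tree to avoid over-specific rules
tree.fit(X_train, y_train)

print(tree.predict([[140.0, 3]]))           # -> ['good']
```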

amit
    Basically, you can apply any classification method to this problem, including SVM, ANN, kNN, decision trees, naive Bayes, ... – alfa Mar 04 '12 at 18:19