
I have a large number (100-150) of small (approx 1 kbyte) datasets. We will call these the 'good' datasets. I also have a similar number of 'bad' datasets.

Now I'm looking for software (or perhaps algorithm(s)) to find rules for what constitutes a 'good' dataset versus a 'bad' dataset.

The important thing here is the software's ability to deal with the multiple datasets rather than just one large one.

Help much appreciated.
Paul.


2 Answers


This looks like a classification problem. Since you have many datasets labelled as "good" or "bad", you can train a classifier to predict whether a new dataset is good or bad.

Algorithms such as decision trees, k-nearest neighbors, SVMs, and neural networks are potential tools you could use.

However, you first need to determine which attributes (features) you will use to represent each dataset when training the classifier.
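
For illustration, here is a minimal sketch using scikit-learn [an assumption; the question doesn't name a library], where each dataset has already been reduced to a fixed-length numeric feature vector. The feature values below are hypothetical placeholders:

```python
# Minimal sketch: classify datasets from hand-chosen features.
# Assumes scikit-learn; the feature values below are hypothetical.
from sklearn.svm import SVC

# One row per labelled dataset, e.g. [size_bytes, num_fields, fill_ratio].
X_train = [
    [120.0, 3, 0.75],   # features computed from a 'good' dataset
    [980.0, 7, 0.10],   # features computed from a 'bad' dataset
]
y_train = ["good", "bad"]  # one label per row; you have 100-150 of each

clf = SVC()                # any of the classifiers above would do here
clf.fit(X_train, y_train)

# Predict the label of a previously unseen dataset from its features.
print(clf.predict([[150.0, 2, 0.68]]))   # -> ['good'] or ['bad']
```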

Phil

One common way to do it is to use the k-nearest neighbor (kNN) algorithm.

Extract features from your dataset; for example, if your dataset is text, a common way to extract features is the bag-of-words model.

Store the "training set", and when a new dataset [which is not labeled] arrives, find the k nearest neighbors to it [according to the extracted features]. Label the new dataset with the majority label among its k nearest neighbors [from the training set].
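
For example, a minimal sketch with scikit-learn [an assumption; any kNN implementation would work], using bag-of-words features over text datasets:

```python
# Minimal sketch: bag-of-words features + k-nearest-neighbor classification.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = ["alpha beta gamma", "beta beta delta"]  # toy stand-ins for your datasets
train_labels = ["good", "bad"]

vectorizer = CountVectorizer()              # bag of words: one column per distinct word
X_train = vectorizer.fit_transform(train_texts)

knn = KNeighborsClassifier(n_neighbors=1)   # k=1 for this toy data; tune k in practice
knn.fit(X_train, train_labels)

# A new, unlabeled dataset gets the majority label of its k nearest neighbors.
X_new = vectorizer.transform(["alpha gamma gamma"])
print(knn.predict(X_new))                   # -> ['good']
```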

Another common method is a decision tree. The pitfall with decision trees is making the decisions too specific, i.e. overfitting the training data. An existing algorithm which can be used to build a good tree [heuristically] is ID3.
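
A sketch of the tree route [again assuming scikit-learn; note its DecisionTreeClassifier implements an optimized CART rather than ID3, though the idea is similar]. Capping the tree depth is one simple way to keep the decisions from getting too specific:

```python
# Minimal sketch: a depth-limited decision tree over hypothetical features.
from sklearn.tree import DecisionTreeClassifier

X_train = [[120.0, 3], [980.0, 7], [130.0, 2], [900.0, 9]]  # hypothetical feature vectors
y_train = ["good", "bad", "good", "bad"]

tree = DecisionTreeClassifier(max_depth=2)  # shallow tree to avoid over-specific rules
tree.fit(X_train, y_train)

print(tree.predict([[140.0, 3]]))           # -> ['good']
```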

amit
    Basically, you can apply any classification method to this problem, including SVM, ANN, kNN, decision trees, naive Bayes, ... – alfa Mar 04 '12 at 18:19