I am currently building a classifier using C5.0. I have a dataset of 8000 entries, and each entry has its own ID number (1-8000). To test the performance of the classifier I have to make 5 sets of 10:90 (training data : test data) splits. Of course, no training case may appear again among the test cases, and duplicates cannot occur in either set.

To pick examples at random for the training data while making sure the same example cannot also be picked for the test data, I have developed a horribly slow method:

  • fill a file with numbers from 1-8000 on separate lines.

  • randomly pick a line number (from a range of 1-8000) and use the contents of the line as the id number of the training example.

  • write all unpicked numbers to a new file

  • decrement the range of the random number generator by 1

  • redo

Then all unpicked numbers are used as test data. It works, but it's slow. To speed things up I could use List::Util 'shuffle' to just 'randomly' shuffle an array of these numbers. But how random is 'shuffle'? It is essential that the same level of accuracy is maintained. Sorry about the essay, but does anyone know how 'shuffle' actually works? Any help at all would be great.

Sinan Ünür
B. Bowles
  • It's a lot easier, and better for reproducibility, to do a ten-fold cross validation by putting each element *i* in the set *i* mod *n*, where *n* is your number of folds. – Fred Foo Mar 02 '11 at 13:33
  • @Iarsmans Thanks for the reply. I thought of using 10-fold CV, but the results I am attempting to improve on were obtained with the 10:90 split method of testing, so in order to compare my work with the previous work I really have to follow that method. – B. Bowles Mar 02 '11 at 13:37

1 Answer

Here is the shuffle algorithm used in List::Util::PP:

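sub shuffle (@) {
  my @a=\(@_);          # references to each argument, so the values are not copied
  my $n;
  my $i=@_;             # number of elements not yet picked
  map {
    $n = rand($i--);    # random position among the unpicked elements, then shrink the range
    # yield the picked value, and move the last unpicked reference into its slot
    (${$a[$n]}, $a[$n] = $a[$i])[0];
  } @_;
}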

This looks like a Fisher-Yates shuffle, so every ordering of the list is equally likely as long as `rand` itself is uniform.
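If the `map` version is hard to read, the same idea can be written as a plain loop. This is only a sketch for comparison, not the List::Util source; the name `fisher_yates_shuffle` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of List::Util): returns a shuffled copy
# of its arguments using a loop-based Fisher-Yates shuffle.
sub fisher_yates_shuffle {
    my @items = @_;
    # Walk from the last index down to 1, swapping each slot with a
    # randomly chosen slot at or below it.
    for (my $i = $#items; $i > 0; $i--) {
        my $j = int rand($i + 1);          # integer in 0 .. $i
        @items[$i, $j] = @items[$j, $i];   # swap via array slice
    }
    return @items;
}

my @ids = fisher_yates_shuffle(1 .. 8000);
```

Each position is filled by a pick from the elements that have not been placed yet, which is the same thing your file-based method does, just in memory.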

Eric Strom
  • That's great, thanks! I wanted to know whether or not shuffling an array would give the same result every time it was called, i.e. for multiple tests you would have to call shuffle multiple times to get a different result. Thanks a lot! – B. Bowles Mar 02 '11 at 13:51
  • You should only need to shuffle it once to get the level of randomness you need. Just save the shuffled result in an array. You can also seed Perl's random number generator with the `srand` function if you need repeatability between script executions. – Eric Strom Mar 02 '11 at 13:55
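Putting the comments above together, the five 10:90 splits from the question could be generated roughly as follows. This is a minimal sketch under a few assumptions: the helper name `split_ids` and the seed value 42 are made up for illustration, each of the five sets gets its own fresh shuffle, and repeatability from `srand` is only expected on the same Perl build and List::Util version.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: shuffles the IDs once and cuts the result in two,
# so the training and test sets cannot share an ID or contain duplicates.
sub split_ids {
    my ($ids, $train_fraction) = @_;
    my @shuffled = shuffle(@$ids);
    my $n_train  = int(scalar(@shuffled) * $train_fraction + 0.5);
    my @train    = @shuffled[0 .. $n_train - 1];
    my @test     = @shuffled[$n_train .. $#shuffled];
    return (\@train, \@test);
}

srand(42);   # assumed seed; any fixed value should make reruns repeatable

my @ids    = (1 .. 8000);
my @splits = map { [ split_ids(\@ids, 0.10) ] } 1 .. 5;

for my $k (0 .. $#splits) {
    my ($train, $test) = @{ $splits[$k] };
    printf "split %d: %d training, %d test\n",
        $k + 1, scalar @$train, scalar @$test;
}
```

Because both sets are slices of one shuffled list, no extra bookkeeping of "unpicked" numbers is needed.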