
I have been exploring multi-pass training in VW:

  1. I ran vw as a regular run (1 pass) and then with 2 to 5 passes (deleting the cache between runs) on the same training data, and evaluated each model on a separate test file. The results showed that my metric worsened as the number of passes increased (overfitting).
  2. Then I examined how the results behave when the data is shuffled. Ideally I wanted to see how my metric behaves when the data is shuffled in each pass (so --passes 3 would shuffle after the first and second internal passes), but because my training file is big I only shuffled the training data randomly once before each run (again 2 to 5 passes, deleting the cache between runs). When I evaluated the models on the test file, my metric improved as the number of passes increased. (A shell sketch of both workflows follows this list.)
  3. I repeated this experiment several times on different datasets and always got improvements when using multiple passes + shuffle.
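
Roughly, this is what the runs looked like (a sketch only; the file names are placeholders and the computation of my metric from the predictions is omitted):

```
# Baseline: 1 to 5 passes over the data in its original order,
# deleting the cache between runs.
for P in 1 2 3 4 5; do
    rm -f train.vw.cache
    vw -d train.vw -c --passes $P -f model_${P}.vw --quiet
    vw -t -d test.vw -i model_${P}.vw -p preds_${P}.txt --quiet
done

# Variant: shuffle the training file once before each run.
for P in 2 3 4 5; do
    shuf train.vw > train_shuf.vw
    rm -f train_shuf.vw.cache
    vw -d train_shuf.vw -c --passes $P -f model_shuf_${P}.vw --quiet
    vw -t -d test.vw -i model_shuf_${P}.vw -p preds_shuf_${P}.txt --quiet
done
```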

I'm trying to understand the results I got. I don't understand why shuffling the file once before training improves the metric. Why doesn't it also overfit? How is it different from using the raw (unshuffled) dataset?

adi

1 Answer


VW uses online learning by default, where it is well known that the ordering of the training data matters (unlike in batch learning). Imagine a binary classification task where all negative training examples come before all positive examples: the final model is very likely to predict everything as positive. Thus shuffling the data is needed (and recommended).
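
As a toy illustration (data and file names are made up), you can compare a model trained on a label-sorted file with one trained on a shuffled file; in the vw input format the label is the first field, so a plain `sort` groups the classes together:

```
sort -k1,1 train.vw > train_sorted.vw     # all examples of one class first
shuf train.vw > train_shuffled.vw         # random order
vw -d train_sorted.vw   -f model_sorted.vw   --quiet
vw -d train_shuffled.vw -f model_shuffled.vw --quiet
vw -t -d test.vw -i model_sorted.vw   2>&1 | grep 'average loss'
vw -t -d test.vw -i model_shuffled.vw 2>&1 | grep 'average loss'
```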

There are mixed opinions (and empirical results) on whether shuffling after each epoch (pass) is needed when the data has already been shuffled once before the whole training. It should not hurt, but it costs some time. It should not be needed for huge datasets (and for really huge datasets a single pass already takes a lot of time, so usually you cannot afford more than three passes anyway).
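
If you do want to reshuffle after every epoch, one way to emulate it (a sketch only, and not necessarily the fastest approach, since it bypasses the cache) is to run single passes in a loop and carry the optimizer state over with `--save_resume`:

```
shuf train.vw > epoch_1.vw
vw -d epoch_1.vw -f model_1.vw --save_resume --quiet
for E in 2 3 4 5; do
    shuf train.vw > epoch_${E}.vw                  # fresh shuffle before each epoch
    vw -d epoch_${E}.vw -i model_$((E-1)).vw -f model_${E}.vw --save_resume --quiet
done
vw -t -d test.vw -i model_5.vw -p preds.txt --quiet
```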

Martin Popel
  • Thanks Martin for the answer, but I'm still confused. I'm trying to understand the logic behind building multi-pass training without internal shuffling between epochs. I'm sure the developers thought about the possible overfitting, and most of VW's users use big/huge datasets. So what am I missing here? – adi Jul 16 '20 at 09:56
  • Are you sure you get so much better results when shuffling after each pass (in addition to shuffling just once before the whole training)? I have seen only minor improvements, on the border of significance, from such additional shuffling. And as I wrote, it slows down the training. You can also try generating N copies of the training data, shuffling each copy separately and concatenating them. Then you can train with `vw --passes M`, which will effectively mean N*M passes over the original data (a rough shell sketch of this follows the comments). So you get a compromise between more shuffling and higher speed thanks to vw caching. – Martin Popel Jul 16 '20 at 14:09
  • Tried what you suggested, but my results got worse. Do you have any explanation for why, in your experience, such additional shuffling caused only minor improvements on the border of significance (let's say training time isn't a factor)? I anticipated that the shuffle and concat would "enrich" the data and bring higher metrics, but apparently I was mistaken. – adi Jul 20 '20 at 09:45
  • What exactly have you tried, i.e. which M and N? If you set M=1 and N=X=your original number of epochs with manual reshuffling after each epoch, you should get your original results (i.e. the best results, as you claim). If not, there is something strange (if you originally omitted `--save_resume`, it is possible though unlikely that it helped by resetting the momentum). Then you can try e.g. M=2 and N=X/2, and so on up to M=X and N=1 (no concatenation, no shuffling after any epoch), to see at which M the accuracy starts getting worse. – Martin Popel Jul 28 '20 at 11:26
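
For completeness, a rough sketch of the "N shuffled copies, M passes" compromise mentioned in the comments above (N, M and the file names are only examples):

```
N=3; M=2                                   # effectively N*M = 6 passes over the original data
rm -f train_xN.vw train_xN.vw.cache
for i in $(seq $N); do
    shuf train.vw >> train_xN.vw           # each copy shuffled independently
done
vw -d train_xN.vw -c --passes $M -f model.vw --quiet
vw -t -d test.vw -i model.vw -p preds.txt --quiet
```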