I have been exploring multiple passes in VW (Vowpal Wabbit):
- I ran VW for a regular run (1 pass) and for 2 to 5 passes (deleting the cache between runs) on the same training data, and evaluated each model on a separate test file. My metric worsened as the number of passes increased (overfitting).
- Then I examined how the results behave when the data is shuffled. Ideally I wanted to shuffle the data in each pass (with --passes 3 that would mean reshuffling after the first and second internal passes), but because my training file is big I only shuffled the training data randomly once before each run (again 2 to 5 passes, deleting the cache between runs). Evaluated on the same test file, my metric improved as the number of passes increased.
- I repeated this experiment several times on different datasets, and multiple passes plus shuffling always improved the metric.
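For reference, the shuffle-before-each-run step above amounts to something like the following (a minimal Python sketch; the function name, file names, and seed parameter are placeholders I made up, and it loads the whole file into memory, so a streaming/external shuffle would be needed for a file that doesn't fit):

```python
import random

def shuffle_vw_file(src_path, dst_path, seed=None):
    """Shuffle the example lines of a VW-format text file once.

    Sketch with placeholder names: reads every line, permutes them
    randomly, and writes the result to a new file.
    """
    rng = random.Random(seed)
    with open(src_path) as f:
        lines = f.readlines()
    rng.shuffle(lines)
    with open(dst_path, "w") as f:
        f.writelines(lines)
```

After writing the shuffled copy, I delete the cache and rerun VW on it with the desired number of passes.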
I'm trying to understand these results. Why does shuffling the file once before training improve the metric, and why doesn't it lead to overfitting as well? How is it different from training on the raw (unshuffled) dataset?
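To make the question concrete, here is a toy illustration of how example order alone can change what a single-pass online learner ends up with (plain Python logistic SGD, not VW's actual update rule; the data and all numbers are made up). When one class is grouped at the end of the file, the bias term ends up chasing the last-seen class; a shuffled order keeps it near zero:

```python
import math
import random

def sgd_logistic(data, lr=0.5):
    # One online pass of SGD on logistic loss, one feature plus a bias.
    w, b = 0.0, 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        g = p - y                 # logistic-loss gradient at this example
        w -= lr * g * x
        b -= lr * g
    return w, b

rng = random.Random(0)
# Weakly informative feature; labels grouped: all 0s first, then all 1s.
grouped = ([(rng.gauss(-0.3, 1.0), 0) for _ in range(500)]
           + [(rng.gauss(+0.3, 1.0), 1) for _ in range(500)])
shuffled = grouped[:]
rng.shuffle(shuffled)

w_g, b_g = sgd_logistic(grouped)
w_s, b_s = sgd_logistic(shuffled)
print(f"grouped order:  bias={b_g:+.2f}")   # bias dragged toward the last class
print(f"shuffled order: bias={b_s:+.2f}")   # bias stays close to zero
```

This is only my attempt to see why order matters for an online learner; I still don't see how it explains the overfitting-vs-improvement difference I observed across passes.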