The AutoML stops on a clock. I compared two AutoML runs predicting the same target, where one run used only a subset of the other's features. At a 3600-second runtime budget the fuller-featured model looked better; when I re-ran both with a 5000-second budget, the subset model looked better. They traded places, and that isn't supposed to happen.
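To make the setup concrete, the two runs look roughly like the sketch below. It's written against an H2O-style AutoML interface purely for illustration (that may not be the exact tool involved), and the file path, target column, and feature lists are placeholders.

```python
# Minimal sketch of the comparison, assuming an H2O-style AutoML API.
# The path, target name, and feature lists are placeholders.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")         # placeholder training data
target = "y"                                 # placeholder target column
full_features = [c for c in train.columns if c != target]
subset_features = full_features[:10]         # placeholder feature subset

for name, feats in [("full", full_features), ("subset", subset_features)]:
    # Re-run with max_runtime_secs=5000 to reproduce the rank flip.
    aml = H2OAutoML(max_runtime_secs=3600, seed=1)
    aml.train(x=feats, y=target, training_frame=train)
    print(name, aml.leaderboard.head(rows=5))
```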
I think it is convergence. Is there any way to track the convergence history of stacked ensemble learners, to determine whether they are relatively stable? We have that for parallel (bagged) and series (boosted) CART ensembles, so I don't see why a heterogeneous ensemble wouldn't admit the same kind of tracking.
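For comparison, this is the kind of convergence history I mean for a series (boosted) CART ensemble: held-out error as a function of boosting rounds, sketched here with scikit-learn's staged predictions on synthetic data purely for illustration.

```python
# Convergence history for a boosted CART ensemble: one validation-error
# point per added tree; the curve flattening out is the "stability" signal.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05, random_state=0)
gbm.fit(X_tr, y_tr)

# staged_predict yields predictions after each boosting round.
val_curve = [mean_squared_error(y_va, pred) for pred in gbm.staged_predict(X_va)]
print("best round:", int(np.argmin(val_curve)), "of", len(val_curve))
```

I'd like the analogous curve for the stacked ensemble itself, not just for its individual base learners.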
I have plenty of data, and with cross-validation on top of that, I'd like to rule out the explanation that the difference comes down to the random draws of the training vs. validation sets.
I'm running on relatively high-performance hardware, so I don't think the problem is simply "too short a runtime." My "all" model count is in the hundreds to a thousand, for what it's worth.