Calibration is a post-processing step that runs after the model finishes training. It therefore doesn't affect the leaderboard, and it has no effect on the training metrics either. It adds two more columns (the calibrated predictions) to the scored frame.
This article provides guidance on how to construct a calibration frame:
- Split the dataset into train and test sets.
- Split the train set into a model-training set and a calibration set.
It also says:
"The most important step is to create a separate dataset to perform calibration with."
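For concreteness, here's a minimal sketch of that two-level split using the h2o Python API. The file path and the binary target column `"y"` are placeholders, and the split ratios just follow the article's rough 10% guidance -- adjust all of these to your data:

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Placeholder dataset and binary target column -- substitute your own
data = h2o.import_file("my_data.csv")
data["y"] = data["y"].asfactor()

# 1) Split the full dataset into train and test
train_full, test = data.split_frame(ratios=[0.8], seed=42)

# 2) Split the train set into model-training and calibration parts
#    (roughly 10% of the training data reserved for calibration)
train, calib = train_full.split_frame(ratios=[0.9], seed=42)

# Train with calibration enabled; the calibration frame is used only to fit
# the small calibration model after the main model finishes
model = H2OGradientBoostingEstimator(
    calibrate_model=True,
    calibration_frame=calib,
    seed=42,
)
model.train(y="y", training_frame=train)

# Scoring the test set now yields extra columns with the calibrated probabilities
preds = model.predict(test)
```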
I think the calibration frame should be used only for calibration, and hence be distinct from the validation frame. The conservative answer is that they should be separate -- when you use a validation frame for early stopping or any internal model tuning (e.g. lambda search in H2O GLM), that validation frame becomes an extension of the "training data", so it's effectively off-limits at that point. However, you could try both versions, directly observe what the effect is (see the sketch at the end of this answer), and then make a decision. Here's some additional guidance from the article:
"How much data to use for calibration will depend on the amount of data you have available. The calibration model will generally only be fitting a small number of parameters (so you do not need a huge volume of data). I would aim for around 10% of your training data, but at a minimum of at least 50 examples."