I am exploring the BERT model and its distilled version, DistilBERT. I am reading Section 3 of the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter", which says that the number of layers in DistilBERT is reduced by a factor of 2.
I don't understand why the reduction is by a factor of 2 specifically. Could the number of layers be reduced by a different factor instead?
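To make the question concrete, here is a minimal sketch of how I picture the reduction, assuming a 12-layer teacher (as in BERT-base) and a student built by keeping one teacher layer out of every `factor`. The function name `select_teacher_layers` and the choice to start at layer index 0 are my own assumptions for illustration, not something taken from the paper or from any library.

```python
def select_teacher_layers(num_teacher_layers: int, factor: int) -> list[int]:
    """Return the indices of the teacher layers a student would keep
    when taking one layer out of every `factor` (hypothetical helper)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    # Assumption: keep layers 0, factor, 2*factor, ... of the teacher.
    return list(range(0, num_teacher_layers, factor))


if __name__ == "__main__":
    # 12 is the encoder depth of BERT-base; a factor of 2 gives the 6 layers of DistilBERT.
    for factor in (2, 3, 4):
        kept = select_teacher_layers(12, factor)
        print(f"factor={factor}: {len(kept)} layers kept, teacher indices {kept}")
```

With a factor of 2 this gives a 6-layer student, while factors of 3 and 4 would give 4 and 3 layers. So mechanically other factors look possible, and what I am asking is why 2 was chosen.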