I found a really helpful article explaining the differences more thoroughly and with examples: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1
Unfortunately, it doesn't tackle the 'samples' parameter and I did not experiment with multi-label classification yet, so I'm not able to answer question number 1. As for the others:
Where does this information come from? If I understood the differences correctly, micro is not the best indicator for an imbalanced dataset, but one of the worst since it does not include the proportions. As described in the article, micro-f1 equals accuracy which is a flawed indicator for imbalanced data.
For example: The classifier is supposed to identify cat pictures among thousands of random pictures, only 1% of the data set consists of cat pictures (imbalanced data set). Even if it does not identify a single cat picture, it has an accuracy / micro-f1-score of 99%, since 99% of the data was correctly identified as not cat pictures.
Trying to put it in a nutshell: Macro is simply the arithmetic mean of the individual scores, while weighted includes the individual sample sizes. I recommend the article for details, I can provide more examples if needed.
I know that the question is quite old, but I hope this helps someone.
Please correct me if I'm wrong. I've done some research, but am not an expert.