Need info about CLIP backbone models

I would like to know about the datasets that OpenAI used to train its CLIP (Contrastive Language-Image Pre-Training) framework, so I can select the one that most closely resembles my project's dataset. I've been searching for this information, but I can only find it for some of them (the ones used in the original paper).
The backbone models available for CLIP (as of March 2023) are listed below (see also the snippet after the list):
- RN50
- RN101
- RN50x4
- RN50x16
- RN50x64
- ViT-B/32
- ViT-B/16
- ViT-L/14
- ViT-L/14@336px
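For context, these are the names returned by the official openai/CLIP Python package. A minimal sketch (assuming that package is installed, e.g. via `pip install git+https://github.com/openai/CLIP.git`) showing how to list them and inspect one backbone's expected input resolution and preprocessing:

```python
import torch
import clip

# List the released backbone names (same names as the list above).
print(clip.available_models())
# ['RN50', 'RN101', 'RN50x4', 'RN50x16', 'RN50x64',
#  'ViT-B/32', 'ViT-B/16', 'ViT-L/14', 'ViT-L/14@336px']

# Load one backbone (downloads the checkpoint on first use) to inspect
# the image resolution and torchvision transforms it expects.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
print(model.visual.input_resolution)  # 224 for ViT-B/32
print(preprocess)                     # resize/crop/normalize pipeline for input images
```

This only tells you about the models' input expectations, not about the training data itself, which is what my question is about.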
Does anyone know the names of the datasets with which these models were trained? Or, at least, a brief description of their characteristics (number of classes, distribution among classes and super-classes <<e.g. Honda, Opel, Fiat => car>>, image properties...)? I do not want to download the same dataset, nor train or test with it.
Thanks for your help!