I built a CNN for multi-label classification, i.e. predicting multiple labels per image.
I noticed that ImageNet and many other datasets are structured as a set of examples per label: given a label, there is a list of images for that label (i.e. label -> list of images). Keras, which I'm using, also supports this structure: a folder per label, and inside each folder a list of images serving as examples of that label.
The problem I'm concerned about is that many images may actually contain multiple labels. For example, if I'm classifying general objects, a folder named 'Cars' will contain images of cars, but some of those images will also contain people (and may therefore hurt the results on the class 'People').
My first question: 1) Can this (i.e. a single label per image in the ground truth) reduce the potential accuracy of the network?
If so, I thought of instead creating a dataset of the form: image1, {list of its labels}; image2, {list of its labels}; etc.
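As a minimal sketch of what that per-image label list could look like in code (the file names, class names, and helper function below are hypothetical, purely illustrative): each image's label list is encoded as a multi-hot vector, which is the usual target format for multi-label training.

```python
import numpy as np

# Hypothetical label vocabulary (assumption, not from any real dataset).
classes = ["Cars", "People", "Trees"]
class_index = {name: i for i, name in enumerate(classes)}

# Proposed structure: image -> list of its labels.
dataset = [
    ("image1.jpg", ["Cars", "People"]),  # a car photo that also contains people
    ("image2.jpg", ["Trees"]),
]

def to_multi_hot(labels, class_index, num_classes):
    """Encode a list of labels as a multi-hot target vector."""
    vec = np.zeros(num_classes, dtype=np.float32)
    for label in labels:
        vec[class_index[label]] = 1.0
    return vec

# One target row per image; multiple 1s per row are allowed.
targets = np.stack(
    [to_multi_hot(labels, class_index, len(classes)) for _, labels in dataset]
)
print(targets)
# Row 0 marks both 'Cars' and 'People'; row 1 marks only 'Trees'.
```

With targets like these, the network's final layer would typically use an independent sigmoid per class with binary cross-entropy loss, rather than the softmax and categorical cross-entropy used for single-label classification.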
2) Will such a structure produce better results?
3) Is there a good academic paper about this?