I am looking for an implementation and/or papers of a diffusion model (or some other kind of network) that can add objects to an existing image based on a text description. For example, given an image of an empty room and the text "two people talking", it would generate an image of the same room with two people talking added to it.
Does something like that exist? Is there a pre-trained model that I could experiment with?