Text-to-image model

A text-to-image model is a machine learning model that takes a natural-language description as input and produces an image matching that description.

Such models began to be developed in the mid-2010s during the beginnings of the AI boom, as a result of advances in deep neural networks. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, Stability AI's Stable Diffusion, and Midjourney, began to approach the quality of real photographs and human-drawn art.

Text-to-image models generally combine a language model, which transforms the input text into a latent representation, and a generative image model, which produces an image conditioned on that representation. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.
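As an illustrative sketch rather than part of the original article, the snippet below shows how such a conditioned generation pipeline is commonly invoked in practice, using the open-source Hugging Face diffusers library with a publicly released Stable Diffusion checkpoint. The model identifier, prompt, and hardware assumptions (a CUDA-capable GPU) are examples chosen for illustration, not details from the source text.

```python
# Minimal sketch: generating an image from a text prompt with a pretrained
# latent diffusion model via the Hugging Face "diffusers" library.
# Assumes `pip install diffusers transformers torch` and a CUDA GPU;
# the checkpoint name below is an example, not prescribed by the article.
import torch
from diffusers import StableDiffusionPipeline

# The pipeline bundles a text encoder (the language model), a denoising
# U-Net (the generative image model), and a VAE decoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
).to("cuda")

# The prompt is encoded into a latent text representation, and the image
# model is conditioned on it while iteratively denoising random latents.
image = pipe("a watercolor painting of a lighthouse at dusk").images[0]
image.save("lighthouse.png")
```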
